# matroid_semibandits_in_sublinear_time__d25088b9.pdf

Matroid Semi-Bandits in Sublinear Time

Ruo-Chun Tzeng 1 Naoto Ohsaka 2 Kaito Ariu 2

We study the matroid semi-bandits problem, where at each round the learner plays a subset of K arms from a feasible set, and the goal is to maximize the expected cumulative linear rewards. Existing algorithms have per-round time complexity at least Ω(K), which becomes expensive when K is large. To address this computational issue, we propose Faster CUCB whose sampling rule takes time sublinear in K for common classes of matroids: O(D polylog (K) polylog (T)) for uniform matroids, partition matroids, and graphical matroids, and O(D

Kpolylog (T)) for transversal matroids. Here, D is the maximum number of elements in any feasible subset of arms, and T is the horizon. Our technique is based on dynamic maintenance of an approximate maximum-weight basis over inner-product weights. Although the introduction of an approximate maximum-weight basis presents a challenge in regret analysis, we can still guarantee an upper bound on regret as tight as CUCB in the sense that it matches the gap-dependent lower bound by Kveton et al. (2014a) asymptotically.

1. Introduction

Matroid semi-bandits model many real-world tasks. An instance of matroid semi-bandit is described by ([K], X, µ), where [K] {1, , K} is the ground set, each k [K] is associated with a probability distribution νk with mean µk, and X {0, 1}K is the set of bases of a given matroid M = ([K], I) of rank D. At each round t [T], the learner pulls an action x(t) X and observes a semi-bandit feedback, i.e., yk(t) νk iff xk(t) = 1. This formulation can be used to model online advertisting and news selection (Kale et al., 2010) with M as a uniform matroid. Ad placement

Work done during an internship at Cyber Agent. 1EECS, KTH Royal Institue of Technology, Sweden 2AI Lab, Cyber Agent, Japan. Correspondence to: Ruo-Chun Tzeng <rubys88684@gmail.com>.

Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

(Bubeck et al., 2013; Streeter et al., 2009) and diversified recommendation (Abbassi et al., 2013) can be modeled with M as a partition matroid. Network routing (Kveton et al., 2014a) can be modeled with M as a graphical matroid. Task assignment (Chen et al., 2016) can be modeled with M as a transversal matroid.

Popular algorithms include Combinatorial Upper Confidence Bound (CUCB) (Gai et al., 2012; Chen et al., 2013; Kveton et al., 2014a; 2015), Combinatorial Thompson Sampling (CTS) (Wang and Chen, 2018; Kong et al., 2021; Perrault, 2022), and the instance-specifically optimal algorithm KL-based Efficient Sampling for Matroids (KL-OSM) (Talebi and Proutiere, 2016). All of these algorithms rely on a greedy algorithm (see Algorithm 1) to determine the action to be pulled. The greedy algorithm takes time at least Ω(K) and at most O(K(log K + Tmember)), where Tmember is the time taken to determine whether x + ek I for some (x, k) I [K], and ek is the k-th canonical unit vector. However, when the number K of arms is large, performing the greedy algorithm at each round can become expensive. There is a need to develop a matroid semi-bandit algorithm with per-round time complexity sublinear in K.

In this work, we present Faster CUCB (Algorithm 5), the first sublinear-time algorithm for matroid semi-bandit. The design of Faster CUCB is based on CUCB, but with a much faster sampling rule which takes time sublinear in K for many classes of matroids. For uniform matroids, partition matroids, and graphical matroids, it has per-round time complexity of O(D polylog (K) polylog (T)), which is optimal up to a polylogarithmic factor as compared to the trivial lower bound of Ω(D). For transversal matroids, the perround time complexity is O(D

K polylog (T)), which is still sublinear in K when D = O(K 1 2 ϵ) for any ϵ > 0. Faster CUCB trades the accuracy for computational efficiency. In other words, the action computed by the sampling rule of Faster CUCB is an approximation to the optimal solution computed by the sampling rule of CUCB. This introduces difficulty in the regret analysis because prior analysis of CUCB (Kveton et al., 2014a) requires the exact solution. What is interesting is that we can still guarantee the same regret upper bound asymptotically as prior analysis of CUCB.

To develop a sublinear-time sampling rule, we present a dynamic algorithm for maintaining maximum-weight base

Matroid Semi-Bandits in Sublinear Time

CUCB Faster CUCB

Per-round Time Complexity O(K(log K + Tmember)) O(D polylog (T) Tupdate(A))

Uniform Matroid O(K log K) O(D log K polylog (T)) Partition Matroid O(K log K) O(D log K polylog (T)) Graphical Matroid O(K log K) O(D polylog (K) polylog (T)) Transversal Matroid O(K(log K + DK)) O(D

K polylog (T))

Table 1. Per-round time complexity of CUCB (Kveton et al., 2014a) and Faster CUCB (Algorithm 5) for different classes of matroids. K is the number of arms and D is the maximum number of elements in any action in X. Tmember for different matroids is discussed in Appendix C. Tupdate(A) for different matroids is discussed in Section 3.

over inner product weights (Section 4). There have been many sublinear-time algorithms for dynamic maximumweight base maintenance (see Section 3), which, however, may not be directly used in Faster CUCB because all arm weights representing the UCB index can change simultaneously at each round. Our insight for addressing this issue is that the UCB index of each arm k at round t can be decomposed into an inner product of the following twodimensional vectors: (1) a feature, which depends on k and is supposedly a pair of the empirical reward estimate and radius of confidence interval, and (2) a query, which depends only on round t. Our proposed dynamic algorithm consists of two speeding-up techniques. One is feature rounding, which rounds each feature into a few bins so as to reduce the number of distinct features to consider. The other is the minimum hitting set technique, which allows us to compute a small number of queries in advance and correctly identify an (approximate) maximum-weight base for any query.

Sections are organized as follows. We introduce matroid semi-bandits and basic concepts in Section 2. We review relevant literature in Section 3. We develop a dynamic algorithm for maintaining a maximum-weight base over inner product weights in Section 4. We propose Faster CUCB based on the algorithms developed in Section 4 and analyzed its regret and time complexity in Section 5.

2. Preliminaries

We use [n] to denote the set {1, , n}. We use i (µ) to denote any element in argmaxx X µ, x , and when it is clear from the context, we drop µ from i (µ) and write i . We use supp ( ) to denote the support set of a given vector. We use ek to denote the vector with 1 only on the k-th row and 0 s elsewhere, and use 0K to denote a K-dimensional vector with 0 on every row. We use log with base e. See Appendix A for a table of notation.

Matroid. A matroid is described by M ([K], I), where [K] is called the ground set and I {0, 1}K is the set of independent sets satisfying (i) hereditary property, i.e., if supp (y) supp (x) and x I, then y I; and

(ii) augmentation property, i.e., if x, y I and supp (y) supp (x), then there exists j supp (x) \ supp (y) such that y + ej I. We said x I is a basis if supp (x) is maximal, i.e., there does not exist y I such that supp (x) supp (y). All bases have the same cardinality, which is called the rank of the matroid. For v RK + , a maximum-weight basis i (v) argmaxx X PK k=1 vkxk can be found by a greedy algorithm (Algorithm 1) in O(K(log K + Tmember)) time, where Tmember is the time taken for the membership oracle to determine whether x + ek I for some x I and some k [K]\ supp (x).

Algorithm 1 A greedy maximum-weight basis algorithm

Input: v RK and the bases X {0, 1}K. Sort v in non-increasing order: vγ(1) vγ(K); x = 0K; i = 1; while x 0 < D do

if x + eγ(i) I then

x x + eγ(i) i i + 1; end

Matroid semi-bandits. An instance of matroid semibandit is described by ([K], X, µ), where [K] is the ground set, X {0, 1}K is the set of bases of the given matroid M ([K], I) of rank D, and I {0, 1}K is the set of independent sets. Each k [K] is associated with a distribution νk with mean µk. The learner knows the matroid M, and aims to learn the best action i (µ) argmaxx X µ, x by playing a game with the environment: At each round t N, the learner plays an action x(t), and the environment draws a noisy reward yk(t) νk for each arm k [K] and reveals yk(t) to the learner iff k supp (x(t)). We assume arms rewards are bounded:

Assumption 2.1. Assume the support of each arm νk is a subset of [a, b] and 0 < a < b < .

The performance is measured by expected regret:

R(T) T µ, i (µ) E

t=1 µ, x(t)

Matroid Semi-Bandits in Sublinear Time

which is the difference between the expected cumulative reward of the learner and that of an algorithm who knows µ and always selects the best action i (µ).

Common classes of matroids. Refer to Chapter 39 (Schrijver, 2003) or Chapter 1 (Oxley, 2011) for more details.

A uniform matroid ([K], I) of rank D has independent sets I = {S [K] : |S| D} and the bases X consist of subsets whose cardinalities are exactly D.

A partition matroid ([K], I) of rank D is given a partition S1, , SD of the ground set [K], the independent sets I = {S : |S Si| 1, i [D]}, and the bases X are subsets that choose exactly one element from each of the D sets.

A graphical matroid is given a graph G = (V, E) with K edges, the bases X consist of all spanning forests in G, and the rank D is |V | minus the number of connected components in G.

A transversal matroid is given a bipartite graph G = ([K] V, E) with a bipartition ([K], V ), |V | K, the independent sets I consist of S [K] such that there is a matching of S to |S| vertices in V , and X is the set of endpoints in [K] of all maximum matchings in G.

We discuss the query time Tmember of membership oracle in Appendix C. For more examples on semi-bandits under different matroid constraints, we refer the readers to (Kveton et al., 2014a) for linear matroids, and (Kveton et al., 2014b) for polymatroid semi-bandits.

CUCB. The action selected by CUCB (Gai et al., 2012; Chen et al., 2013; Kveton et al., 2014b) at round t N is:

x(t) argmax x X

ˆµk(t 1) + λt p

(1) where ˆµk(t) 1 Nk(t) Pt s=1 yk(s)1{xk(s) = 1} is the em-

pirical reward estimate, Nk(t) Pt s=1 1{xk(s) = 1} is the number of pulls of arm k, and λt > 0 controls the confidence width. The value ˆµk(t 1) + λt

Nk(t 1) is called

the UCB index of arm k. In (Kveton et al., 2014a), Eq. (1) is solved by a O(K(log K + Tmember))-time greedy algorithm shown in Algorithm 1. In Section 4, we will develop a faster algorithm for solving Eq. (1) with the following reformulation:

x(t) argmax x X

k=1 f k, q xk,

where f k = (ˆµk(t 1), 1

Nk(t 1)) and q = (1, λt).

3. Related Works

Semi-bandits and sublinear-time bandits. We provide an extensive survey on related bandit literature in Appendix B. To summarize here, for semi-bandit algorithms, CUCB (Gai et al., 2012; Chen et al., 2013; Kveton et al., 2014a), CTS (Wang and Chen, 2018; Kong et al., 2021; Perrault, 2022) and KL-OSM (Talebi and Proutiere, 2016) all rely on a O(K(log K + Tmember))-time greedy algorithm to compute the action to be pulled. In contrast, our Faster CUCB, as far as we know, is the first semi-bandit algorithm having per-round time complexity of o(K). For linear bandits, there exist several works (Jun et al., 2017; Yang et al., 2022) on reducing the per-round complexity to be sublinear in the number of arms. But, their results transferred to our setting are worse than what we have obtained both in terms of regret bound and the time complexity (see the discussion in Appendix B).

Dynamic maintenance of maximum-weight base of a matroid Here, we review existing dynamic algorithms for maintaining a maximum-weight base of a matroid. In a standard (fully-)dynamic setting, we are given a weighted matroid M = ([K], I), where each arm s weight dynamically changes over time in an online manner. The objective is to maintain any (exact or approximate) maximum-weight basis of M over up-to-date arm weights as efficiently as possible. We use Tupdate(A) to denote the time complexity of a dynamic algorithm A required for updating an (approximate) maximum-weight base according to the change of a single arm weight. The best-known bound of Tupdate(A) for each matroid class is summarized as follows: For graphic matroids, a maximum-weight basis can be updated in O(

K) worst-case time (Frederickson, 1985; Eppstein et al., 1997) and in O(polylog (K)) amortized time (Holm et al., 2001). For laminar matroids (which include uniform and partition matroids as special cases), the worst-case time complexity for exact dynamic algorithms is bounded by O(log K) (Henzinger et al., 2023). For transversal matroids, a O(K1.407)- time exact dynamic algorithm is known (van den Brand et al., 2019), while a (1 + η)-approximation dynamic algorithm runs in O(η 2

K) time (Gupta and Peng, 2013). We can safely assume that, after updating multiple arm weights, A returns an (approximate) maximum-weight basis in O(D) time, where D is the rank of a matroid.

4. Dynamic Maintenance of Maximum-weight Base over Inner Product Weight

In this section, we develop a sublinear-time sampling rule, which is used as a subroutine in Faster CUCB. Recall that any static algorithm that solves the linear maximization of Eq. (1) from scratch requires at least Ω(K) time, which is computationally expensive. To circumvent this issue,

Matroid Semi-Bandits in Sublinear Time

we present a dynamic algorithm for maintaining an (approximate) maximum-weight base of a matroid where arm weights change over time. The next subsection begins with formalizing the problem setting.

4.1. Problem Setting and Technical Result

Consider the following problem setting: Let M = ([K], I) be a matroid of rank D over K arms, and X be the set of its bases. Each arm k [K] has a (nonnegative) twodimensional vector f k = (αk, βk) R2 + referred to as a feature, which may change as time goes by. Given a (nonnegative) two-dimensional vector q R2 + as a query, we are required to find any (approximate) maximum-weight base of M, where arm k s weight is given by projecting its feature onto q, i.e., f k, q .

Observe that in the matroid semi-bandit setting, each arm k s feature f k = (αk, βk) corresponds to a pair of the empirical reward estimate αk = ˆµk(t 1) and radius βk = 1

Nk(t 1) of confidence interval, and a query is q = (1, λt) at round t, both of which change over rounds.

Hereafter, we make the following two assumptions.

Assumption 4.1. Lower and upper bounds, denoted by αlb and αub (resp. βlb and βub), on the possible positive values of αk s (resp. βk s) at anytime are known; namely, it always holds that αk {0} [αlb, αub] and βk {0} [βlb, βub]. The precise values of these bounds will be discussed in Section 5.

Assumption 4.2. There exists a dynamic algorithm A for maintaining a (1 + η)-approximate maximum-weight base of M with arm weights changing over time, where parameter η (0, 1) specifies the approximation guarantee. Denote by Tinit(A; η) and Tupdate(A; η) the time complexity of A required for initializing the data structure and updating a single arm s weight, respectively. We can safely assume that after updating multiple arm weights, A returns a maximumweight base in O(D) time. See Section 3 for existing implementations.

Our dynamic algorithm is parameterized by a precision parameter ϵ (0, 1), and consists of the following three procedures:

INITIALIZE: Given lower and upper bounds [αlb, αub] and [βlb, βub] as in Assumption 4.1, K features (αk, βk)k [K], a matroid M = ([K], I), a dynamic algorithm A for maximum-weight base maintenance as in Assumption 4.2, and a precision parameter ϵ, this procedure initializes the data structure used in the remaining two procedures.

FIND-BASE: Given a query q, this procedure is supposed to return a (1 + ϵ)-approximate maximum-weight base

of M, where arm k s weight is defined as f k, q for the up-to-date k s feature f k.

UPDATE-FEATURE: Given an arm k and a new feature f k, this procedure reflects the change of arm k s feature on the data structure.

Remark 4.3. Our problem setting is different from a canonical setting of dynamic maximum-weight base maintenance in a sense that the arm weights are revealed when a query is issued in FIND-BASE. Consequently, existing dynamic algorithms may not be used directly.

The technical result in this section is stated below.

Theorem 4.4 ( ). There exist implementations of INI-

TIALIZE, FIND-BASE, and UPDATE-FEATURE such that the following are satisfied: FIND-BASE always returns a (1 + ϵ)-approximate maximum-weight base of a matroid M with arm k s weight defined as f k, q for an up-to-date k s feature f k and a query q. Moreover, INITIALIZE runs in O(K + poly (W) Tinit(A; ϵ

3)) time, FIND-BASE runs in O( poly (W) + D) time, and UPDATE-FEATURE runs in O( poly (W) Tupdate(A; ϵ

3)) time, where

W = O ϵ 1 log αub

Remark 4.5. The proof of Theorem 4.4 can be easily adapted to the case when (the update procedure of) dynamic algorithm A has only amortized complexity. In such case, Theorem 4.4 holds in the amortized sense rather than the worst-case sense.

The remainder of this section is organized as follows: In Section 4.2, we apply a rounding technique to arm features to reduce the number of distinct features to consider, in Section 4.3, we investigate the representability of permutations induced by inner product weights to deal with multiple queries efficiently, and Section 4.4 finally develops our dynamic algorithm for maximum-weight base maintenance. All proofs of the lemmas appearing in this section are deferred to Appendix D.

4.2. Rounding Arm Features

Here, we apply a rounding technique to arm features so as to reduce the number of distinct features to consider. Hereafter, let η ϵ

3, so that (1 + η)2 1 + ϵ for ϵ (0, 1). Define

W max log1+η

W { } [W] = { , 1, 2, 3, . . . , W}. (3)

Since any features are within ({0} [αlb, αub]) ({0} [βlb, βub]) as guaranteed by Assumption 4.2, we shall partition the possible region of the features into |W|2 bins. For

Matroid Semi-Bandits in Sublinear Time

each q, r W, define BINq,r R2 + as

BINq,r (αlb(1 + η)q 1, αlb(1 + η)q]

(βlb(1 + η)r 1, βlb(1 + η)r], (4)

where (αlb(1 + η) , αlb(1 + η) ] {0}, (5)

(βlb(1 + η) , βlb(1 + η) ] {0}. (6)

Note that these bins are pairwise disjoint, and that q,r WBINq,r covers ({0} [αlb, αub]) ({0} [βlb, βub]); i.e., any possible feature belongs to a unique BINq,r. For each q, r W, let domq,r R2 + denote the unique dominating point of BINq,r; namely,

domq,r (αlb(1 + η)q, βlb(1 + η)r). (7)

For any feature f k = (αk, βk), we will use dom(f k) = dom(αk, βk) to denote the dominating point domq,r such that f k BINq,r. See Figure 1 in Appendix D for illustration of BINq,r s, domq,r s, and dom(f k) s.

Observe easily that for any feature f k R2 + and query q R2 +,

1 1 + η dom(f k), q < f k, q dom(f k), q . (8)

By Eq. (8), we can replace each arm s feature by its dominating point without much deteriorating the quality of the (approximate) maximum-weight base, as shown below.

Lemma 4.6 ( ). Let f 1, . . . , f K ({0} [αlb, αub]) ({0} [βlb, βub]) be K features, q R2 + be a query, and x dom be a (1 + η)-approximate maximum-weight base of matroid M with arm k s weight defined as dom(f k), q . Then, for any base x of M, it holds that

k supp(x dom) f k, q 1 1 + ϵ X

k supp(x) f k, q . (9)

In particular, x dom is a (1 + ϵ)-approximate maximumweight base with arm k s weight defined as f k, q .

4.3. Handling Multiple Queries

From weighting to permutation. Now we deal with multiple queries. Our idea is that, if two queries q1, q2 R2 + are very close to each other, then they should derive the common maximum-weight base (provided that features are fixed). This intuition can be justified with respect to orderings of arms. For two total orders and over [K], we say that is consistent with if k k implies k k

for any k = k . The following fact is easy to confirm:

Lemma 4.7 ( ). Let w = (w1, . . . , w K) RK + be K arm weights and be a total order over [K] such that k k if and only if wk wk . Let be a total order over [K] that

is consistent with . If x is a base of matroid M obtained by running the greedy algorithm over any ordering of [K] consistent with , it is a maximum-weight base of M over arm k s weight wk; namely, for any base x of M,

x , w x, w . (10)

Lemma 4.7 implies that any maximum-weight base can be obtained by running the greedy algorithm over some total order ; moreover, we can safely assume that is strict (i.e., k k or k k for all k = k ), or equivalently, a permutation over [K]. Our strategy for dealing with multiple queries is: (1) we enumerate all possible permutations in advance, and (2) we guess a permutation consistent with the arm weights determined based on a query. To this end, the following question arises: What kind of and how many permutations are representable given a fixed set of features?

Characterizing representable permutations. To answer the above question, we characterize representable permutations. Hereafter, let SK denote the set of all permutations over [K], and f 1, . . . , f K R2 + be any fixed, distinct K features. We say that a query q R2 over f 1, . . . , f K represents a permutation π SK if f π(i), q > f π(j), q for all 1 i < j K,1 and that π is representable if such q exists.

For a permutation π SK to be representable, we wish for some query q R2 to ensure that for any i < j, arm π(i) s weight is (strictly) higher than arm π(j) s weight. This requirement is equivalent to f π(i) f π(j), q > 0; thus, if the following system of linear inequalities is feasible, any of its solutions q represents π:

f π(i) f π(j), q > 0 for all 1 i < j K. (11)

Observe now that the above system is feasible if and only if the intersection of Pi,j for all i < j is nonempty, where Pi,j is an open half-plane defined as

Pi,j {q R2 : f π(i) f π(j), q > 0}. (12)

Because each Pi,j is obtained by dividing R2 by a unique line that is orthogonal to line f π(i)f π(j) and intersects the origin 0, the set of feasible solutions for Eq. (11) is equal to (the interior of) a polyhedral cone defined by the boundaries of a particular pair of Pi,j s.

Here, we characterize the representable permutations by the concept of arrangement of lines. Let L {l1, . . . , l( K 2)}

denote K 2 lines, each of which is orthogonal to line f kf k for some k = k and intersects 0. Given such L, a cell C

1This definition does not allow ties ; i.e., no pair k = k

satisfies f k, q = f k , q .

Matroid Semi-Bandits in Sublinear Time

in arrangement of L is defined as a maximum connected region of R2 that does not intersect with L (which is the interior of a polyhedral cone). Then, for each cell C, every query q in C represents the same permutation πC SK depending only on C; namely, there is a bijection between the representable permutations and the cells in arrangement of L. See Figure 2 in Appendix D for illustration.

With this connection in mind, we demonstrate that reserving a single vector for each cell suffices to cover all representable permutations. A minimum hitting set of the cells in arrangement of L is defined as any minimum set H of vectors in R2 such that H and each cell have a non-empty intersection.

Lemma 4.8 ( ). Let H be a minimum hitting set of the cells in arrangement of L. Then, for any query q R2, there exists a vector h H such that for any k = k ,

f k, h > f k , h = f k, q f k , q . (13)

Lemma 4.8 along with Lemma 4.7 ensure that for any query q R2, there is a vector h in H such that a maximumweight base with arm k s weight f k, h is a maximumweight base with arm k s weight f k, q .

Generating a minimum hitting set. Subsequently, we generate a minimum hitting set. One may think that it requires exponentially long time because the number of permutations in SK is K!. However, it turns out that the number of cells in arrangement of L is O(K2) (i.e., so is the number of representable permutations), and a minimum hitting set can be constructed in poly (K) time.

Lemma 4.9 ( ). The number of cells in arrangement of L is at most O(K2). Moreover, we can generate a minimum hitting set in poly (K) time (by using Algorithm 6 in Appendix D).

4.4. Putting It All Together: Algorithm Description and Complexity

We are now ready to implement the three procedures. We here stress that applying either of the feature rounding or minimum hitting set technique separately does not make sense: On one hand, if we only apply feature rounding, we would have to recompute each arm s weight every time a query is issued, which is expensive. On the other hand, if the minimum hitting set technique is only applied (to raw features f k s), then a minimum hitting set H cannot be constructed in advance due to a dynamic nature of features, and its size would be O(K2), which is prohibitive.

By applying both techniques, (1) we know a priori the set of possible dominating points, whose size O(W 2) depends only on W; moreover, (2) we can create a minimum hitting set H of size O(W 4) beforehand at initialization.

Pseudocodes of INITIALIZE, FIND-BASE, and UPDATEFEATURE are described in Algorithms 2 to 4, respectively. The proof of Theorem 4.4 follows from Lemmas 4.6 to 4.9, whose details are deferred to Appendix D.

In INITIALIZE, we construct a hitting set H of domq,r s and 1 1+η domq,r s rather than solely of domq,r s, which incurs a constant-factor blowup in the time complexity. Though this change is not needed in the proof of Theorem 4.4, the following immediate corollary of Lemma 4.8 is crucial in the regret analysis of Section 5.

Corollary 4.10 ( ). Let H be a minimum hitting set constructed in Algorithm 2. Then, for any query q R2, there exists a vector h H such that for any dom = domq,r and dom = domq ,r with q, r, q , r W,

dom, h > dom , h = dom, q dom , q ,

dom, h > dom , h

1 + η = dom, q dom , q

Algorithm 2 INITIALIZE. Input: lower and upper bounds [αlb, αub] and [βlb, βub]; K features (f k)k [K]; precision parameter ϵ (0, 1). Define W by Eq. (3); for each q, r W do

Define BINq,r and domq,r by Eqs. (4) and (7); end Define η ϵ

3; Construct a minimum hitting set H of size O(W 4) for domq,r s and 1 1+η domq,r s by Lemma 4.9; for each h H do

Create an instance Ah of dynamic maximum-weight base algorithm with precision parameter η ϵ

3, M, and arm k s weight dom(f k), h ; end

Algorithm 3 FIND-BASE.

Input: query q R2 +. Find h H s.t. q and h belong to (the closure of) the same cell in arrangement of V; Call Ah and return the maximum-weight base x of M with arm k s weight dom(f k), h ;

Algorithm 4 UPDATE-FEATURE.

Input: arm k [K]; new feature f k R2 +. for each h H do

Change arm k s weight stored in Ah to dom(f k), h ; end

5. Our Proposed Algorithm: Faster CUCB

In this section, we present Faster CUCB in Algorithm 5, which uses procedures introduced in Section 4.

Matroid Semi-Bandits in Sublinear Time

The purpose of initialization procedure is to ensure each arm is pulled at least once. It takes at most K rounds, and in each round, the computation of i (ek) takes O(K Tmember) time because the permutation γ in Algorithm 1 can be specified explicitly as γ(j) = k if j = 1, γ(j) = j 1 if 2 j < k, and γ(j) = j + 1 if j k. So, it only require to compute at most K membership tests. After the initialization, the computation of each round t consists of one call to FIND-BASE for computing the action x(t), the update of ˆµk(t) and Nk(t) for each k supp (x(t)), and one call to UPDATE-FEATURE for updating the feature of each arm k supp (x(t)) stored in the instances of the dynamic maxium-weight base algorithm.

Algorithm 5 Faster CUCB Input: the total number of rounds T, λt, and m N Initialization:

t = 0; while k [K] such that Nk(t) = 0 do

Pull i (ek); t = t + 1; end

INITIALIZE a, b, 1

T , 1, (ˆµk(t), Nk(t))k [K] , 1 logm T

while t < T do

x(t) FIND-BASE((1, λt)); Pull x(t) and receive yk(t) νk for each k supp (x(t)); for k supp (x(t)) do

Nk(t) Nk(t 1) + 1; ˆµk(t) t 1

t ˆµk(t 1) + 1

UPDATE-FEATURE k, µk(t), 1

end t = t + 1; end

5.1. Per-round Time Complexity

By Theorem 4.4, one call to FIND-BASE takes O ( poly (W) + D) and D calls to UPDATE-FEATURE take O(D poly (W) Tupdate(A; ϵ

W = O logm T log b

T = O logm+1 T ,

the per-round time complexity of Algorithm 5 is

O D polylog (T) Tupdate A; ϵ

. Here, we will set ϵ = 1 logm T for the regret analysis.

5.2. Regret Upper Bound

Notation. Fix µ Λ and i argmaxx X µ, x . We introduce a few notation. Let {j}D j=1 be the permutation of supp (i ) such that µ1 µD. Define j,k µj µk and dk max{j [D] : j,k > 0} for j supp (i ) and k / supp (i ), and min mink/ supp(i ) dk,k.

Theorem 5.1. Let λt = p

1.5(b a)2 log t and m N. Define T0 max{K, exp(( b min ) 1 m )}. For T N, the expected regret of Algorithm 5 is upper bounded by

k/ supp(i )

j=1 j,k T0 + 12 dk,k(b a)2 log T

µdk 1+log m T µk 2

k/ supp(i )

As a consequence of Theorem 5.1, setting T yields:

lim T R(T) log T X

k/ supp(i )

which matches Theorem 4 in (Kveton et al., 2014a), lim inf T R(T ) log T = Ω( K D

min ), asymptotically up to a constant factor. Note that Faster CUCB is faster than CUCB when min = Ω( 1 polylog(K)) and when T = poly (K). Also, similar to (Cuvelier et al., 2021b), our per-round time complexity also goes to infinity as T , one way to address this issue is to use CUCB when the per-round time complexity of ours is larger than that of CUCB.

Useful lemmas. Here we present two lemmas that will be used to show Theorem 5.1 in Section 5.3. First, inspired by Kveton et al. (2014a), we define a bijection gt from supp (i ) to supp (x(t)) with the following properties:

Lemma 5.2. There exists a bijection gt : supp (i ) supp (x(t)) such that (i) gt(j) = j for j supp (i ) supp (x(t)); (ii) for any j supp (i ) \ supp (x(t)),

xgt(j)(t) = 1 = D dom(f gt(j)), h E

dom(f j), h

1 + 1 3 logm T .

The proof of Lemma 5.2 is in Appendix E.1, where an explicit construction of gt is provided. Property (i) allows us to decompose the instantaneous regret i x(t), µ = P

k/ supp(i ) P

j supp(i ) j,k1{gt(j) = k}, and Property (ii), Lemma 5.2, allows us to derive a bound of PT t=1 1{gt(j) = k} that relates with UCB values.

Second, for technical reasons, we need the precision parameter ϵ = log m T to be small enough so that i,j and µi (1+ϵ)µj have the same sign. The below lemma (proved in Appendix E.2) gives the threshold to make it happen:

Lemma 5.3. Let ϵ < min

b . Then, for any i supp (i ) and any j / supp (i ), µi µj > 0 = µi 1+ϵ µj > 0.

5.3. Proof of Theorem 5.1

For T T0, R(T) is trivially bounded by T µ, i Db T0. In the following, we assume T > T0.

Matroid Semi-Bandits in Sublinear Time

As gt is a bijection from supp (i ) to supp (x(t)) and gt(j) = j for j supp (i ) supp (x(t)), we can rewrite

E[ i x(t), µ ] = X

k/ supp(i )

j supp(i ) j,k E[1{gt(j) = k}]

k/ supp(i )

j=1 j,k E 1 gt(j) = k

so that the expected regret is bounded from the above by:

k/ supp(i )

t=1 1 gt(j) = k #

k/ supp(i )

j=1 j,k (I)j,k + (II)j,k ,

(I)j,k = PT t=1 E h 1 n gt(j) = k, Nk(t) nj,k oi

(II)j,k = PT t=1 E h 1 n gt(j) = k, Nk(t) > nj,k oi

and nj,k = max 6(b a)2 log T ( µj 1+log m T µk)2 , T0

. The proof is

completed by bounding related terms of (I)j,k and (II)j,k by Lemma 5.4 (proved in Appendix E.2) and Lemma 5.5. Lemma 5.4. Let k / supp (i ) and j [dk]. For T > T0,

j=1 j,k(I)j,k

j=1 j,k T0 + 12(b a)2 dk,k log T

( µdk 1+log m T µk)2 .

Lemma 5.5. Let k / supp (i ) and j [dk]. For T > T0,

Proof sketch: See Appendix E.2 for the entire proof. Let ϵ 1 logm T . First, we claim:

gt(j) = k = uk(Nk(t 1), T) mins<t uj(s, t)

(14) where uk(s, t) = µk(s) + λt s and µk(t) = 1

t Pt s=1 yk(s).

Show Eq. (14): Observe that gt(j) = k implies: 1 + ϵ

f k, q dom(f k), q

D dom(f j), q E

where Eq. (8) is used in the first and the last inequality, and the second uses Lemma 5.2 and Corollary 4.10. By (1+ ϵ

3)2 1+ϵ and expanding f k = (ˆµk(t 1), 1

and q = (1, λt), we derive from (15) that:

uk(Nk(t 1), t) uj(Nj(t 1), t)

and further by log T > log t and Nj(t 1) [t 1],

uk(Nk(t 1), T) uj(Nj(t 1), t)

1 + ϵ mins<t uj(s, t)

which shows Eq. (14). Second, define

Tj,k = {t {nj,k+1, , T} : gt(j) = k, Nk(t 1) > n j,k}.

From Eq. (14), (II)j,k is upper bounded by

t Tj,k 1 uk(Nk(t 1), T) mins<t{uj(s, t)}

s<t (1{A1,t,s} + 1{A2,t,s} + 1{A3,t,s})

A1,t,s = µk(Nk(t 1)) µk + λT

A2,t,s = n µj µj(s) + λt s o

A3,t,s = µk + 2λT

Nk(t 1) > µj 1+ϵ

See Appendix E.2 for the derivation of Eq. (16). Observe when t Tj,k,

1{A3,t,s} 1

µk + 2λT pnj,k + 1 > µj 1 + ϵ

where the inequality is because Nk(t 1) > nj,k, and the equality is because

nj,k 4λ2 T ( µj 1+ϵ µk)2 = 4λ2 T nj,k + 1 < µj

and also µj 1+ϵ µk > 0 which is ensured by Lemma 5.3 as T > T0. Finally, from Eq. (16) and using Hoeffding s inequality, we get

s<t (1{A1,t,s} + 1{A2,t,s})

e 3 log T + e 3 log t .

See Appendix E.2 for the derivation of the second inequality. The proof is completed by evaluating (PT t=1 P

s<t e 3 log T PT t=1 t T 3 T (T +1)

s<t e 3 log t P t=1 t t3 P t=1 1 t2 π2

Matroid Semi-Bandits in Sublinear Time

6. Conclusion

In this paper, we have presented Faster CUCB, the first sublinear-time algorithm for matroid semi-bandits. Several possible future directions. First, one might extend our approach to speed up UCB-style algorithms for different problems such as combinatorial best-arm identification (Chen et al., 2014; Du et al., 2021) and nonstationary semi-bandits (Zhou et al., 2020; Chen et al., 2021). Second, another direction is to study the possibility of speeding up other forms of weights, such as those derived from gradients (Tzeng et al., 2023) and those in the follow-the-perturbed-leader algorithm (Neu and Bart ok, 2016).

Acknowledgement

Ruo-Chun Tzeng s research is supported by the ERC Advanced Grant REBOUND (834862). Kaito Ariu s research is supported by JSPS KAKENHI Grant No. 23K19986.

Impact Statement

This paper develops an algorithm for matroid semi-bandits. The societal consequences of this work indirectly come from the applications of matroid semi-bandits, and none of which we think should be discussed here.

Zeinab Abbassi, Vahab S Mirrokni, and Mayur Thakur. Diversity maximization under matroid constraints. In Proc. of KDD, 2013.

Alper Atamt urk and Andr es G omez. Maximizing a class of utility functions over the vertices of a polytope. Operations Research, 2017.

S ebastian Bubeck, Tengyao Wang, and Nitin Viswanathan. Multiple identifications in multi-armed bandits. In Proc. of ICML, 2013.

Lijie Chen, Anupam Gupta, and Jian Li. Pure exploration of multi-armed bandit under matroid constraints. In Proc. of COLT, 2016.

Shouyuan Chen, Tian Lin, Irwin King, Michael R Lyu, and Wei Chen. Combinatorial pure exploration of multiarmed bandits. In Proc. of Neur IPS, 2014.

Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework and applications. In Proc. of ICML, 2013.

Wei Chen, Liwei Wang, Haoyu Zhao, and Kai Zheng. Combinatorial semi-bandit in the non-stationary environment. In Proc. of UAI, 2021.

Sayak Ray Chowdhury, Gaurav Sinha, Nagarajan Natarajan, and Amit Sharma. Combinatorial categorized bandits with expert rankings. In Proc. of UAI, 2023.

Richard Combes, Mohammad Sadegh Talebi Mazraeh Shahi, Alexandre Proutiere, and Marc Lelarge. Combinatorial bandits revisited. In Proc. of Neur IPS, 2015.

Richard Combes, Stefan Magureanu, and Alexandre Proutiere. Minimal exploration in structured stochastic bandits. In Proc. of Neur IPS, 2017.

Thibaut Cuvelier, Richard Combes, and Eric Gourdin. Asymptotically optimal strategies for combinatorial semibandits in polynomial time. In Proc. of ALT, 2021a.

Thibaut Cuvelier, Richard Combes, and Eric Gourdin. Statistically efficient, polynomial-time algorithms for combinatorial semi-bandits. In Proc. of SIGMETRICS, 2021b.

R emy Degenne and Vianney Perchet. Combinatorial semibandit with known covariance. In Proc. of Neur IPS, 2016.

Yihan Du, Yuko Kuroki, and Wei Chen. Combinatorial pure exploration with full-bandit or partial linear feedback. In Proc. of AAAI, 2021.

Jack Edmonds. Matroids and the greedy algorithm. Mathematical Programming, 1971.

David Eppstein, Zvi Galil, Giuseppe F. Italiano, and Amnon Nissenzweig. Sparsification a technique for speeding up dynamic graph algorithms. Journal of the ACM, 1997.

Greg N. Frederickson. Data structures for on-line updating of minimum spanning trees, with applications. SIAM Journal on Computing, 1985.

Yi Gai, Bhaskar Krishnamachari, and Rahul Jain. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, 2012.

Todd L Graves and Tze Leung Lai. Asymptotically efficient adaptive choice of control laws incontrolled markov chains. SIAM Journal on Control and Optimization, 1997.

Manoj Gupta and Richard Peng. Fully dynamic (1 + ϵ)- approximate matchings. In Proc. of FOCS, 2013.

Monika Henzinger, Paul Liu, Jan Vondr ak, and Da Wei Zheng. Faster submodular maximization for several classes of matroids. In Proc. of ICALP, 2023.

Jacob Holm, Kristian De Lichtenberg, and Mikkel Thorup. Poly-logarithmic deterministic fully-dynamic algorithms for connectivity, minimum spanning tree, 2-edge, and biconnectivity. Journal of the ACM, 2001.

Matroid Semi-Bandits in Sublinear Time

Shinji Ito. Hybrid regret bounds for combinatorial semibandits and adversarial linear bandits. In Proc. of Neur IPS, 2021.

Kwang-Sung Jun, Aniruddha Bhargava, Robert Nowak, and Rebecca Willett. Scalable generalized linear bandits: Online computation and hashing. In Proc. of Neur IPS, 2017.

Satyen Kale, Lev Reyzin, and Robert E Schapire. Nonstochastic bandit slate problems. In Proc. of Neur IPS, 2010.

Jon Kleinberg and Eva Tardos. Algorithm design. Pearson Education India, 2006.

Fang Kong, Yueran Yang, Wei Chen, and Shuai Li. The hardness analysis of Thompson sampling for combinatorial semi-bandits with greedy oracle. In Proc. of Neur IPS, 2021.

Branislav Kveton, Zheng Wen, Azin Ashkan, Hoda Eydgahi, and Brian Eriksson. Matroid bandits: Fast combinatorial optimization with learning. In Proc. of UAI, 2014a.

Branislav Kveton, Zheng Wen, Azin Ashkan, and Michal Valko. Learning to act greedily: Polymatroid semibandits. ar Xiv preprint ar Xiv:1405.7752, 2014b.

Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. Tight regret bounds for stochastic combinatorial semi-bandits. In Proc. of AISTATS, 2015.

Gergely Neu and G abor Bart ok. Importance weighting without importance weights: An efficient algorithm for combinatorial semi-bandits. JMLR, 2016.

James G Oxley. Matroid theory, 2011.

Orestis Papadigenopoulos and Constantine Caramanis. Recurrent submodular welfare and matroid blocking semibandits. In Proc. of Neur IPS, 2021.

Pierre Perrault. When combinatorial Thompson sampling meets approximation regret. In Proc. of Neur IPS, 2022.

Pierre Perrault, Vianney Perchet, and Michal Valko. Exploiting structure of uncertainty for efficient matroid semibandits. In Proc. of ICML, 2019.

Pierre Perrault, Etienne Boursier, Michal Valko, and Vianney Perchet. Statistical efficiency of thompson sampling for combinatorial semi-bandits. In Proc. of Neur IPS, 2020a.

Pierre Perrault, Michal Valko, and Vianney Perchet. Covariance-adapting algorithm for semi-bandits with application to sparse outcomes. In Proc. of COLT, 2020b.

Alexander Schrijver. Combinatorial Optimization: Polyhedra and Efficiency. Springer, 2003.

Matthew Streeter, Daniel Golovin, and Andreas Krause. Online learning of assignments. In Proc. of Neur IPS, 2009.

Mohammad Sadegh Talebi and Alexandre Proutiere. An optimal algorithm for stochastic matroid bandit optimization. In Proc. of AAMAS, 2016.

Taira Tsuchiya, Shinji Ito, and Junya Honda. Further adaptive best-of-both-worlds algorithm for combinatorial semi-bandits. In Proc. of AISTATS, 2023.

Ruo-Chun Tzeng, Po-An Wang, Alexandre Proutiere, and Chi-Jen Lu. Closing the computational-statistical gap in best arm identification for combinatorial semi-bandits. In Proc. of Neur IPS, 2023.

Jan van den Brand, Danupon Nanongkai, and Thatchaphol Saranurak. Dynamic matrix inverse: Improved algorithms and matching conditional lower bounds. In Proc. of FOCS, 2019.

Siwei Wang and Wei Chen. Thompson sampling for combinatorial semi-bandits. In Proc. of ICML, 2018.

Zheng Wen, Branislav Kveton, and Azin Ashkan. Efficient learning in large-scale combinatorial semi-bandits. In Proc. of ICML, 2015.

Shuo Yang, Tongzheng Ren, Sanjay Shakkottai, Eric Price, Inderjit S Dhillon, and Sujay Sanghavi. Linear bandit algorithms with sublinear time complexity. In Proc. of ICML, 2022.

Huozhi Zhou, Lingda Wang, Lav Varshney, and Ee-Peng Lim. A near-optimal change-detection based algorithm for piecewise-stationary combinatorial semi-bandits. In Proc. of AAAI, 2020.

Matroid Semi-Bandits in Sublinear Time

A. Notation

Problem setting

K the number of arms X the bases of the given matroid ([K], I) D maxx X x 0 µ the mean vector of the K arms ν1, , νK i (µ) an action attaining maxx X µ, x

Notation related to Faster CUCB

Nk(t) the number of arm pulls of arm k x(t) the action selected by the algorithm at round t y(t) the reward vector at round t ˆµk(t) the empirical reward 1 Nk(t) Pt s=1 yk(t)1{xk(t) = 1} of arm k at round t λt the parameter that controls the confidence interval

Notation related to dynamic algorithm

f k = (αk, βk) a nonnegative two-dimensional feature of arm k q a nonnegative two-dimensional query (αlb, αub) lower and upper bounds of αk s (βlb, βub) lower and upper bounds of βk s W (the square root of) the number of bins BINq,r bins that partition the possible region of the features domq,r dominating point of BINq,r dom(f k) dominating point of BINq,r to which f k belongs L = {l1, . . . , l( K 2)} the set of K 2 lines, each of which is orthogonal to line f kf k for some k = k and intersects 0 H a minimum hitting set of the cells in arrangement of L

Notation related to regret analysis

{j}D j=1 the permutation of supp (i ) such that µ1 µD ϵ the precision parameter which is set to 1 logm T in Faster CUCB (Algorithm 5) gt(j) the mapping from supp (i ) to supp (x(t)) such that

(i) gt(j) = j if j supp (i ) supp (x(t))

(ii) xgt(j)(t) = 1 implies D dom(f gt(j)), h E 1 1+ ϵ

3 dom(f j), h

j,k the difference µj µk between arm j s and arm k s expected reward min the smallest positive gap i,j between any pair of i supp (i ) and j / supp (i ) dk the largest j [D] such that δ(j),k > 0 µk(t) the average 1

t Pt s=1 yk(s) of rewards of arm k in the first t rounds uk(s, t) the UCB value of µk(s) + λt s under s samples of arm k and with confidence parameter λt

Table 2. Table of notation.

Matroid Semi-Bandits in Sublinear Time

B. Further Related Works

In this section, we review relevant literatures on combinatorial semi-bandits and sublinear-time bandits. We focus on the stochastic setting. For ease of comparison, we assume the best action i argmaxx X x, µ is unique, and define min minj,k [K]:i j =1,i k=0,µj µk>0(µj µk), and minx =i : i x,µ >0 i x, µ .

Matroid semi-bandits. Kveton et al. (2014a) showed an instance such that any uniformly good algorithm2 suffer R(T) = Ω (K D) log T

. They also showed that CUCB (Gai et al., 2012; Chen et al., 2013) have a regret bound scaling

as O (K D) log T

. Talebi and Proutiere (2016) showed an instance-specific lower bound lim inf T R(T ) log T c(µ) for uniformly good algorithms, where c(µ) is the optimum of a semi-infinite linear program (Graves and Lai, 1997; Combes et al., 2015), and proposed KL-OSM whose regret upper bound matches this lower bound. The per-round compleixty of KL-OSM is K line search for generating the indices plus the time for solving a linear maximization. Both CUCB and KL-OSM rely on the greedy algorithm (Algorithm 1) to solve the linear maximization for determing the action to be pulled. The time complexity of the greedy algorithm is upper bounded by O(K(log K + Tmember)) time and lower bounded by Ω(K). (Perrault et al., 2019) showed that the sampling rule of many combinatorial semi-bandit algorithms is a maximization problem over a summation of a linear function and a submodular function, and proposed two efficient algorithms for matroid semi-bandits: One is based on local search and the other is a greedy algorithm. Both have per-round time complexity at least Ω(KD). In contrast, our Faster CUCB is the first matroid semi-bandit algorithm with per-round time complexity sublinear in K for many classes of matroids.

Combinatorial semi-bandits. Here, we review works that focus on the standard setting of stochastic combinatorial semi-bandits. These consider a linear reward function and any action sets X, where linear maximization maxx X x, v for any v RK can be solved in time polynomial in K. We omit the discussion on works that focus on a specific action set (Chowdhury et al., 2023), with additional structural assumptions on the rewards (Wen et al., 2015; Perrault et al., 2020b), or with a different reward function (Papadigenopoulos and Caramanis, 2021). Perrault et al. (2020a) showed that CTS has a regret bound of O( K log2 D log T

) for mutually independent gaussian rewards and a regret bound of O( KD log2 D log T

) for correlated gaussian rewards. Perrault (2022) sharpen the regret bound of CTS for the case of mutually independent gaussian rewards to be O( K log D log T

). The per-round time complexity of CTS is at least Ω(K) due to sampling from

the posterior distributions. Degenne and Perchet (2016) showed that ESCB2 has regret bound of O( K log2 D log T

) for independent subgaussian rewards, but its sampling rule is NP-hard (Atamt urk and G omez, 2017) to optimize. Cuvelier et al. (2021b) proposed AESCB that approximates ESCB2 with per-round time complexity of O(KD log3 K poly (log T)) while maintaining the same regret bound. Their technique is based on rounding and budgeted-linear maximization. OSSB (Combes et al., 2017) is an asymptotically instance-specifically optimal algorithm for general structured bandits, including combinatorial semi-bandits, but at each round, it requires to solve a semi-infinite linear program (Graves and Lai, 1997). Cuvelier et al. (2021a) developed a method that runs in time polynomial in K to solve the semi-infinite linear program for Gaussian rewards. They managed to maintain OSSB s asymptotic optimality for m-sets, but not for spanning trees and bipartite matchings. Ito (2021) and Tsuchiya et al. (2023) proposed algorithms based on the optimistic FTRL framework that achieve O( KD log T

) regret in the stochastic setting and O( KDT log T) in the adversarial setting. At each round, the proposed algorithms first use FTRL rule to obtain a vector a(t) in the convex hull of X and then sample an action x(t) based on a(t). Tsuchiya et al. (2023) mentioned that the computational efficiency of the sampling step has long been a problem in semi-bandits using the optimistic FTRL framework.

Sublinear-time linear bandits. Several works (Jun et al., 2017; Yang et al., 2022) focusing on making per-round complexity of linear bandits sublinear in the number of arms. Maximum Inner Product Search (MIPS) is the primary tool used to design such algorithms. For N arms in Rd, Q-GLOC (Jun et al., 2017) achieves a high-probability regret bound of O(d 5 4

T) and per-round time complexity of O(d2N ρ log N) for some ρ (0, 1), where O hides polylogarithmic factors in T and d. Yang et al. (2022) considered the setting with arms addition (resp. addition and deletion), and proposed Sub-Elim (resp. Sub-TS), which has a high-probability regret bound of O(d

T) (resp. O(d 3 2

T)) and per-round time complexity of N 1 Θ( 1 T 2 log2 T ) (resp. N 1 Θ( 1

T )). These results are applicable to our setting with d = K and N = |X|. Q-GLOC (Jun et al., 2017) applied to our setting has regret bound of O(K 5 4

T) and per-round time compleixty O(K2|X|ρ).

2A uniformly good algorithm has the expected regret R(T) = o(T α) hold for any α > 0.

Matroid Semi-Bandits in Sublinear Time

Sub-Elim (resp. Sub-TS) Yang et al. (2022) applied to our setting has regret bound of O(K

T) (resp. O(K 3 2

has per-round time complexity of |X|1 Θ( 1 T 2 log T ) (resp. |X|1 Θ( 1

T )). These results have worse regret bounds than what we have obtained, and their per-round time complexity can be exponential in K.

Matroid Semi-Bandits in Sublinear Time

C. Membership Oracles for Different Matroids

In this section, we discuss Tmember for the matroids shown in Section 2.

For uniform matroid, the membership oracle is given x I and k [K]\ supp (x), and has to check whether | supp (x + ek)| D. Suppose the number n = | supp (x)| is maintained. Then, it takes O(1) time to check whether n + 1 D, and hence Tmember = O(1).

For partition matroids, the membership oracle is given x I and k [K]\ supp (x), and has to check whether | supp (x + ek) Si| 1 for all i [D]. Suppose there is an integer array A of size K such that j SA[j] for each j [K], and suppose there is an integer array B of size D such that B[i] = P

j supp(x) 1{j Si} for each i [D]. Then, to decide whether whether x + ek I, it only requires to check whether B[A[k]] + 1 1. This can be implemented in O(1) time, and thus Tmember = O(1).

For graphical matroids, the membership oracle has to detect if there is a cycle. Using the union-find data structure, whether supp (x) {k} has a cycle can be detected in O(log K) time, so we have Tmember = O(log K). Refer to Section 4.6 in (Kleinberg and Tardos, 2006) for more detailed explanation.

For transversal matroids, there is little discussion about its membership oracle. Here we present an implementation to answer a query (x, k) about whether x + ek I, where x I and k [K]\ supp (x). Suppose a maximum matching M on supp (x) V is maintained. Then, answering whether x + ek I is equivalent to checking whether an augmenting path on supp (x + ek) V from M can be found. Finding a augmentation path can be done by a breadth-first search (BFS) starting from k (see Section 17.2 in (Schrijver, 2003)), and it takes O(DK) time because there are at most K leaves in the BFS tree and the length of the path from k to each leaf is at most 2D. Thus, we have Tmember = O(DK).

Matroid Semi-Bandits in Sublinear Time

BIN1,𝑊 BIN2,𝑊 BIN𝑊 1,𝑊BIN𝑊,𝑊

BIN1,𝑊 1 BIN2,𝑊 1 BIN𝑊 1,𝑊 1 BIN𝑊,𝑊 1

BIN1,2 BIN2,2 BIN𝑊 1,2 BIN𝑊,2

BIN1,1 BIN2,1 BIN𝑊 1,1 BIN𝑊,1

𝛽lb 1 + 𝜂𝑊 1

Figure 1. Illustration of feature rounding. There are |W|2 bins, and features are assumed not to be in (the interior of) the shaded area. Each feature f k is rounded to its dominating point dom(f k), which is specified by a curved arrow.

D. Omitted Proofs in Section 4

Proof of Lemma 4.6. By Eq. (8) and the optimality of x dom, for any base x, we have

k supp(x dom) f k, q

> 1 1 + η X

k supp(x dom) dom(f k), q (by Eq. (8))

1 (1 + η)2 X

k supp(x) dom(f k), q (by optimality of x dom)

1 (1 + η)2 X

k supp(x) f k, q (by Eq. (8))

k supp(x) f k, q . (as (1 + η)2 1 + ϵ)

Proof of Lemma 4.7. The proof follows from the uniqueness of the maximum-weight base in the case of distinct weights; see, e.g., (Edmonds, 1971).

Proof of Lemma 4.8. For a query q R2, let C R2 be a cell in arrangement of L whose closure contains q (which may not be uniquely determined). Then, there is a permutation π SK such that for any vector h H C, we have f π(i), h > f π(j), h whenever i < j. Since q is in the closure of C, it holds that f π(i), q f π(j), q for any i < j, implying the proof.

Proof of Lemma 4.9. Since each cell in arrangement of L is a polyhedral cone generated by two lines of L that does not contain any other lines of L, there are only O(K2) cells, and each of their internal points can be found by using Algorithm 6, as desired.

Matroid Semi-Bandits in Sublinear Time

Figure 2. Illustration of characterization of representable permutations. There are three features f 1, f 2, f 3 on R2. Each dashed line denotes f if j for some i = j; each black bold line is orthogonal to some dashed line and intersects the origin. Such black bold lines generate six regions, each corresponding to a distinct permutation. For example, for any query q in the hatched area, it holds that f 1, q > f 2, q > f 3, q ; i.e., q represents a permutation π such that (π(1), π(2), π(3)) = (1, 2, 3).

Algorithm 6 GENERATE-HITTING-SET. Input: K distinct features (f k)k [K]. let Θ ; for all k = k do

let L be a unique line that is orthogonal to line f kf k and intersects 0; add the angle θ of L and θ to Θ; end let H ; for all neighboring (but distinct) θ1 and θ2 in Θ do

let h cos( θ1+θ2

2 ), sin( θ1+θ2

2 ) be an internal point of a polyhedral cone generated by two half-lines whose angles are θ1 and θ2; add h to H; end return H;

Proof of Theorem 4.4. The correctness of FIND-BASE is shown first. Given a query q R2 +, Algorithm 3 finds h H such that f k, h > f k , h implies f k, q f k , q due to Lemma 4.8. Calling Ah finds a (1 + η)-approximate maximum-weight base x of M with arm k s weight defined as dom(f k), h . Since a total order over [K] induced by arm weights dom(f k), h is consistent with that induced by arm weights dom(f k), q , by Lemma 4.7, x is also a (1 + η)-approximate maximum-weight base of M with arm k s weight defined as dom(f k), q . By Lemma 4.6,

k supp(x ) f k, q 1 1 + ϵ X

k supp(x) f k, q , (17)

for any base x of M; namely, x is a (1 + ϵ)-approximate maximum-weight base of M with arm k s weight defined as f k, q , completing the correctness of FIND-BASE.

Subsequently, we bound the time complexity of each subroutine as follows.

INITIALIZE: Construction of BINq,r and domq,r for all q, r W completes in O(K + W 2) time. Then, a hitting set H for domq,r s and 1 1+η domq,r s of size |H| = O(W 4) can be constructed in poly (W) time due to Lemma 4.9. There will be |H| instances of algorithm A (with different arm weights), creating which takes O(W 4 Tinit(A; η)) time.

FIND-BASE: Checking whether each h H and query q R2 + belong to (the closure of) the same cell in arrangement of

Matroid Semi-Bandits in Sublinear Time

V can be done in O(W 2) time by comparing the induced total orders. By brute-force search, a desired h can be found in O(W 6) time. Since calling Ah requires O(D) time, the entire time complexity is bounded by O( poly (W) + D).

UPDATE-FEATURE: For |H| instances of A, a single arm s weight would be changed, each of which runs in Tupdate(A; η) time.

Observe finally that

W = max log1+η

αlb ) + log( βub

= O η 1 log αub

= O ϵ 1 log αub

where we used the fact that 1 log(1+η) < 1

η when η (0, 1), completing the proof.

Matroid Semi-Bandits in Sublinear Time

E. Proofs Related to Regret Analysis

E.1. Proofs Related to the Bijection gt

Lemma 5.2. There exists a bijection gt : supp (i ) supp (x(t)) such that (i) gt(j) = j for j supp (i ) supp (x(t)); (ii) for any j supp (i ) \ supp (x(t)),

xgt(j)(t) = 1 = D dom(f gt(j)), h E

dom(f j), h

1 + 1 3 logm T .

Proof: Let η 1 3 logm T . The proof is inspired by Section 4.2 in (Kveton et al., 2014a), and several changes are made to deal with the usage of the dynamic(1 + η)-approximate maximum-weight basis algorithm in the FIND-BASE procedure.

Let ξt : [D] supp (x(t)) be the ordering such that ξt(i) s arm weight D dom(f ξt(i)), h E is the i-th largest, where h H

lies in the same cell as the query q = (1, λt) when invoking FIND-BASE procedure.

Explicit construction of gt: We define

gt(j) = ξt(π 1 t (j)), j supp (i ) ,

where the function πt : [D] supp (i ) is a bijection such that the following hold:

(i) Pk 1 i=1 eξt(i) + eπt(k) I for all k [D]

(ii) πt(k) = ξt(k) if ξt(k) supp (i ) supp (x(t))

The existence of πt is proved in Lemma E.1 and also by Lemma 1 of (Kveton et al., 2014a).

Show (i) gt(j) = j for j supp (i ) supp (x(t)): Fix any j supp (i ) supp (x(t)). From the definition of πt, we have πt(j) = ξt(j) and hence gt(j) = ξt(π 1 t (j)) = ξt(ξ 1 t (j)) = j.

Show (ii) xgt(j)(t) = 1 = D dom(f gt(j)), h E dom(f j),h

1+η : Fix any j supp (i ) \ supp (x). Let k = π 1 t (j).

Observe that the bijection πt captures the situation that: The algorithm can choose πt(k) supp (i ) as the k-th element but instead chooses ξt(k) supp (x(t)). By the procedure of Algorithm 3 and Assumption 4.2, this happens when D dom(f ξt(k)), h E 1 1 + η

D dom(f πt(k)), h E ,

and replacing k = π 1 t (j) completes the proof.

Lemma E.1. Let x, i X, and ξ : [D] supp (x) be an arbitrary bijection. There exists a bijection π : [D] supp (i ) such that Pk 1 i=1 eξ(i) + eπ(k) I for all k [D].

Proof: This lemma is equivalent to Lemma 1 of (Kveton et al., 2014a). For reader s convenience, we provide a proof here.

For k = D, consider PD 1 i=1 eξ(i) I (due to hereditary property), and i I. As the former has D 1 element while the latter has D elements, by augmentation property, there exists π(D) supp (i ) such that PD 1 i=1 eξ(i) + eπ(D) I. For the case when ξ(D) supp (i ) supp (x), we set π(D) = ξ(D).

The proof is completed by repeating the following process for k = D 1, , 1. As Pk 1 i=1 eξ(i) I (due to hereditary property) has k 1 elements, and i PD i=k+1 eπ(i) I (due to hereditary property) has k elements, by augmentation property, there exists π(k) such that Pk 1 i=1 eξ(i) + eπ(k) I. If ξ(k) supp (i ) supp (x), we set π(k) = ξ(k).

E.2. Lemmas Related to Regret Analysis of Algorithm 5

In this section, we fix a best action i argmaxx X µ, x and define j,k µj µk. Let {j}D j=1 be the permutation of supp (i ) such that µ1 µD. Define dk max{j supp (i ) : j,k > 0} and min mink/ supp(i ) dk,k.

Matroid Semi-Bandits in Sublinear Time

Lemma 5.3. Let ϵ < min

b . Then, for any i supp (i ) and any j / supp (i ),

µi µj > 0 = µi 1 + ϵ µj > 0.

Proof: Fix i supp (i ) and j / supp (i ) such that µi µj > 0. We want the following to hold:

µi 1 + ϵ µj > 0 µi (1 + ϵ)µj > 0 µi µj > ϵµj.

As µi µj > ϵµj must hold for all such i and j, taking the minimum over all possible i and j on the left-hand side, and use the fact that µj b for all j on the right-hand side, we derive

is the condition on ϵ to ensure µi µj > 0 = µi 1+ϵ µj > 0 holds for all i and j.

Lemma 5.4. Let k / supp (i ) and j [dk]. For T > T0,

j=1 j,k(I)j,k

j=1 j,k T0 + 12(b a)2 dk,k log T

( µdk 1+log m T µk)2 .

Proof: Recall (I)j,k = PT t=1 E h 1 n gt(j) = k, Nk(t) nj,k oi , where nj,k = max 6(b a)2 log T ( µj 1+log m T µk)2 , T0

First, we claim that: for any {aj}dk j=1 with a1 adk 0,

j=1 aj(I)j,k a1n1,k +

j=2 aj(nj,k nj 1,k). (19)

Show Eq. (19): We show by induction. For the base case, we have

a1(I)1,k + a2(I)2,k a1n1,k + a2(n2,k n1,k). (20)

Eq. (20) is derived as follows. Since a1, a2 0 and

(I)1,k + (I)2,k =

t=1 E 1 gt(1) = k, Nk(t) n1,k + 1 gt(2) = k, Nk(t) n2,k max{n1,k, n2,k} = n2,k,

therefore we can bound (I)2,k as (I)2,k n2,k (I)1,k, yields that:

a1(I)1,k + a2(I)2,k (a1 a2)(I)1,k + 2,kn2,k.

Then, since a1 a2 and (I)1,k n1,k, we derive

a1(I)1,k + a2(I)2,k (a1 a2)n1,k + a2n2,k,

which shows Eq. (20). Now, assume for any {bj}ℓ j=1 with b1 bℓ 0, the following

j=1 bj(I)j,k b1n1,k +

j=2 bj(nj,k nj 1,k) (21)

holds for ℓ< dk.

Matroid Semi-Bandits in Sublinear Time

Fix any {aj}ℓ+1 j=1 with a1 aℓ+1 0. Consider Pℓ+1 j=1 aj(I)j,k. Since aj 0 for all j [ℓ+ 1] and

j=1 (I)j,k =

j=1 1 gt(j) = k, Nk(t) n1,k

max j [ℓ+1] nj,k = nℓ+1,k,

we can bound (I)ℓ+1,k as (I)ℓ+1,k nℓ+1,k Pℓ j=1(I)j,k, which results in:

j=1 aj(I)j,k

j=1 (aj aℓ+1)(I)j,k + aℓ+1nℓ+1,k.

Since a1 aℓ+1 aℓ aℓ+1 0, using inductive hypothesis Eq. (21) with bj = aj aℓ+1 for all j [ℓ], we get

j=1 aj(I)j,k (a1 aℓ+1)n1,k +

j=2 (aj aℓ+1)(nj,k nj 1,k) + aℓ+1nℓ+1,k.

j=2 aj(nj,k nj 1,k) aℓ+1

j=2 (nj,k nj 1,k) nℓ+1,k

j=2 aj(nj,k nj 1,k) + aℓ+1(nℓ+1,k nℓ,k)

j=2 aj(nj,k nj 1,k).

Thus, Eq. (19) is proved by induction.

Define ϵ log m T and j,k(ϵ) µj 1+ϵ µk. Using Eq. (19) with aj = j,k for j [dk] and recalling nj,k

max n 6(b a)2 log T

j,k(ϵ)2 , T0 o , we have

j=1 j,k(I)j,k 1,kn1,k +

j=2 j,k(nj,k nj 1,k)

j=1 j,k T0 + 6(b a)2 log T

1,k 1,k(ϵ)2 +

1 j,k(ϵ)2 1 j 1,k(ϵ)2

We upper bound the last term by:

1,k 1,k(ϵ)2 +

1 j,k(ϵ)2 1 j 1,k(ϵ)2

j,k(ϵ) j+1,k(ϵ)

( j,k(ϵ))2 + dk,k dk,k(ϵ)2

j,k(ϵ) j+1,k(ϵ)

j,k(ϵ) j+1,k(ϵ) + dk,k dk,k(ϵ)2

1 j+1,k(ϵ) 1 j,k(ϵ)

+ dk,k dk,k(ϵ)2 2 dk,k dk,k(ϵ)2 ,

where the first inequality is due to j,k(ϵ) j+1,k(ϵ), and the second upperbounds the telescoping series:

1 j+1,k(ϵ) 1 j,k(ϵ)

= 1 dk,k(ϵ) 1 1,k(ϵ) 1 dk,k(ϵ)

Matroid Semi-Bandits in Sublinear Time

Hence, we derive an upper bound for the first part relevant to (I)j,k:

j=1 j,k(I)j,k

j=1 j,k T0 + 12(b a)2 dk,k log T

Lemma 5.5. Let k / supp (i ) and j [dk]. For T > T0,

Proof of Lemma 5.5: Let ϵ = 1 logm T . Recall

t=1 E h 1 n gt(j) = k, Nk(t) > nj,k oi ,

where nj,k = max 6(b a)2 log T ( µj 1+log m T µk)2 , T0

First, we claim that

gt(j) = k = uk(Nk(t 1), T) mins<t uj(s, t)

1 + ϵ , (14)

where uk(s, t) = µk(s) + q

1.5(b a)2 log t

s and µk(t) = 1

t Pt s=1 yk(s).

Show Eq. (14): Observe that gt(j) = k implies

f k, q dom(f k), q

D dom(f j), q E

where Eq. (8) is used in the first and the last inequality, and the second inequality is due to Lemma 5.2 and Corollary 4.10. By (1 + ϵ

3)2 1 + ϵ and expanding f k = (ˆµk(t 1), 1

Nk(t 1)) and q = (1, p

1.5(b a)2 log t), we have

uk(Nk(t 1), t) uj(Nj(t 1), t)

As log T > log t and Nj(t 1) [t 1], we further derive

uk(Nk(t 1), T) uj(Nj(t 1), t)

1 + ϵ mins<t uj(s, t)

which shows Eq. (14).

Second, let Tj,k = {t {nj,k + 1, , T} : gt(j) = k, Nk(t 1) > nj,k}. From Eq. (14), we derive

t=nj,k+1 P h gt(j) = k, Nk(t 1) > nj,k i

t=nj,k+1 P uk(Nk(t 1), T) mins<t uj(s, t)

1 + ϵ and t Tj,k

s<t P uk(Nk(t 1), T) uj(s, t)

1 + ϵ and t Tj,k

Matroid Semi-Bandits in Sublinear Time

where the last inequality uses union bound.

Third, we now upper bound each term P h uk(Nk(t 1), T) uj(s,t)

1+ϵ and t Tj,k i in Eq. (22). Remind that

uk(Nk(t 1), T) uj(s, t)

1 + ϵ µk(Nk(t 1)) + λT p

Nk(t 1) | {z } At

µj(s) + λt s 1 + ϵ | {z } Bt,s

Define the event Et,s = {At Bt,s and t Tj,k}. We will partition the event Et,s by comparing At to A t = µk+ 2λT

Nk(t 1) and comparing Bt,s to B = µj 1+ϵ as follows:

Et,s n At A t and t Tj,k o µk(Nk(t 1)) µk + λT

Nk(t 1) and t Tj,k

Et,s {Bt,s B and t Tj,k} n µj µj(s) + λt s and t Tj,k o

Et,s {At < A t and Bt,s > B and t Tj,k} µk + 2λT

Nk(t 1) > µj 1+ϵ and t Tj,k

. The inclusion is because

under the event Et,s {At < A t and Bt,s > B and t Tj,k}, we have

Nk(t 1) = A t > At Bt,s > B = µj 1 + ϵ,

where the first and last inequalities are due to the event {At < A t and Bt,s > B and t Tj,k}, and the second inequality is due to the event Et,s = {At Bt,s and t Tj,k}.

Hence, we have the following inclusion:

{At A t and t Tj,k} {Bt,s B and t Tj,k} {At < A t and Bt,s > B and t Tj,k}

= {t Tj,k} {At Bt,s and t Tj,k} = Et,s.

From union bound,

P[Et,s] P h {At A t and t Tj,k} Et,s i + P h {Bt,s B and t Tj,k} Et,s i

+ P h {At < A t and Bt,s > B and t Tj,k} Et,s i

Nk(t 1) µk(Nk(t 1)) and t Tj,k

+ P µj µj(s) + λt s and t Tj,k

Nk(t 1) > µj 1 + ϵ and t Tj,k

In Eq. (23), recall λt = p

1.5(b a)2 log t and observe that the last term

1.5(b a)2 log T

Nk(t 1) µj 1 + ϵ and t Tj,k

1.5(b a)2 log T

nj,k + 1 µj 1 + ϵ

where the inequality is because t Tj,k implies Nk(t 1) nj,k + 1, and the equality is because

nj,k 6(b a)2 log T

( µj 1+ϵ µk)2 = 6(b a)2 log T

nj,k + 1 < µj

Matroid Semi-Bandits in Sublinear Time

and also we have µj 1+ϵ µk > 0 which is ensured by Lemma 5.3 as T > T0. Finally, from Eq. (22) and Eq. (23),

µk(Nk(t 1)) µk +

1.5(b a)2 log T

Nk(t 1) and t Tj,k

1.5(b a)2 log t

s and t Tj,k

µk(t 1) µk +

1.5(b a)2 log T

1.5(b a)2 log t

e 3 log T + e 3 log t ,

where the second inequality is because {Nk(t 1)}t Tj,k is strictly increasing (as Nk(t) = Nk(t 1) + 1 when gt( j) = k) and thus is a subsequence of {nj,k + 1, , T}, and the last inequality is due to an application of Hoeffding s inequality (Lemma E.2) with s = p

1.5(t 1)(b a)2 log T and n = t 1 to bound the first term and with s = p

1.5s(b a)2 log t and n = s to bound the second term. The proof is completed by evaluating

s<t e 3 log T

t T 3 T(T + 1)

s<t e 3 log t

Lemma E.2 (Hoeffding s inequality). Let X1, , Xn be independent random variables such that Xi [a, b] for all i [n]. Then, for all s > 0,

i=1 (Xi E[Xi]) s