Contextual Conservative Interleaving Bandits

Kei Takemura 1

Abstract

The performance of a bandit algorithm is usually measured by the cumulative rewards of the actions chosen by the algorithm. However, in many real-world applications, the rewards in each round should be good enough for reasons such as safety and fairness. In this paper, we investigate the contextual conservative interleaving bandit problem, which has a performance constraint that requires the chosen actions to be not much worse than given baseline actions in each round. This work is the first to simultaneously consider the following practical situations: (1) multiple actions are chosen in a round, (2) the feature vectors associated with given actions depend on the round, and (3) the performance constraints in each round depend only on the actions chosen in that round. We propose a meta-algorithm, Greedy on Confidence Widths (GCW), that satisfies the performance constraints with high probability. GCW uses a standard bandit algorithm and achieves minimax optimal regret up to logarithmic factors if the algorithm used is also minimax optimal. We improve the existing analyses for the C2UCB algorithm and Thompson sampling to combine them with GCW. We show that these algorithms achieve near-optimal regret when the feasible sets of given actions are the bases of a matroid. Our numerical experiments on a real-world dataset demonstrate that GCW with the standard bandit algorithms efficiently improves performance while satisfying the performance constraints.

1. Introduction

The stochastic multi-armed bandit (MAB) problem and its extensions have been studied as sequential decision-making problems under uncertainty. In the MAB problem, a learner iterates the following process T times: the learner chooses an action from given actions and then obtains a reward for the chosen action. The learner aims to maximize the sum of rewards. The central challenge of this problem is the exploration-exploitation trade-off due to the lack of prior knowledge about the rewards.

1NEC Corporation, Tokyo, Japan. Correspondence to: Kei Takemura . Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

The MAB problem has been generalized in many directions to model more realistic situations in real-world applications. For example, the combinatorial semi-bandit (CS) problem (Gai et al., 2012; Kveton et al., 2014; 2015; Wang & Chen, 2018) enables the learner to choose multiple actions at once and observe the corresponding rewards, the contextual linear bandit (CLB) problem (Abe & Long, 1999; Auer, 2002; Abbasi-Yadkori et al., 2011; Chu et al., 2011) introduces time-dependent actions associated with feature vectors, and the contextual combinatorial semi-bandit (CCS) problem (Qin et al., 2014; Takemura & Ito, 2019; Takemura et al., 2021) generalizes both the CS and the CLB problems. In recent years, the conservative bandit problems (Wu et al., 2016; Kazerouni et al., 2017; Garcelon et al., 2020; Khezeli & Bitar, 2020; Moradipari et al., 2020; Katariya et al., 2019) have been studied, where constraints on the learner's performance are introduced. Specifically, the learner additionally observes baseline actions for the given actions and has to satisfy performance constraints requiring that the learner's performance is not much worse than the performance of the baseline actions.
The performance constraints mitigate the drawback that standard bandit algorithms perform poorly in early rounds. Unfortunately, these studies are not yet suitable for many real-world applications. In real-world applications such as recommendation, the following properties, summarized in Table 1, should often be satisfied simultaneously: (1) the learner chooses multiple actions at once, (2) the feasible and the baseline actions depend on the round, and (3) the performance constraints guarantee the performance of each round. However, none of the existing studies addresses all of these properties. In Section 2, we compare this work with the existing studies on the conservative bandit problems in detail.

This paper investigates the contextual conservative interleaving bandit (CCIB) problem, which satisfies the three properties above. Specifically, the CCIB problem is the CCS problem with stage-wise performance constraints.

Table 1. Conservative bandit problems. Columns: combinatorial action set; time-dependent feasible and baseline actions; stage-wise constraints.
Conservative UCB (Wu et al., 2016)
CLUCB (Kazerouni et al., 2017)
CLUCB2 (Garcelon et al., 2020)
SEGE (Khezeli & Bitar, 2020)
SCLUCB & SCLTS (Moradipari et al., 2020)
iUCB (Katariya et al., 2019)
Algorithm 1 (This work)

As in previous studies on bandit algorithms, we measure the performance of an algorithm by its regret, which is the difference between the sum of the rewards of the optimal actions and that of the algorithm's actions. Our main contribution is to propose an algorithm that is minimax optimal up to logarithmic factors. Specifically, the proposed algorithm achieves, with probability at least $1-\delta$, $\tilde{O}(\min(d\sqrt{kT}, d/\Delta) + dk + \min(k\sqrt{dT/n}, dk/(n\Delta)))$ regret while satisfying the constraints, where $d$ is the dimension of the feature vectors, $k$ is the number of actions to be chosen in a round, $T$ is the number of rounds, $n$ is a parameter that controls how conservative the algorithm is, $\Delta$ is a sub-optimality gap, and $\tilde{O}(\cdot)$ ignores the logarithmic factors in $d$, $k$, $T$, and $1/\delta$. To show that the proposed algorithm is almost optimal, we show matching lower bounds by reducing the CCIB problem to the conservative MAB problem.

The proposed algorithm is a meta-algorithm that uses an algorithm for the CCS problem. Our algorithm interleaves the baseline actions with the actions that the given algorithm recommends in order to satisfy the performance constraints. The regret of the proposed algorithm can be represented as the sum of the regret due to choosing the baseline actions and that due to choosing the given algorithm's actions. In terms of these high-level ideas, our algorithm and analysis are similar to the iUCB algorithm (Katariya et al., 2019) and its regret analysis for the conservative interleaving bandit (CIB) problem, respectively.1

1While the iUCB algorithm is not a meta-algorithm, we consider it to be a combination of a meta-algorithm and a UCB-type algorithm for ease of presentation.

However, we cannot use the analysis for the iUCB algorithm to bound the regret of choosing the baseline actions, due to the time-dependent actions. The time-dependent actions make it difficult to bound the regret suffered by the baseline actions. This regret can be bounded by the widths of the confidence intervals of the reward estimates for the baseline actions and the given algorithm's actions. Since the feasible sets of actions are fixed in the
CIB problem, the iUCB algorithm can choose all of the baseline and the given algorithm's actions over multiple rounds. Thus, a standard analysis for UCB-type algorithms can be used to bound the regret. However, we cannot take this approach for the CCIB problem since a feature vector in a round may never appear in future rounds. Therefore, it is necessary to bound the confidence widths for the actions that are not chosen by the proposed algorithm, since the proposed algorithm cannot choose all of the baseline and the given algorithm's actions. To overcome this difficulty, the proposed algorithm preferentially chooses actions with large confidence widths from the actions of the baseline and the given algorithm. This technique allows us to bound the confidence widths for the actions that are not chosen by the proposed algorithm using those for the proposed algorithm's actions. Consequently, we can use existing analyses to bound the regret due to satisfying the performance constraints. Since the proposed algorithm chooses the actions with large confidence widths, a sophisticated analysis is needed to bound the sum of these confidence widths. To the best of our knowledge, our analysis is the first to bound the regret by the confidence widths for the actions not chosen by the algorithm. We believe that our technique is useful for sibling problems.

We consider the C2UCB algorithm (Qin et al., 2014) and the (round-wise) Thompson sampling (Takemura & Ito, 2019) as the algorithm that the proposed meta-algorithm uses. We show that these algorithms achieve near-optimal regret bounds for the CCS problem with the feasible actions characterized by any matroid. Moreover, we show the first gap-dependent regret bound for the C2UCB algorithm. Specifically, the C2UCB algorithm and the Thompson sampling achieve $\tilde{O}(\min(d\sqrt{kT}, d/\Delta)+dk)$ and $\tilde{O}(d^{3/2}\sqrt{kT}+dk)$ regret, respectively. The first term of the regret bound of the Thompson sampling has a gap from the optimal bound. However, the same gap also appears in the regret bound of Thompson sampling for the CLB problem (Agrawal & Goyal, 2013; Abeille & Lazaric, 2017). Note that the proposed algorithm is minimax optimal if it uses the C2UCB algorithm.

We evaluate the proposed algorithm through numerical experiments on a real-world dataset. We conduct our experiments in two cases. One is the case where the feasible and the baseline actions are fixed across the rounds. The other is the case where the given actions depend on the round. In the first case, we compare the proposed algorithm with the iUCB algorithm and its variant that uses the feature vectors to estimate the rewards and the confidence intervals. We also compare our algorithm with the baseline actions of the proposed algorithm and some standard bandit algorithms in both cases. In the first case, the proposed algorithm outperforms the other algorithms except for the variant of the iUCB algorithm. The variant of the iUCB algorithm is comparable to the proposed algorithm in terms of regret. However, the variant of the iUCB algorithm is unstable in terms of the performance constraints. In the second case, the proposed algorithm behaves similarly to the first. In other words, we observe that our algorithm is robust to changes in the given actions.

2. Related Work

2.1. Conservative Bandits

The concept of the conservative bandit problems was proposed by Wu et al. (2016).
They studied the MAB problem with constraints requiring that, in each round, the cumulative reward up to that round is not much worse than the cumulative reward of the baseline. This work was generalized to the CLB problem (Kazerouni et al., 2017; Garcelon et al., 2020). While these studies consider constraints on cumulative rewards, the performance constraints should guarantee the performance of each round in real-world applications. For instance, consider a recommender system. If the system recommends items that do not match the target user's preferences at all due to exploration by the bandit algorithms, this user may never use the system again. Furthermore, it is unfair to recommend unfavorable items to one user due to the exploration but favorable items to another.

Khezeli & Bitar (2020) and Moradipari et al. (2020) introduced stage-wise constraints, which consider the rewards of each round, to the linear bandit problem with a time-independent convex action set. While the stage-wise constraints solve the performance issue in early rounds, these studies strongly rely on the assumption that the action set is time-independent and convex. In other words, we cannot apply these studies to many real-world applications (e.g., the recommender system considered above).

Katariya et al. (2019) studied the CIB problem, which is the CS problem with baseline actions and stage-wise performance constraints. They simultaneously handled the non-convex action set and the stage-wise constraints by introducing the problem in which the learner chooses multiple actions in a round. Moreover, the learner must often choose multiple actions simultaneously in real-world applications. Note that the CIB and CCIB problems denote the problems of maximizing the sum of rewards, while the CS and CCS problems cover some classes of non-linear functions of the rewards.

While the CIB problem assumes that the feasible and the baseline actions do not depend on the round, the feasible and the baseline actions often depend on the round in real-world applications. In news article recommendation, for example, news articles would change over the rounds as old news articles are deleted or the latest news articles are added (Li et al., 2010; Wang et al., 2017). The given actions correspond to the news articles in this example. Note that the feasible and the baseline actions could depend on the round even if the given actions are fixed over the rounds.

2.2. Contextual Combinatorial Semi-Bandits

Qin et al. (2014) proposed the CCS problem and the C2UCB algorithm. Takemura & Ito (2019) and Takemura et al. (2021) improved the regret bound of the C2UCB algorithm. In particular, Takemura et al. (2021) showed that the C2UCB algorithm is minimax optimal up to logarithmic factors when the partition matroid characterizes the feasible actions. Our analysis of the C2UCB algorithm is an extension of the analysis by Takemura et al. (2021). Takemura & Ito (2019) proposed Thompson sampling algorithms for the CCS problem and showed that these algorithms achieve $\tilde{O}(\max(\,\cdot\,)\sqrt{kT})$ regret. Unlike the C2UCB algorithm, Thompson sampling algorithms for the CCS problem with feasible actions characterized by matroids have not been studied. In addition, a gap-dependent bound for the Thompson sampling algorithms in the CCS problem is not known.

3. Problem Setting

3.1. Contextual Conservative Interleaving Bandits

We formally define the CCIB problem.
This problem consists of T rounds, and a learner chooses a set of k actions from a given family of action sets in each round. Each action, called an arm, is associated with a d-dimensional feature vector. Let N denote the number of given arms, and let each arm be indexed by an integer in $[N] := \{1, 2, \ldots, N\}$. Note that the learner knows the above parameters in advance.

The learner progresses through each round as follows. At the beginning of the t-th round, the learner observes the feature vectors $\{x_t(i)\}_{i \in [N]} \subset \mathbb{R}^d$ of the arms. In addition, the learner observes a family $\mathcal{S}_t \subseteq 2^{[N]}$ of feasible sets of arms and a feasible set $B_t \in \mathcal{S}_t$ called the baseline arms. Then, the learner chooses a feasible set $I_t \in \mathcal{S}_t$. We assume that each feasible set of arms is of size k, i.e., $|I| = k$ for all $I \in \mathcal{S}_t$. At the end of the round, the learner obtains the rewards $\{r_t(i)\}_{i \in I_t}$, where for all $i \in I_t$, $r_t(i) = \theta^\top x_t(i) + \eta_t(i)$ for some unknown parameter $\theta \in \mathbb{R}^d$ and a random noise $\eta_t(i) \in \mathbb{R}$. We introduce the assumption on $\eta_t(i)$ in Section 3.2.

We evaluate the performance of the learner by the regret $R(T)$, which is defined as
$R(T) = \sum_{t \in [T]} \left( \sum_{i \in I^*_t} \mu_t(i) - \sum_{i \in I_t} \mu_t(i) \right)$,
where $\mu_t(i) = \theta^\top x_t(i)$ for all $i \in [N]$ and $t \in [T]$, and $I^*_t \in \operatorname{argmax}_{I \in \mathcal{S}_t} \sum_{i \in I} \mu_t(i)$ for all $t \in [T]$. The learner aims to minimize the regret, which is equivalent to maximizing $\sum_{t \in [T]} \sum_{i \in I_t} \mu_t(i)$.

Compared to the standard bandit problems, the CCIB problem additionally introduces stage-wise performance constraints as in the CIB problem (Katariya et al., 2019). These constraints require that the performance of the chosen arms in each round should not be much worse than the performance of the baseline arms of that round. More precisely, with high probability, there exists a bijection $\rho_t \colon I_t \to B_t$ such that
$\sum_{i \in I_t} \mathbb{1}(\mu_t(i) < \mu_t(\rho_t(i))) \le m$   (1)
for all $t \in [T]$, where $m$ is a non-negative parameter that represents how conservative $I_t$ must be.2 Our constraint allows aggressive exploration when $m$ is large. Note that choosing $B_t$ satisfies this constraint.

Several existing studies (Wu et al., 2016; Kazerouni et al., 2017; Garcelon et al., 2020; Moradipari et al., 2020) define a performance constraint using the sum of obtained rewards. An adaptation of this constraint to our problem is that
$\sum_{i \in I_t} \mu_t(i) \ge (1 - \alpha) \sum_{i \in B_t} \mu_t(i)$   (2)
for a fixed $\alpha \in [0, 1]$, for all $t \in [T]$, with high probability. In general, this constraint is not directly comparable to our constraint. We show this fact in Appendix B.1. However, we show that constraint (1) implies constraint (2) under a reasonable assumption, which will be discussed in Section 3.2.

3.2. Assumptions

We introduce several assumptions on the CCIB problem and define a few parameters of the problem.

2We assume that the parameters k and m are time-independent for simplicity. However, our algorithm and analysis can easily be extended to time-dependent parameters.

First, we discuss the assumptions on the feature vectors, the unknown parameter, the rewards, and their noises. Recall that $\mu_t(i) = \theta^\top x_t(i)$ and $r_t(i) = \mu_t(i) + \eta_t(i)$. We make the following standard assumptions from the CCS problem (Qin et al., 2014; Takemura & Ito, 2019; Takemura et al., 2021) and the conservative bandit problems (Wu et al., 2016; Kazerouni et al., 2017; Garcelon et al., 2020):

Assumption 3.1. The feature vectors $\{x_t(i)\}_{i \in [N]}$ are determined by an oblivious adversary.

Assumption 3.2. The learner knows $M$ and $L$ such that $\|\theta\|_2 \le M$ and $\|x_t(i)\|_2 \le L$ for all $i \in [N]$ and $t \in [T]$.

Assumption 3.3. The mean reward $\mu_t(i)$ satisfies $\mu_t(i) \in [0, 1]$ for all $i \in [N]$ and $t \in [T]$.

Assumption 3.4.
The noise sequence $\{\eta_t(i)\}_{i \in I_t, t \in [T]}$ is conditionally R-sub-Gaussian, i.e.,
$\mathbb{E}\left[\exp(\lambda \eta_t(i)) \mid \{x_s(j)\}_{j \in I_s, s \in [t]}, \{\eta_s(j)\}_{j \in I_s, s \in [t-1]}\right] \le \exp(\lambda^2 R^2 / 2)$
for all $\lambda \in \mathbb{R}$, $i \in I_t$, and $t \in [T]$.

Note that Assumption 3.4 implies that the mean of $\eta_t(i)$ is zero for all $i \in I_t$ and $t \in [T]$.

Next, we define the requirements for the algorithm used by the proposed meta-algorithm.

Assumption 3.5. The algorithm used by the proposed meta-algorithm works for the CCS problem (i.e., the CCIB problem without the performance constraints). The algorithm has an internal model for the estimation of rewards. In a round t, the algorithm runs as follows:
1. It estimates the rewards of the given arms based on the observed feature vectors.
2. It chooses arms by solving $\operatorname{argmax}_{I \in \mathcal{S}_t} \sum_{i \in I} r'_t(i)$, where $\{r'_t(i)\}_{i \in [N]}$ is the estimation of the rewards.
3. It plays the chosen arms and updates the internal model using the feature vectors, the chosen arms, and the observed rewards.
Furthermore, the algorithm can retry the second step with another family $\mathcal{S}'_t$ of feasible sets as long as this happens before executing the third step.

Note that standard bandit algorithms such as UCB-type algorithms, Thompson sampling, and the (ε-)greedy algorithm satisfy Assumption 3.5.

Next, we discuss the family of feasible sets of arms, i.e., $\mathcal{S}_t$. As in the CIB problem (Katariya et al., 2019), we focus on exchangeable sets:

Definition 3.6 (Exchangeable set). Given a set E, a family $\mathcal{S} \subseteq 2^E$ is exchangeable if for any two sets $I_1, I_2 \in \mathcal{S}$, there exists a bijection $\rho \colon I_1 \to I_2$ such that
$\forall J \subseteq I_1, \; (I_1 \setminus J) \cup \{\rho(i) \mid i \in J\} \in \mathcal{S}$.   (3)

Assumption 3.7. The family of feasible sets of arms, $\mathcal{S}_t$, is exchangeable for all $t \in [T]$.

Assumption 3.8. For any $t \in [T]$ and $I_1, I_2 \in \mathcal{S}_t$, the learner knows a bijection $\rho \colon I_1 \to I_2$ that satisfies (3).

The set E in Definition 3.6 corresponds to [N] in our problem. Note that if $\mathcal{S}_t$ is the set of bases of a uniform matroid or a partition matroid, it is known how to construct a bijection that satisfies (3) (Katariya et al., 2019); a small sketch for the uniform-matroid case is given at the end of this section.

We now discuss how large the class of exchangeable sets is. Katariya et al. (2019) pointed out that the set of bases of a strongly base-orderable matroid is exchangeable, but whether the converse holds remained an open question. This paper answers this question in the affirmative, i.e., we show that every exchangeable set is the set of bases of some strongly base-orderable matroid. We defer the proof to Appendix C. It is known that the class of strongly base-orderable matroids includes the uniform matroid, the partition matroid, and the transversal matroid.

Next, we discuss the lower bound on the parameter m. Intuitively, a small m makes the CCIB problem difficult. In fact, when m = 0, no algorithm can achieve sub-linear regret while satisfying our performance constraint. We defer the proof to Appendix E.1. To obtain sub-linear regret, we assume the following:

Assumption 3.9. The parameter m satisfies $m \ge 1$.

Finally, we introduce a reasonable assumption for a reduction from constraint (2) to constraint (1). We defer the proof to Appendix B.2.

Assumption 3.10. Let $r_\ell = \min_{t \in [T]} \min_{i \in B_t} \mu_t(i)$. Then, the parameters $k$, $m$, $\alpha$, and $r_\ell$ satisfy $\alpha r_\ell \ge m / (k - m)$.

Note that we do not need Assumption 3.10 to satisfy constraint (1). Since we have this reduction, we focus on constraint (1) in the rest of this paper.
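The exchange property (3) is easiest to see for the bases of a uniform matroid, i.e., all size-k subsets of [N]. The following is a minimal sketch for that case only (the function name and the concrete matching rule are illustrative, not from the paper): a bijection that fixes the common arms and matches the remaining arms in an arbitrary order satisfies (3), because swapping any subset of matched pairs preserves the set size.

```python
def uniform_matroid_bijection(I1, I2):
    """Bijection rho: I1 -> I2 satisfying the exchange property (3) when the
    feasible family is all size-k subsets of [N] (bases of a uniform matroid).
    Common arms are mapped to themselves; the rest are matched arbitrarily."""
    I1, I2 = list(I1), list(I2)
    assert len(I1) == len(I2)
    common = set(I1) & set(I2)
    only1 = [i for i in I1 if i not in common]
    only2 = [j for j in I2 if j not in common]
    rho = {i: i for i in common}
    rho.update(dict(zip(only1, only2)))
    return rho

# Replacing any subset J of I1 by its image under rho keeps |I| = k, so the
# resulting set is again a size-k subset, i.e., feasible, as required by (3).
print(uniform_matroid_bijection({1, 2, 3}, {2, 4, 5}))  # e.g. {2: 2, 1: 4, 3: 5}
```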
4. Proposed Algorithm

In this section, we propose an algorithm for the CCIB problem. Our algorithm, described in Algorithm 1, is the first to deal with time-dependent feature vectors and stage-wise performance constraints, as discussed in Section 1. Since our algorithm is a meta-algorithm, it requires a base algorithm A for the CCS problem as input.

Algorithm 1 Greedy on Confidence Widths
Input: $\lambda > 0$, $\delta \in (0, 1)$, $n \in \mathbb{N}$, and an algorithm A.
1: $V_0 \leftarrow \lambda I$ and $b_0 \leftarrow 0$.
2: for $t = 1, 2, \ldots, T$ do
3: Observe $\{x_t(i)\}_{i \in [N]}$, $\mathcal{S}_t$, and $B_t$.
4: $\hat{I}_t \leftarrow \operatorname{argmax}_{I \in \mathcal{S}_t} \sum_{i \in I} r'_t(i)$, where $\{r'_t(i)\}_{i \in [N]}$ is the estimation of rewards by A.
5: Define $\bar{r}_t(i)$ for all $i \in [N]$ according to (4).
6: $\bar{I}_t \leftarrow \operatorname{argmax}_{I \in \mathcal{S}_t} \sum_{i \in I} \bar{r}_t(i)$.
7: Let $\bar{\rho}_t \colon \bar{I}_t \to \hat{I}_t$ be the bijection that satisfies (3).
8: $I^0_t \leftarrow \bar{I}_t$ and $J^0_t \leftarrow \emptyset$.
9: for $\ell = 1, 2, \ldots, n$ do
10: $i^\ell_t \leftarrow \operatorname{argmax}_{i \in I^{\ell-1}_t} \max(c_t(i), c_t(\bar{\rho}_t(i)))$.
11: $I^\ell_t \leftarrow I^{\ell-1}_t \setminus \{i^\ell_t\}$.
12: $j^\ell_t \leftarrow \operatorname{argmax}_{j \in \{i^\ell_t, \bar{\rho}_t(i^\ell_t)\}} c_t(j)$.
13: $J^\ell_t \leftarrow J^{\ell-1}_t \cup \{j^\ell_t\}$.
14: end for
15: Choose $I_t \in \mathcal{S}_t$ such that $|I_t \setminus \bar{I}_t| \le n$, $J^n_t \subseteq I_t \subseteq \hat{I}_t \cup \bar{I}_t$, and $i \in I_t$ if and only if $\bar{\rho}_t(i) \notin I_t$ for all $i \in \bar{I}_t \setminus \hat{I}_t$.
16: Play $I_t \in \mathcal{S}_t$ and observe the rewards $\{r_t(i)\}_{i \in I_t}$.
17: Feed $\hat{I}_t \cap I_t$ and the corresponding rewards to A, and update A as if it had chosen $\hat{I}_t \cap I_t$.
18: $V_t \leftarrow V_{t-1} + \sum_{i \in I_t} x_t(i) x_t(i)^\top$.
19: $b_t \leftarrow b_{t-1} + \sum_{i \in I_t} r_t(i) x_t(i)$.
20: end for

At the beginning of the t-th round, our algorithm observes the feature vectors $\{x_t(i)\}_{i \in [N]}$, the feasible sets $\mathcal{S}_t$, and the baseline arms $B_t$ (line 3). Then, our algorithm receives the arms $\hat{I}_t$ that the base algorithm A recommends (line 4). In the following and in our analysis, we call $\hat{I}_t$ the recommended choice.

Then, the proposed algorithm starts to interleave the recommended arms with the baseline arms. In the first step of our interleaving, the proposed algorithm obtains safe arms $\bar{I}_t$ such that there exists a bijection $\rho_t \colon \bar{I}_t \to B_t$ satisfying $\mu_t(i) \ge \mu_t(\rho_t(i))$ for all $i \in \bar{I}_t$ with high probability. To obtain the safe arms, we define the following rewards and solve the maximization problem (lines 5 and 6):
$\bar{r}_t(i) = \begin{cases} \hat{r}_t(i) & \text{if } i \in B_t \\ \check{r}_t(i) & \text{if } i \in \hat{I}_t \setminus B_t \\ -\infty & \text{otherwise,} \end{cases}$   (4)
where $\hat{r}_t(i) = \hat{\theta}_t^\top x_t(i) + c_t(i)$, $\check{r}_t(i) = \hat{\theta}_t^\top x_t(i) - c_t(i)$, $\hat{\theta}_t = V_{t-1}^{-1} b_{t-1}$, $c_t(i) = \beta_t(\delta) \sqrt{x_t(i)^\top V_{t-1}^{-1} x_t(i)}$, and $\beta_t(\delta) = \tilde{O}(\sqrt{\min(\log N, d)} + M \sqrt{\lambda})$ for all $i \in [N]$. The formal definition of $\beta_t(\delta)$ is given in Appendix D.1. Note that $\bar{I}_t \subseteq \hat{I}_t \cup B_t$ because $B_t$ is a feasible solution with a finite objective value. Note also that since $\mathcal{S}_t$ is the set of bases of a matroid, this optimization problem can be solved efficiently by the greedy algorithm (Korte & Vygen, 2018).

Then, the proposed algorithm chooses the arms from $\hat{I}_t \cup \bar{I}_t$ for exploration. Our algorithm needs a positive integer n, which controls the maximum number of arms used for exploration. The algorithm preferentially chooses n arms with large confidence widths as $J^n_t$ (lines 8-14). Then, the algorithm chooses the playing arms $I_t$, which include $J^n_t$ (line 15). Note that one can choose any set of arms that satisfies all the conditions in line 15 of Algorithm 1. For example, the set $(\bar{I}_t \setminus \{\bar{\rho}_t^{-1}(j) \mid j \in J^n_t \cap (\hat{I}_t \setminus \bar{I}_t)\}) \cup (J^n_t \cap (\hat{I}_t \setminus \bar{I}_t))$ satisfies the conditions. If $|J^n_t \cap (\hat{I}_t \setminus \bar{I}_t)| < n$ and $\hat{I}_t \setminus (J^n_t \cup \bar{I}_t) \neq \emptyset$, we can further exchange the arms in $\hat{I}_t \setminus (J^n_t \cup \bar{I}_t)$ for the corresponding baseline arms.

At the end of the t-th round, the proposed algorithm and the base algorithm A update their reward estimates using the observed rewards (lines 17-19). This update of A can be seen as a retry of choosing arms by A. In fact, it holds that $\hat{I}_t \cap I_t \in \operatorname{argmax}_{I \in \mathcal{S}'_t} \sum_{i \in I} r'_t(i)$ for some $\mathcal{S}'_t$ (see our proof of Lemma 5.2 for details). Therefore, the base algorithm A runs on an instance of the CCS problem. In other words, we can use existing regret bounds for A to bound the regret due to $\hat{I}_t \cap I_t$.
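To make the per-round interleaving concrete, the following is a minimal sketch of lines 4-15 of Algorithm 1 for the special case where the feasible family is all size-k subsets of [N] (a uniform matroid), so that both maximization steps reduce to taking the top-k arms. The function name, the bijection that fixes common arms, and the way the base algorithm's recommendation is passed in are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def gcw_select(x, theta_hat, V_inv, beta, B, I_hat, k, n):
    """One round of the interleaving step (lines 4-15 of Algorithm 1),
    specialized to the uniform matroid S_t = {I subset of [N] : |I| = k}."""
    mean = x @ theta_hat                                        # hat{theta}_t^T x_t(i)
    c = beta * np.sqrt(np.einsum("id,de,ie->i", x, V_inv, x))   # widths c_t(i)
    # Modified rewards (4): optimistic for baseline arms, pessimistic for
    # recommended arms, -inf for all other arms.
    r_bar = np.full(x.shape[0], -np.inf)
    r_bar[list(I_hat)] = mean[list(I_hat)] - c[list(I_hat)]
    r_bar[list(B)] = mean[list(B)] + c[list(B)]
    # Safe arms bar{I}_t: the top-k arms under the modified rewards (line 6).
    I_bar = set(np.argsort(r_bar)[-k:].tolist())
    # Bijection bar{rho}_t : bar{I}_t -> hat{I}_t that fixes common arms.
    rho = {i: i for i in I_bar & set(I_hat)}
    rho.update(zip(sorted(I_bar - set(I_hat)), sorted(set(I_hat) - I_bar)))
    # Greedily collect n arms with the largest confidence widths (lines 8-14).
    pool, J = set(I_bar), []
    for _ in range(min(n, k)):
        i = max(pool, key=lambda a: max(c[a], c[rho[a]]))
        pool.remove(i)
        J.append(i if c[i] >= c[rho[i]] else rho[i])
    # Playing arms I_t (line 15): swap in the recommended arms that entered J
    # and drop their matched safe arms, so that J is contained in I_t.
    rho_inv = {v: u for u, v in rho.items()}
    swapped_in = {j for j in J if j not in I_bar}
    swapped_out = {rho_inv[j] for j in swapped_in}
    return sorted((I_bar - swapped_out) | swapped_in)
```

After the round, $V_t$ and $b_t$ would be updated with the played arms (lines 18-19), and only the arms in $\hat{I}_t \cap I_t$ and their rewards would be fed back to the base algorithm (line 17).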
To bound the regret of the proposed algorithm, the central part of our analysis (and of the analysis by Katariya et al. (2019)) is to bound the regret due to $\{I_t \setminus \hat{I}_t\}_{t \in [T]}$, i.e., the penalty for satisfying our performance constraint. We call the set $I_t \setminus \hat{I}_t$ the conservative choice. Note that $I_t \setminus \hat{I}_t \subseteq B_t$.

The regret due to the conservative choices can be bounded by the confidence widths for the conservative choices and those for the recommended choices. For the CIB problem (i.e., $\{x_t(i)\}_{i \in [N]}$, $\mathcal{S}_t$, and $B_t$ are fixed over the rounds),3 Katariya et al. (2019) showed that the confidence widths of the iUCB algorithm for the conservative choices can be bounded by those for the recommended choices. Since $\mathcal{S}_t$ is fixed, the iUCB algorithm chooses the conservative and the recommended arms over multiple rounds, and it is possible to use a standard analysis for UCB-type algorithms to bound the regret. However, we cannot use this approach for the CCIB problem because of the contextual setting. First, we cannot bound the confidence widths for the conservative choices by those for the recommended choices since the baseline arms may change. Second, our algorithm may not be able to choose the recommended choices in other rounds because their feature vectors may not appear in other rounds. To solve these problems, our algorithm chooses $J^n_t$. As discussed in Section 1, choosing $J^n_t$ enables us to bound the confidence widths for the arms that are not chosen by our algorithm using those for the playing arms $I_t$. Then, we can utilize the existing analyses for the standard bandit algorithms. We formally show this fact in the next section.

3Though Katariya et al. (2019) do not consider the case where feature vectors are associated with the arms, the iUCB algorithm can be applied to the problem with time-independent feature vectors by ignoring the feature vectors.

5. Regret Analysis

In this section, we introduce and prove the regret bound of the proposed algorithm. Let $\mathcal{S}^*_t = \operatorname{argmax}_{I \in \mathcal{S}_t} \sum_{i \in I} \mu_t(i)$. Throughout our analysis, we fix $I^*_t \in \mathcal{S}^*_t$ arbitrarily for all $t \in [T]$ and assume that $\mathcal{S}_t \neq \mathcal{S}^*_t$ for all $t \in [T]$ for ease of presentation. Note that we incur no regret in a round with $\mathcal{S}_t = \mathcal{S}^*_t$. Then, we define the following sub-optimality gaps:4
$\Delta_t(i, j) = \mu_t(j) - \mu_t(i)$ and $\Delta = \min_{t \in [T],\, I \in \mathcal{S}_t \setminus \mathcal{S}^*_t,\, i \in I,\, i^* \in I^*_t :\, \Delta_t(i, i^*) > 0} \Delta_t(i, i^*)$.

Using the gap $\Delta$, we state our regret bound:

Theorem 5.1. Let $\kappa = \min(\log N, d)$, $\nu = \log(kT)$, and $\iota = \min(\log(NkT/\delta), d \log(kT/\delta))$. If $\lambda = (R/M)^2 \kappa$ and $n \le m$, then with probability at least $1 - \delta$, Algorithm 1 satisfies performance constraint (1) for all $t \in [T]$ and achieves the following regret bound:
$R(T) = R_A(T) + O\left(\min\left(d\sqrt{kT\nu\iota},\, d\nu\iota/\Delta\right) + dk\nu + \min\left(k\sqrt{dT\nu\iota/n},\, dk\nu\iota/(n\Delta)\right)\right)$,
where $R_A(T)$ is the regret bound of the base algorithm A for the CCS problem.

Our regret bound consists of the regret bound of the base algorithm A and the penalty for satisfying the performance constraints. We show in Section 6 that the penalty term is minimax optimal up to logarithmic factors. Furthermore, the penalty term matches the regret bound of the iUCB algorithm by Katariya et al. (2019) when we apply the proposed algorithm to the CIB problem.

In the remainder of this section, we prove Theorem 5.1 and the regret bounds for the two algorithms used as the base algorithm A. We defer all missing proofs to Appendix D. Recall that $\hat{I}_t$ is the recommended choice (line 4 of Algorithm 1) and $I_t \setminus \hat{I}_t$ is the conservative choice. Let $\hat{\rho}^*_t \colon \hat{I}_t \to I^*_t$ be the bijection satisfying (3). Let $\rho^*_t \colon I_t \to I^*_t$ denote the map with $\rho^*_t(i) = \hat{\rho}^*_t(i)$ if $i \in \hat{I}_t$ and $\rho^*_t(i) = \hat{\rho}^*_t(\bar{\rho}_t(i))$ otherwise.
Let Rt(i) = µt(ρ t (i)) µt(i) for i It \ ˆIt. Our analysis decomposes the regret as follows and bounds them separately: i It ˆIt Rt(i) + X i It\ˆIt Rt(i) The former is the regret of the recommended choices, and the latter is the regret of the conservative choices. We have the following lemma for the regret by the recommended choices: 4If the feature vectors are fixed over the rounds, we can define alternative sub-optimality gaps e,min and e ,min by Katariya et al. (2019). In this case, we have = min( e,min, e ,min). Contextual Conservative Interleaving Bandits Lemma 5.2. We have P i It ˆIt Rt(i) RA(T). As a common tool for our analysis, we bound the estimation errors of the rewards as in previous work (Qin et al., 2014; Takemura & Ito, 2019; Takemura et al., 2021). Let x A denote x Ax for all x Rd and A Rd d such that A is positive definite. Then, we have the following: Lemma 5.3. Suppose that δ (0, 1). We have, with probability at least 1 δ, for all t [T] and i [N], |µt(i) ˆθ t xt(i)| βt(δ) xt(i) V 1 t 1. (6) In our proof of Theorem 5.1, we assume that (6) holds. 5.1. Regret by Conservative Choices Similar to the analysis by Katariya et al. (2019), we can bound the regret of an arm in the conservative choices by the sum of the confidence widths for that arm and the corresponding arm in the recommended choices: Lemma 5.4. For any t [T] and any i It \ ˆIt, we have µt(ρ t (i)) µt(i) 2(ct(i) + ct( ρt(i))). Since the confidence widths of Jn t is larger than those of ( It ˆIt) \ Jn t , we can bound the regret by the sum of the confidence widths of Jn t . However, to bound such large confidence widths, we need two additional theoretical tools. First, we show that the number of times the confidence widths for Jn t are greater than the sub-optimality gap can be bounded. Recall that jℓ t is defined in line 15 of Algorithm 1 for all ℓ [n]. Lemma 5.5. Let γℓ= ℓ /(4βT (δ)) for all ℓ [n]. Then, for any ℓ [n], we have P t [T ] 1(ct(jℓ t) /4) 2d log(1 + L2k T/(dλ))/ min(γ2 ℓ, 1). Second, we show that the conservative choice suffers no regret if their confidence widths are smaller than the suboptimality gap. Recall that ˆρ t : ˆIt I t and ρt(i) : It ˆIt are the bijections satisfying (3). Lemma 5.6. For any i It, if ct(i)+ct( ρt(i)) < /2, we have µt(i) = µt( ρt(i)) = µt(ˆρ t ( ρt(i))). We are ready to bound the regret due to the conservative choices. The regret can be rewritten as X i It\ˆIt Rt(i) = X i It\(ˆIt Jn t ) Rt(i) (7) i Jn t \ˆIt Rt(i). (8) First, we bound the regret due to the right-hand side of (7). Let I t = It \(ˆIt Jn t ). Since βT (δ) βt(δ) for all t [T] and ct(i) + ct( ρt(i)) 2ct(jn t ) for all It \ Jn t and t [T], by Lemma 5.4 and Lemma 5.6, we obtain X i I t Rt(i) X i I t Rt(i) + X i I t Rt(i) t Φ xt(jn t ) V 1 t + k|Ψ|, where Φ = {t [T] | ct(jn t ) 4 , xt(jn t ) 2 e V 1 t 1,n 1 Ψ = {t [T] | xt(jn t ) 2 e V 1 t 1,n > 1 n}, and e Vt,ℓ= λI + P j Jℓ t xs(j)xs(j) for all ℓ [n] and t [T]. We consider the rounds in Φ. Let ξ = log(1 + L2k T/(dλ)). Then, we have t Φ xt(jn t ) V 1 t 1 1 j Jn t min 1 n, xt(j) e V 1 t 1,n v u u tn|Φ| X j Jn t min 1 n, xt(j) 2 e V 1 t 1,n n , 2d n min(γn, 1)ξ where the first inequality is derived from the facts that Vt e Vt,n and ct(jn t ) ct(jℓ t) for all ℓ [n] and t [T], the second inequality holds by the Cauchy-Schwarz inequality, and the last inequality is obtained by Lemma A.3 (Lemma 5 by Takemura et al. (2021)) and Lemma 5.5. 
For the rounds in Ψ, by a similar discussion above, we have k|Ψ| k j Jn t 1( xt(j) 2 e V 1 t 1,n > 1 where the last inequality follows from Lemma A.3. Next, we bound the regret due to (8). Using Lemma 5.4, we have Rt(j) 2βt(δ) xt(j) V 1 t 1 for all j Jn t and t [T]. Thus, by a similar analysis for the C2UCB algorithm, we have the following bound (see Appendix D.7 for details): P t [T ] P i Jn t \ˆIt Rt(i) = e O(min( dn Tνι, dνι Finally, combining the above discussions, we obtain the desired result. 5.2. Performance Guarantee The rest of our proof of Theorem 5.1 is to show that the set {It}t [T ] of the playing arms satisfies our performance constraint (1) with high probability. To show this performance guarantee, we bound the number of arms to be chosen for the exploration. To prove the desired result, we use the following lemma: Lemma 5.7. Let E be the ground set of a matroid, A be the set of the bases of the matroid, and A A be a basis. Let r : E R be a reward function and A be a basis such that A argmax I A P i I r(i). Then, there exists Contextual Conservative Interleaving Bandits a bijection ρ : A A such that r(i) r(ρ(i)) for all i A . In addition, we have ρ(i) = i for all i A A. We fix t [T] arbitrarily. Let ρ t : It Bt be a bijection by Lemma 5.7 with { rt(i)}i [N]. Then, since (6) holds, for any i It, we have µt(i) rt(i) rt(ρ t(i)) µt(ρ t(i)). Hence, it is sufficient to compare It with It. By the construction of It, we always have |It \ It| n. This fact and n m imply that It satisfies the constraint. 5.3. Near-Optimal Regret Bounds for the CCS Problem In this subsection, we briefly introduce our analysis for the C2UCB algorithm and the Thompson sampling. A full proof is provided in Appendix G and Appendix H. Note that we consider the CCS problem and the definition of regret is the same as the regret of the CCIB problem. The key component of our proof is the bijection ρ t : It I t by Lemma 5.7 with the reward estimates. These estimates {r t(i)}i [N] correspond to the upper confidence bounds and the rewards calculated by the sampled parameter in the C2UCB algorithm and the Thompson sampling, respectively. Recall that Rt(i) = µt(ρ (i)) µt(i). Then, we can decompose the regret as follows: i It\Jt Rt(i) + P i Jt Rt(i) , where Jt = {i It | xt(i) V 1 t 1 1/ k}. The former term can be bounded by a variant of the existing analyses because we have r t(i) r t(ρ t (i)) for all i It. The latter term can be bounded by the fact that P t [T ] |Jt| = e O(dk). 6. Lower Bounds In this section, we show the following theorem: Theorem 6.1. Suppose that there exist an algorithm and some m k/(4e + 2) such that the algorithm satisfies the constraint (1) in expectation for any environment. Then, for any sub-optimality gap q d 1 4m T , q some environment such that E [RT ] > (d 1)k (32e+16)m . We defer the omitted proofs in this section to Appendix E.2. Since substituting = Θ( p d/(m T)) leads to a gapindependent bound, we focus on the gap-dependent bound. To prove Theorem 6.1, we consider instances of the CCIB problem constructed by these of the conservative MAB (CMAB) problem. More precisely, we consider k rounds of the CMAB problem as a round of the CCIB problem. Using the standard basis {ei}i [d] as the feature vectors {xt(i)}i [d], the CCIB problem reduces to the d-armed CMAB problem. To represent k rounds of the CMAB problem as a round of the CCIB problem, it is sufficient to prepare dk arms as xt(i + dj) = ei for i [d] and j [k]. 
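To illustrate this reduction concretely, the following is a minimal sketch (assuming NumPy and zero-based arm indices; the function name is illustrative) of the feature construction that packs k rounds of a d-armed CMAB instance into one CCIB round by repeating the standard basis vectors.

```python
import numpy as np

def cmab_to_ccib_features(d, k):
    """Build dk feature vectors so that arm i + d*j has feature e_i for
    i in [d] and j in [k]; one CCIB round (choosing k arms) then simulates
    k rounds of the d-armed CMAB instance, since theta^T e_i = theta_i."""
    basis = np.eye(d)               # e_1, ..., e_d
    return np.tile(basis, (k, 1))   # shape (d*k, d): k copies of the basis

x = cmab_to_ccib_features(d=3, k=2)
print(x.shape)  # (6, 3): arms 0..2 and 3..5 are two copies of the standard basis
```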
We show that an algorithm for the CCIB problem can be applied to the CMAB problem. The algorithm knows the baseline arms of all rounds in advance because the baseline arm of the CMAB problem is fixed over the rounds. The algorithm can choose arbitrary k arms in a round. Then, one can play the chosen arms in any order in the CMAB problem. Finally, the algorithm observes k rewards and proceeds to the next round.

The remainder of our proof of Theorem 6.1 is to obtain a lower bound for the CMAB problem. Using a variant of the technique by Wu et al. (2016) for the lower bound, we obtain the following bound:

Theorem 6.2. Let $R^{\mathrm{CMAB}}_T$ denote the regret of the N-armed CMAB problem with T rounds. Let $i_{\mathrm{base}}$ be the baseline arm and $i_t$ be the arm chosen in round t by an algorithm. Assume that some algorithm and some positive integer k satisfy $\mathbb{E}\left[\sum_{s \in [kt]} \mathbb{1}\left(r(i_{\mathrm{base}}) > r(i_s)\right)\right] \le \alpha k t$ for all positive integers t such that $kt \le T$, for any environment. Then, for any sub-optimality gap $\Delta \in \left[\sqrt{\frac{N-1}{4\alpha T}}, \sqrt{\frac{N-1}{4\alpha k}}\right]$, there is some environment such that $\mathbb{E}\left[R^{\mathrm{CMAB}}_T\right] > \frac{N-1}{(32e+16)\alpha\Delta}$.

Applying Theorem 6.2 with $kT$ rounds and $\alpha = m/k$, we finish the proof of Theorem 6.1.

We now discuss our lower bounds. In contrast to our high-probability upper bound, our lower bounds assume that the algorithm satisfies the performance constraints in expectation. However, these lower bounds cover the proposed algorithm. In fact, the proposed algorithm satisfies the performance constraints in expectation if $n = m/2$ and $\delta = \min(n/k, 1/T)$. Note that the expected regret in this case has the same regret bound as in Theorem 5.1 up to logarithmic factors. Therefore, our lower bounds imply that the proposed algorithm is almost optimal.

7. Numerical Experiments

In this section, we experimentally evaluate the performance of the proposed algorithm compared to some existing algorithms using the Book-Crossing dataset (Ziegler et al., 2005). We describe the details of the experimental setup in Appendix F. We conducted our experiments in two settings. First, we considered the case where the feature vectors, the feasible sets of given arms, and the baseline arms are fixed over the rounds. Second, we considered the case where these three components change two times. Our experiments used the proposed algorithm combined with the C2UCB algorithm (Qin et al., 2014) (GCW + C2UCB) and that combined with the Thompson sampling (Takemura & Ito, 2019) (GCW + TS). We compared the proposed algorithm with the weighted regularized matrix factorization (Hu et al., 2008) as the baseline, the ε-greedy algorithm, the C2UCB algorithm, and the Thompson sampling in both settings.

Figure 1. Average of the per-step regret (i.e., R(t)/t) (top) and average of the smallest m satisfying our performance constraint (1) (bottom). The error bar represents the standard error. Left: the case where $\{x_t(i)\}_{i \in [N]}$, $\mathcal{S}_t$, and $B_t$ are fixed. Right: the case where $\{x_t(i)\}_{i \in [N]}$ and $B_t$ change two times.
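The bottom panels of Figure 1 report the smallest m for which constraint (1) holds in each round. The following is a minimal sketch (not from the paper) of how such a quantity can be computed from the mean rewards of the chosen and baseline arms, assuming the bijection is chosen to minimize the violation count via a greedy matching on sorted rewards.

```python
def smallest_m(chosen_mu, baseline_mu):
    """Smallest m such that some bijection rho: I_t -> B_t has at most m
    indices with chosen_mu[i] < baseline_mu[rho(i)], as in constraint (1)."""
    chosen = sorted(chosen_mu)
    baseline = sorted(baseline_mu)
    safe, j = 0, 0  # greedily match each chosen arm to the smallest
    for mu in chosen:  # unmatched baseline reward it can still cover
        if j < len(baseline) and mu >= baseline[j]:
            safe += 1
            j += 1
    return len(chosen) - safe

print(smallest_m([0.3, 0.2, 0.5], [0.4, 0.6, 0.1]))  # 1
```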
We additionally compared the iUCB algorithm (Katariya et al., 2019) and its variant (the Lin-iUCB algorithm) when they can be applied (i.e., in the first case). The Lin-iUCB algorithm uses the same reward estimates and confidence widths as the proposed algorithm.

Figure 1 (left) shows the experimental results in the first case. The proposed algorithms outperformed the standard bandit algorithms and the iUCB algorithm. Specifically, the proposed algorithms suffered small regret in early rounds and outperformed the baseline because of our interleaving technique. The standard bandit algorithms needed a large m to satisfy the constraints in early rounds, while these algorithms improved faster than the conservative algorithms owing to the exploration without the performance constraints. The regret of the Lin-iUCB algorithm was smaller than that of GCW + TS and comparable to the regret of GCW + C2UCB. However, the Lin-iUCB algorithm was unstable in terms of the performance constraints.

Figure 1 (right) shows that the proposed algorithms outperformed the existing algorithms and the baseline in the second case. In addition, GCW + C2UCB outperformed GCW + TS, as in the first case. Compared to the results of the first case, all algorithms improved more slowly due to the changing environment. We can see from the numerical results on the performance constraints that the quality of recommendations by the bandit algorithms deteriorates temporarily when the environment changes. However, the degradation of the proposed algorithms is smaller than that of the other algorithms. This fact suggests that our algorithms are robust to changes in the given actions.

8. Conclusion

We investigated the CCIB problem, which is the CCS problem with stage-wise performance constraints. We proposed the first meta-algorithm for this problem, which preferentially chooses arms with large confidence widths among the baseline arms and those chosen by a given algorithm for the CCS problem. We showed that the proposed algorithm achieves $\tilde{O}(\min(d\sqrt{kT}, d/\Delta) + dk + \min(k\sqrt{dT/n}, dk/(n\Delta)))$ regret while satisfying the constraints. We also showed gap-dependent and gap-independent lower bounds through a reduction to the conservative MAB problem. These lower bounds imply that the proposed algorithm is minimax optimal up to logarithmic factors. Our numerical experiments demonstrated that the proposed algorithm improves more efficiently than the iUCB algorithm and outperforms existing algorithms.

There are several open questions about the CCIB problem and its extensions. First, an efficient algorithm for the learner to obtain a bijection that satisfies (3) has not been obtained. Second, it is open whether a similar regret bound can be obtained for the CCIB problem with feasible arms characterized by an arbitrary matroid. Note that we cannot use the exchange property (3) in this case. Finally, this paper does not consider the problem where the reward of the learner is a non-linear function of the rewards of the arms chosen by the learner.

Acknowledgements

Kei Takemura would like to thank Shinji Ito and Tatsuya Matsuoka for helpful discussions on matroids.

References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, volume 24, pp. 2312-2320, 2011.

Abe, N. and Long, P. M. Associative reinforcement learning using linear probabilistic concepts.
In Proceedings of the Sixteenth International Conference on Machine Learning, pp. 3 11, 1999. Abeille, M. and Lazaric, A. Linear thompson sampling revisited. Electronic Journal of Statistics, 11(2):5165 5197, 2017. Agrawal, S. and Goyal, N. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the Thirtieth International Conference on International Conference on Machine Learning, pp. 127 135. PMLR, 2013. Auer, P. Using confidence bounds for exploitationexploration trade-offs. Journal of Machine Learning Research, 3(Nov):397 422, 2002. Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208 214, 2011. Gai, Y., Krishnamachari, B., and Jain, R. Combinatorial network optimization with unknown variables: Multiarmed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, 20(5): 1466 1478, 2012. Garcelon, E., Ghavamzadeh, M., Lazaric, A., and Pirotta, M. Improved algorithms for conservative exploration in bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 3962 3969, 2020. Hu, Y., Koren, Y., and Volinsky, C. Collaborative filtering for implicit feedback datasets. In 2008 IEEE International Conference on Data Mining, pp. 263 272, 2008. Katariya, S., Kveton, B., Wen, Z., and Potluru, V. K. Conservative exploration using interleaving. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, pp. 954 963, 2019. Kaufmann, E., Capp e, O., and Garivier, A. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17 (1):1 42, 2016. Kazerouni, A., Ghavamzadeh, M., Abbasi-Yadkori, Y., and Van Roy, B. Conservative contextual linear bandits. In Advances in Neural Information Processing Systems, volume 30, pp. 3910 3919, 2017. Khezeli, K. and Bitar, E. Safe linear stochastic bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 10202 10209, 2020. Korte, B. and Vygen, J. Combinatorial Optimization: Theory and Algorithms. Springer, 6th edition edition, 2018. Kveton, B., Wen, Z., Ashkan, A., Eydgahi, H., and Eriksson, B. Matroid bandits: Fast combinatorial optimization with learning. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pp. 420 429, 2014. Kveton, B., Wen, Z., Ashkan, A., and Szepesv ari, C. Tight regret bounds for stochastic combinatorial semi-bandits. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pp. 535 543, 2015. Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661 670, 2010. Moradipari, A., Thrampoulidis, C., and Alizadeh, M. Stagewise conservative linear bandits. In Advances in Neural Information Processing Systems, volume 33, pp. 11191 11201, 2020. Qin, L., Chen, S., and Zhu, X. Contextual combinatorial bandit and its application on diversified online recommendation. In Proceedings of the 2014 SIAM International Conference on Data Mining, pp. 461 469, 2014. Takemura, K. and Ito, S. An arm-wise randomization approach to combinatorial linear semi-bandits. In 2019 IEEE International Conference on Data Mining, pp. 1318 1323, 2019. 
Takemura, K., Ito, S., Hatano, D., Sumita, H., Fukunaga, T., Kakimura, N., and Kawarabayashi, K.-i. Near-optimal regret bounds for contextual combinatorial semi-bandits with linear payoff functions. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 9791 9798, 2021. Wang, S. and Chen, W. Thompson sampling for combinatorial semi-bandits. In International Conference on Machine Learning, pp. 5114 5122. PMLR, 2018. Wang, Y., Ouyang, H., Wang, C., Chen, J., Asamov, T., and Chang, Y. Efficient ordered combinatorial semi-bandits for whole-page recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, pp. 2746 2753, 2017. Contextual Conservative Interleaving Bandits Wu, Y., Shariff, R., Lattimore, T., and Szepesv ari, C. Conservative bandits. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pp. 1254 1262. PMLR, 2016. Ziegler, C.-N., Mc Nee, S. M., Konstan, J. A., and Lausen, G. Improving recommendation lists through topic diversification. In Proceedings of the 14th international conference on World Wide Web, pp. 22 32, 2005. Contextual Conservative Interleaving Bandits A. Known Results Our proofs use the following known results: Theorem A.1 (Theorem 13.9 by Korte & Vygen (2018)). Let E be a finite set and B 2E. B is the set of bases of some matroid (E, F) if and only if the following holds: (B2) For any B1, B2 B and x B1 \ B2, there exists a y B2 \ B1 with (B1 \ {x}) {y} B. Theorem A.2 (Theorem 2 by Abbasi-Yadkori et al. (2011)). Let {Ft} t=0 be a filtration, {Xt} t=1 be an Rd-valued stochastic process such that Xt is Ft 1-measurable, {ηt} t=1 be a real-valued stochastic process such that ηt is Ft-measurable. Let V = λI be a positive definite matrix, Vt = V + P s [t] Xs X s , Yt = P s [t] θ Xs + ηs and ˆθt = V 1 t 1Yt. Assume for all t that ηt is conditionally R-sub-Gaussian for some R > 0 and θ 2 S. Then, for any δ > 0, with probability at least 1 δ, for any t 1, ˆθt θ Vt 1 R 2 log det(Vt 1)1/2 det(λI) 1/2 Furthermore, if Xt 2 L for all t 1, then with probability at least 1 δ, for all t 1, ˆθt θ Vt 1 R d log 1 + (t 1)L2/λ Lemma A.3 (Lemma 5 by Takemura et al. (2021)). Let {{xt(i)}i [k]}t [T ] be any sequence such that xt(i) Rd and xt(i) 2 L for all i [k] and t [T]. Let Vt = λI + P i [k] xs(i)xs(i) with λ > 0. Then, we have i [k] min 1 k , xt(i) 2 V 1 t 1 2d log(1 + L2k T/(dλ)). Lemma A.4 (Lemma 6 by Takemura et al. (2021)). Let {{xt(i)}i [k]}t [T ] be any sequence such that xt(i) Rd and xt(i) 2 L for all i [k] and t [T]. Let Vt = λI + P s [t] P i [k] xs(i)xs(i) with λ > 0. Then, we have X i [k] 1 xt(i) V 1 t 1 > 1/ k 2dk log(1 + L2k T/(dλ)). B. Discussions of Performance Constraints B.1. Comparison of Constraint (1) with Constraint (2) We discuss the fact that the two types of constraints are incomparable. More precisely, we show that each constraint does not include the other constraint. First, we show that constraint (1) does not include constraint (2). Lemma B.1. Assume that m 1. Then, there exists a set of rewards such that the rewards satisfy the constraint (1) while they do not satisfy the constraint (2) for any α [0, 1). Proof. Suppose that µt(i) = 0 for i It. Suppose also that µt(i) = 1 for some i Bt and µt(i) = 0 otherwise. Then, for any bijection ρt : It Bt, the constraint (1) is satisfied with m = 1. However, since P i It µt(i) = 0 and P i Bt µt(i) = 1, the constraint (2) is not satisfied for any α [0, 1). Next, we show the inverse direction. Lemma B.2. Assume that α (0, 1]. 
Then, there exists a set of rewards such that the rewards satisfy the constraint (2) while they do not satisfy the constraint (1) for any m < k. Proof. We fix α (0, 1] arbitrarily. Suppose that µt(i) = 1 α for i It and µt(b) = 1 for b Bt. These rewards satisfy the constraint (2) with α. Since any reward of arm b Bt is strictly larger than any reward of arm i It, we have P i It 1(µt(i) µt(ρt(i))) = 0 for any bijection ρt : It Bt. Contextual Conservative Interleaving Bandits B.2. Reduction from Constraint (2) to Constraint (1) We introduce a reduction from constraint (2) to constraint (1). Note that while Assumption 3.10 requires rℓ> 0 due to Assumption 3.9, most of existing studies of conservative bandit problems also require this condition (Wu et al., 2016; Kazerouni et al., 2017; Garcelon et al., 2020; Khezeli & Bitar, 2020; Moradipari et al., 2020). Lemma B.3. Under Assumption 3.10, satisfying constraint (1) implies satisfying constraint (2). Proof. Suppose that the sequence {It}t [T ] satisfies the constraints (1). Recall that ρt : It Bt is a bijection. Let I t = {i It | µt(i) µt(ρt(i))} Note that since {It}t [T ] satisfies the constraints (1), |I t| k m. Rewriting the performance constraints (2), we have i I t (µt(i) (1 α)µt(ρt(i))) X i It\I t ((1 α)µt(ρt(i)) µt(i)). Recall that rℓ= mint [T ] mini Bt µt(i). From the definition of I t and rℓ, and the fact that µt(i) [0, 1] for all i [N], we have X i I t (µt(i) (1 α)µt(ρt(i))) α X i I t µt(ρ(i)) α(k m)rℓ and i It\I t ((1 α)µt(ρt(i)) µt(i)) m. Thus, a sufficient condition of the constraints (2) is (k m)αrℓ m, which is Assumption 3.10. C. Properties of Matroid C.1. Strongly Base-Orderable Matroid First, we introduce the definition of strongly base-orderable matroid and exchangeable set. Definition C.1. A matroid is strongly base-orderable if for any two bases B1 and B2, there is a bijection f : B1 B2 with the property that (B1 \ X) f(X) is a base for any X B1. Then, we show an equivalence between exchangeable set and strongly base-orderable matroid. Lemma C.2. A set S is exchangeable if and only if the set S is the bases of a strongly base-orderable matroid. Proof. If the set S is the bases of a strongly base-orderable matroid, we have S is exchangeable by Definition C.1. Suppose that S is exchangeable. We fix B1, B2 S arbitrarily. Let ρ : B1 B2 be the bijection defined in (3). Since S is exchangeable, we have (B1 \ {i}) {ρ(i)} S for any i B1 \ B2. Thus, from Theorem A.1, there exists a matroid such that S is the set of the bases. By Definition 3.6 and Definition C.1, we have the desired result. C.2. Bijection between Two Bases Proof of Lemma 5.7. Let (E, F) be the matroid and k be the rank of the matroid. Let a 1, . . . , a k be a sequence such that {a i }i [k] = A and r(a i ) r(a j) for i j. We construct the bijection by induction. We fix ℓ [k] arbitrarily. Suppose that we have already defined ρ(a i ) for all i < ℓ. Let A ℓ= {a i }i=ℓ,...,k and Aℓ= A \ {ρ(a i )}i [ℓ 1]. If we have a ℓ Aℓ, we define ρ(a ℓ) = a ℓ. Thus, we consider the other case, i.e., a ℓ/ Aℓ. From the augmentation property of matroid, there exists a Aℓ\ A ℓ+1 such that A ℓ+1 {a} F. Since A argmax I A P i I r(i), A ℓcan be regarded as the output until the k ℓ+ 1-th step of the greedy algorithm. Therefore, we have r(a ℓ) r(a) and can define ρ(a ℓ) = a. Note that the construction in our proof of the following lemma is essentially the same to the proof of Lemma 1 of Kveton et al. (2014). Contextual Conservative Interleaving Bandits D. Missing Proofs in Section 5 D.1. 
High-Probability Event For δ (0, 1), we define βt(δ) = min R p 2 log (N(πkt)2/(3δ)), R p d log ((1 + L2kt/λ)/δ) + M for all t [T]. Proof of Lemma 5.3. We consider two cases: when log(N) < d and when log(N) d. First, we consider when log(N) < d. We fix t [T] and i [N] arbitrarily. From the definition of ˆθt, we have (ˆθt θ) xt(i) = i Is (θ xs(i) + ηs(i))xs(i) θ i Is xt(i) V 1 t 1xs(i)ηs(i) λxt(i) V 1 t 1θ. For the latter term, we have λxt(i) V 1 t 1θ λ θ V 1 t 1 xt(i) V 1 t 1 λM xt(i) V 1 t 1. We then bound the former term. Let α = R p 2 log(2/δ ) for δ > 0. Using (a variant of) Azuma s inequality, we obtain i Is xt(i) V 1 t 1xs(i)ηs(i) > α xt(i) V 1 t 1 α2 xt(i) 2 V 1 t 1 2R2 P i Is(xt(i) V 1 t 1xs(i))2 α2 xt(i) 2 V 1 t 1 2R2xt(i) V 1 t 1(P i Is xs(i)xs(i) )V 1 t 1xt(i) 2 exp( α2/(2R2)) δ . Hence, we obtain |(ˆθt θ) xt(i)| R p 2 log(N(πkt)2/(3δ)) + M λ xt(i) V 1 t 1 (9) with probability at least 1 6δ/(N(πkt)2). Taking union bound over the arms in all rounds, we have (9) with probability at least 1 δ. Next, we consider when log(N) d. By the Cauchy-Schwarz inequality, we have |µt(i) ˆθ t xt(i)| θ ˆθt Vt 1 xt(i) V 1 t 1 for all t [T] and i [N]. Then, using Theorem A.2, we have θ ˆθt Vt 1 R r d log 1+L2kt/λ t [T] with probability at least 1 δ, which completes the proof. D.2. Analysis for the Recommended Choices Proof of Lemma 5.2. We fix t [T] arbitrarily. Let It = It ˆIt and I t = {ρ t (i) | i It}. Let ([N], Ft) be the matroid whose bases are St. Let Et = It I t and kt = | It|. Then, we can construct a matroid (Et, F t) such that F t Ft and the size of bases of (Et, F t) is kt. Let S t be the bases of (Et, F t). Then, we have X i It (µt(ρ t (i)) µt(i)) max I S t i I µt(i) X i It µt(i). (10) Contextual Conservative Interleaving Bandits Recall that {r t(i)}i [N] is the estimation of the rewards by the algorithm A. Since ˆIt argmax I St P i I r t(i), we have It argmax I S t P i I r t(i). Thus, (10) is the regret of the CCS problem in which the feasible set is S t. Since the algorithm A uses only {rt(i)}i It as feedback in the proposed algorithm, we can bound (10) by the regret upper bound of the algorithm A. D.3. Property of the Bijection Satisfying (3) Lemma D.1. For any t [T] and I, I St such that I argmax J St P i J r(i) for some r : [N] R, we have r(i) r(ρ(i)) for any i I, where ρ : I I is the bijection satisfying (3). Proof of Lemma D.1. We fix t [T] and i It arbitrarily. Let ([N], F) be the matroid that corresponds to the constraint St. Let I(i) be the set of arms just before the greedy algorithm chooses i. By the definition of ρ, we have (I \{i}) {ρ(i)} St. From the property of matroid, we have I(i) {i} F and I(i) {ρ(i)} F, which finishes the proof. D.4. Properties of Confidence Intervals To prove Lemma 5.6, we need the following lemma: Lemma D.2. For any i ˆIt, if ct(i) < /2, we have µt(i) = µt(ˆρ t (i)). Proof. Recall that ct(i) = βt(δ) xt(i) V 1 t 1. From Lemma D.1, we have ˆrt(i) ˆrt(ˆρ t (i)) for all i ˆIt. Since (6) holds for all t [T] and i [N], we have µt(ˆρ t (i)) ˆrt(ˆρ t (i)) ˆrt(i) µt(i) + 2ct(i) for all i ˆIt. From the assumption of this lemma, we have µt(ˆρ t (i)) µt(i) 2ct(i) < . Thus, from the definition of , we obtain µt(i) = µt(ˆρ t (i)). We are ready to prove Lemma 5.6. Proof of Lemma 5.6. We fix i It arbitrarily. For all j ˆIt, using Lemma D.2, we have µt(j) = µt(ˆρ t (j)). Thus, we have µt(i) = µt( ρt(i)) or µt(i) µt( ρt(i)) . 
If µt(i) µt( ρt(i)) , we have rt( ρt(i)) rt(i) = ˇrt( ρt(i)) ˆrt(i) (µt( ρt(i)) 2ct( ρt(i))) (µt(i) + 2ct(i)) 2(ct(i) + ct( ρt(i))) which contradicts the property of ρt obtained by Lemma D.1. D.5. Number of Rounds in Which Conservative Choices Suffer the Regret Proof of Lemma 5.5. We fix ℓ [n] arbitrarily. To bound the number of rounds such that ct(jℓ t) /4, we consider a sufficient condition of rounds such that ct(jℓ t) < /4. Since βT (δ) βt(δ) for all t [T], a sufficient condition of ct(jℓ t) < /4 is that xt(jℓ t) V 1 t 1 < /(4βT (δ)) = γℓ/ Recall that Jℓ t is defined in line 16 of Algorithm 1. Let e Vt denote λI + P j Jℓ s xs(j)xs(j) . By the definition of Vt and e Vt, we have Vt e Vt for all t [T] {0}. Thus, we obtain xt(i) V 1 t 1 xt(i) e V 1 t 1 (12) Contextual Conservative Interleaving Bandits for all t [T] and i [N]. Combining (11) and (12), we obtain that a sufficient condition of rounds with ct(jℓ t) < /4 is that xt(jℓ t) e V 1 t 1 < γℓ/ ℓ. This implies that a necessary condition of rounds with ct(jℓ t) /4 is that xt(jℓ t) e V 1 t 1 eγℓ, where eγℓ= min(γℓ, 1)/ ℓ. Let T be the number of rounds such that ct(jℓ t) /4. Then, we obtain eγ2 ℓℓT eγ2 ℓ X 1 xt(j) e V 1 t 1 eγℓ X min eγ2 ℓ, xt(j) 2 e V 1 t 1 ℓ, xt(j) 2 e V 1 t 1 Applying Lemma A.3 to the last term, we finish the proof. D.6. Regret of an Arm in Conservative Choices Proof of Lemma 5.4. We fix i It \ ˆIt arbitrarily. Recall that ρt : It ˆIt be the bijection defined in (3). Let i = ρt(i). Since (6) holds, we have µt(i) ˆrt(i) 2ct(i). Moreover, since i Bt and i ˆIt \ Bt, we have ˆrt(i) ˇrt(i ) using Lemma D.1. Thus, we obtain µt(i) ˆrt(i) 2ct(i) ˇrt(i ) 2ct(i). (13) Recall that ˆρ t : ˆIt I t is the bijection defined in (3). From Lemma D.1, we have ˆrt(i ) ˆrt(ˆρ t (i )). Thus, we have µt(ˆρ t (i )) ˆrt(ˆρ t (i )) ˆrt(i ) = ˇrt(i ) + 2ct(i ), (14) where the first inequality is derived from (6). Recall that ρ t = ˆρ t ρt for i It \ ˆIt. Thus, we have ˆρ t (i ) = ρ t (i). Combining (13) and (14), we finish the proof. D.7. Regret Bound for (8) We can bound the regret as follows: j Jn t Rt(j) = X j Jn t Rt(j) Jn t = {j Jn t | xt(j) 2 e V 1 t 1,n 1/n} and Jn t = {j Jn t | xt(j) 2 e V 1 t 1,n > 1/n} for all t [T]. Recall that e Vt,n = λI + P j Jn t xs(j)xs(j) . Using Lemma 5.4 and the fact that Vt e Vt,n for all t [T], we have 2βt(δ) xt(j) e V 1 t 1,n and Rt(j)2/ 2 X xt(j) 2 e V 1 t 1,n. Thus, we finish the proof by following the same line of our proof of Theorem G.1. E. Lower Bounds of Regret E.1. Lower Bound when m = 0 Lemma E.1. If m = 0, for any algorithm, there is an instance of the CCIB problem such that the algorithm does not achieve sub-linear regret while satisfying the performance constraint (1). Contextual Conservative Interleaving Bandits Proof of Lemma E.1. We consider two instances of the CCIB problem and show that no algorithm can obtain sub-linear regret while satisfying the constraint for the two instances simultaneously. Let N = 2, d = 2, k = 1 and m = 0. Let xt(1) = (1, 0) and xt(2) = (0, 1) for all t [T], i.e., two-armed bandit problem with the stage-wise performance constraints. Suppose that rt(1) = 0.5 and Bt = {1} for all t [T]. We consider two types of rewards for this setting: One is rt(2) = 0 and the other is rt(2) = 1. In the former case, the baseline is optimal. Thus, to satisfy the constraints, an algorithm must choose the baseline (i.e., the first arm) in all rounds. In the latter case, however, such algorithm does not choose the second arm and then leads to a linear regret. E.2. 
E.2. Lower Bound when m > 0

First, we show a gap-independent bound.

Theorem E.2. Consider the stochastic conservative MAB problem with $N$ arms and sub-Gaussian noises. Let $R^{\mathrm{CMAB}}_T$ denote the regret over $T$ rounds. Let $i_{\mathrm{base}}$ be the baseline arm and $i_t$ be the arm chosen in round $t$. Suppose that there exist an algorithm and some $\alpha \in \big(0, \frac{1}{4e+2}\big]$ such that the algorithm satisfies $\mathbb{E}\big[ \sum_{t \in [T]} \mathbb{1}\big[ r(i_{\mathrm{base}}) > r(i_t) \big] \big] \le \alpha T$ for any environment, where $r(i)$ is the mean reward of arm $i$. Then, there is some environment such that
\[
\mathbb{E}\big[ R^{\mathrm{CMAB}}_T \big] > \sqrt{\frac{(N-1)T}{(8e+4)^2 \alpha}} .
\]

Proof of Theorem E.2. Our proof largely follows the proof of Theorem 9 of Wu et al. (2016). We prove this theorem by contradiction. Thus, we assume that $\mathbb{E}[R^{\mathrm{CMAB}}_T] \le \sqrt{(N-1)T/((8e+4)^2\alpha)}$ for any environment. We define two types of environments, i.e., reward distributions of the arms and the baseline arm. The first is defined as
\[
\mu_i = \begin{cases} \mathcal{N}(\Delta, 1) & \text{if } i = 1 \\ \mathcal{N}(0, 1) & \text{otherwise} \end{cases},
\]
where $\Delta > 0$ is specified later. The other type of environment, one for each $j \in [N] \setminus \{1\}$, is defined as
\[
\mu^{(j)}_i = \begin{cases} \mathcal{N}(\Delta, 1) & \text{if } i = 1 \\ \mathcal{N}(2\Delta, 1) & \text{if } i = j \\ \mathcal{N}(0, 1) & \text{otherwise} \end{cases}.
\]
Note that these environments have the same sub-optimality gap. Suppose that the baseline arm is the first arm in all environments. Let $P_\mu(\cdot)$ and $\mathbb{E}_\mu[\cdot]$ denote the probability and the expectation under the reward distributions $\mu$, respectively. Let $T_i$ denote the number of times arm $i$ is chosen. Let $A_i = \{T_i \le 2\alpha T\}$ and $\Delta = \sqrt{(N-1)/(4\alpha T)}$.

First, we show $P_\mu(A_i) \ge \frac{1}{2}$ for all $i \in [N] \setminus \{1\}$:
\[
P_\mu(A_i) = 1 - P_\mu(T_i > 2\alpha T) \ge 1 - \frac{\mathbb{E}_\mu[T_i]}{2\alpha T} \ge \frac{1}{2},
\]
where the first inequality is derived from Markov's inequality and the second inequality follows from the assumption on the algorithm.

Next, we show $P_{\mu^{(i)}}(A_i) \le \frac{1}{4e}$ for all $i \in [N] \setminus \{1\}$:
\[
P_{\mu^{(i)}}(A_i) = P_{\mu^{(i)}}(T - T_i \ge T - 2\alpha T) \le \frac{\mathbb{E}_{\mu^{(i)}}[T - T_i]}{T - 2\alpha T} \le \frac{\mathbb{E}_{\mu^{(i)}}[R^{\mathrm{CMAB}}_T]}{\Delta (T - 2\alpha T)} \le \frac{1}{(4e+2) - (8e+4)\alpha} \le \frac{1}{4e},
\]
where the first inequality is derived from Markov's inequality, the other inequalities hold due to the definition of the regret and the assumptions, and the last inequality uses $\alpha \le 1/(4e+2)$.

Then, using Lemma 1 of Kaufmann et al. (2016), we have
\[
\mathbb{E}_\mu[T_i]\, \mathrm{KL}\big(\mu_i, \mu^{(i)}_i\big) \ge d\big(P_\mu(A_i), P_{\mu^{(i)}}(A_i)\big) \ge \frac{1}{2} \log \frac{1}{4 \cdot (1/(4e))} = \frac{1}{2},
\]
where $d(x, y) = x \log\frac{x}{y} + (1 - x)\log\frac{1-x}{1-y}$ and $\mathrm{KL}(P, Q)$ is the KL-divergence of $P$ and $Q$. By the definition of $\mu$ and $\mu^{(i)}$, we have $\mathrm{KL}(\mu_i, \mu^{(i)}_i) = 2\Delta^2$. Therefore, we obtain $\mathbb{E}_\mu[T_i] > \frac{1}{4\Delta^2}$ for all $i \in [N] \setminus \{1\}$. Using this fact, we have
\[
\sum_{i \in [N] \setminus \{1\}} \mathbb{E}_\mu[T_i] > \frac{N-1}{4\Delta^2} = \alpha T .
\]
Since, in the environment $\mu$, every round in which an arm other than the baseline is chosen violates the constraint, this contradicts the assumption on the algorithm.

Next, we show a gap-dependent bound.

Proof of Theorem 6.1. We prove this theorem along the same lines as our proof of Theorem E.2, so we only discuss the differences from that proof. We assume that $\mathbb{E}[R^{\mathrm{CMAB}}_T] \le \frac{N-1}{(32e+16)\alpha\Delta}$ for any environment. Let $s$ be the largest integer such that $\Delta \le \sqrt{(N-1)/(4\alpha k s)}$. Note that $\mathbb{E}[R^{\mathrm{CMAB}}_{ks}] \le \frac{N-1}{(32e+16)\alpha\Delta}$. Let $A_i = \{T_i(ks) \le 2\alpha k s\}$, where $T_i(ks)$ is the number of times arm $i$ is chosen up to round $ks$. We consider the same environments as in the proof of Theorem E.2. Then, one can obtain $P_\mu(A_i) \ge \frac{1}{2}$. Furthermore, since $\Delta > \sqrt{(N-1)/(4\alpha k (s+1))}$, we have
\[
P_{\mu^{(i)}}(A_i) = P_{\mu^{(i)}}(ks - T_i(ks) \ge ks - 2\alpha ks) \le \frac{\mathbb{E}_{\mu^{(i)}}[ks - T_i(ks)]}{ks - 2\alpha ks} \le \frac{\mathbb{E}_{\mu^{(i)}}[R^{\mathrm{CMAB}}_{ks}]}{\Delta (ks - 2\alpha ks)} \le \frac{N-1}{(32e+16)\alpha \Delta^2 (ks - 2\alpha ks)} \le \frac{s+1}{(8e+4)(s - 2\alpha s)} \le \frac{2s}{(8e+4)(s - 2\alpha s)} = \frac{1}{(4e+2) - (8e+4)\alpha},
\]
where the first inequality is derived from Markov's inequality and the other inequalities hold due to the definition of the regret and the assumptions. Therefore, we obtain $\mathbb{E}_\mu[T_i(ks)] > \frac{1}{4\Delta^2}$ for all $i \in [N] \setminus \{1\}$. Finally, we obtain
\[
\mathbb{E}_\mu\big[ R^{\mathrm{CMAB}}_{ks} \big] > \frac{N-1}{4\Delta},
\]
which contradicts the assumption of the algorithm.
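For completeness, the value of the KL divergence used in the two proofs above can be checked by a direct calculation; the following display is a standard computation for the reward distributions of an arm $i \ne 1$ under $\mu$ and $\mu^{(i)}$ and is not part of the original proofs.
\[
\mathrm{KL}\big(\mathcal{N}(0,1), \mathcal{N}(2\Delta,1)\big)
= \mathbb{E}_{x \sim \mathcal{N}(0,1)}\!\left[ \log \frac{\exp(-x^2/2)}{\exp(-(x-2\Delta)^2/2)} \right]
= \mathbb{E}_{x \sim \mathcal{N}(0,1)}\!\left[ \frac{(x-2\Delta)^2 - x^2}{2} \right]
= \frac{4\Delta^2 - 4\Delta\,\mathbb{E}[x]}{2}
= 2\Delta^2 .
\]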
F. Experiments

F.1. Environments

In each setting, we set $d = 20$, $T = 1000$, $k = 30$, and $n = 10$. Moreover, any set of $k$ arms can be chosen in each round, i.e., $S_t = \{I \subseteq [N] \mid |I| = k\}$ for all $t \in [T]$.

Table 2. Algorithms in Numerical Experiments
  ε-greedy: ε = 0.05 and λ = 1.
  C2UCB: λ = 1 and δ = 0.05.
  Thompson sampling: λ = 1 and δ = 0.05.
  iUCB: α = n/k = 1/3.
  Lin-iUCB: λ = 1, δ = 0.05, and α = n/k = 1/3.
  Proposed: λ = 1, δ = 0.05, and n = 10.

Algorithm 2 C2UCB
Input: $\lambda > 0$ and $\{\alpha_t\}_{t \in [T]}$ s.t. $\alpha_t > 0$ for all $t \in [T]$.
1: $V_0 \leftarrow \lambda I$ and $b_0 \leftarrow 0$.
2: for $t = 1, 2, \ldots, T$ do
3:   Observe $\{x_t(i)\}_{i \in [N]}$ and $S_t$.
4:   $\hat{\theta}_t \leftarrow V_{t-1}^{-1} b_{t-1}$.
5:   for $i \in [N]$ do
6:     $\hat{r}_t(i) \leftarrow \hat{\theta}_t^\top x_t(i) + \alpha_t \sqrt{x_t(i)^\top V_{t-1}^{-1} x_t(i)}$.
7:   end for
8:   Play a set of arms $I_t \in \arg\max_{I \in S_t} \sum_{i \in I} \hat{r}_t(i)$ and observe rewards $\{r_t(i)\}_{i \in I_t}$.
9:   $V_t \leftarrow V_{t-1} + \sum_{i \in I_t} x_t(i) x_t(i)^\top$ and $b_t \leftarrow b_{t-1} + \sum_{i \in I_t} r_t(i) x_t(i)$.
10: end for

We use the Book-Crossing dataset (Ziegler et al., 2005), which consists of ratings expressed on a scale from 0 to 10 for books. We extracted users who rated at least 100 items and items rated by at least 3 of the extracted users, which yields 37,157 ratings of 7,068 items by 490 users. We consider the cold-start problem as in Garcelon et al. (2020): an algorithm aims to learn the preferences of a new user. We obtained the item vectors from the ratings by weighted regularized matrix factorization (Hu et al., 2008) and used the item vectors as the feature vectors. The baseline is the recommendation by the matrix factorization. A user is randomly selected at the beginning of each simulation. The mean rewards are the selected user's ratings normalized to $[0, 1]$. The reward noises follow $\mathcal{N}(0, 0.1^2)$. We ran 100 simulations.

F.2. Algorithms

We compare the proposed algorithm with the baseline and with the algorithms whose parameters are listed in Table 2. The ε-greedy algorithm has two ways to estimate the rewards of the given arms: one uses random values, and the other uses ridge regression, where λ is the regularization parameter. The algorithm chooses the former with probability ε and the latter otherwise. Then, it plays a feasible set of arms that maximizes the sum of the estimated rewards. We set α of the iUCB algorithms such that the iUCB and the proposed algorithms have the same condition for exploration, i.e., α = n/k (strictly speaking, we implemented the iUCB2 algorithm, which does not need to know the rewards of the baseline arms in advance). Note that α for the iUCB algorithm and α in (2) are different parameters.

G. C2UCB Algorithm for Matroid Constraint

We consider the CCS problem defined in Takemura et al. (2021). Note that this problem assumes that $|\mu_t(i)| \le B$ for some parameter $B$ but does not assume that $\mu_t(i) \in [0, 1]$. Except for this assumption, the CCS problem coincides with the CCIB problem without the performance constraints. We show that the C2UCB algorithm (Algorithm 2) is minimax optimal up to logarithmic factors for the CCS problem with any matroid constraint. We emphasize that our theorem is a generalization of Theorem 3 by Takemura et al. (2021) and that our proof is much simpler than the proof of that theorem. While the CCS problem assumes that the number $k$ of arms chosen in a round is fixed over the rounds, the C2UCB algorithm works and achieves the same regret bound if the number of arms chosen in a round is bounded from above by $k$.
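For concreteness, the following is a minimal Python sketch of Algorithm 2 under the uniform-matroid feasible sets $S_t = \{I \subseteq [N] \mid |I| = k\}$ used in our experiments, for which the greedy maximization in line 8 reduces to a top-$k$ selection; the function and argument names are illustrative only, and this is a sketch rather than the implementation used in the experiments.

import numpy as np

def c2ucb(contexts, rewards, k, alphas, lam=1.0):
    # Sketch of Algorithm 2 (C2UCB) for S_t = {I : |I| = k}.
    # contexts: list of (N, d) arrays x_t(i); rewards: list of (N,) arrays r_t(i)
    # (only the chosen entries are used as feedback); alphas: exploration weights alpha_t.
    d = contexts[0].shape[1]
    V = lam * np.eye(d)                      # V_0 = lambda * I
    b = np.zeros(d)                          # b_0 = 0
    chosen = []
    for t, (X, r) in enumerate(zip(contexts, rewards)):
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b                # ridge estimate (line 4)
        widths = np.sqrt(np.einsum('nd,de,ne->n', X, V_inv, X))  # ||x_t(i)||_{V_{t-1}^{-1}}
        ucb = X @ theta_hat + alphas[t] * widths                  # \hat{r}_t(i) (line 6)
        I_t = np.argsort(ucb)[-k:]           # greedy = top-k on the uniform matroid (line 8)
        V += X[I_t].T @ X[I_t]               # V_t update (line 9)
        b += X[I_t].T @ r[I_t]               # b_t update (line 9)
        chosen.append(I_t)
    return chosen

For a general matroid constraint, the top-$k$ step would be replaced by the greedy algorithm over the corresponding independence oracle.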
Let $\rho^*_t : I_t \to I^*_t$ be the bijection defined by Lemma 5.7 with $\{\hat{r}_t(i)\}_{i \in [N]}$. Then, we define our sub-optimality gap as follows:
\[
\Delta_t(i, j) = \mu_t(j) - \mu_t(i) \quad \text{and} \quad \Delta = \min_{t \in [T]} \; \min_{i \in I_t :\, \Delta_t(i, \rho^*_t(i)) > 0} \Delta_t(i, \rho^*_t(i)).
\]
Using this sub-optimality gap, we have the following regret bound:

Theorem G.1 (A generalization of Theorem 3 by Takemura et al. (2021)). Let $\kappa = \min(\log(N), d)$, $\nu = \log(kT)$, and $\iota = \min(\log(NkT/\delta), d\log(kT/\delta))$. Assume that $S_t$ is a set of bases of a matroid for all $t \in [T]$. Then, if $\alpha_t = \beta_t(\delta)$ and $\lambda = (R/M)^2 \kappa$, the C2UCB algorithm has the following regret bound with probability $1 - \delta$:
\[
R(T) = O\Big( \min\big( R\sqrt{dkT\nu\iota},\; R^2 d\nu\iota/\Delta \big) + Bdk\nu \Big).
\]

Proof. Let $J_t = \{i \in I_t \mid \|x_t(i)\|_{V_{t-1}^{-1}} > 1/\sqrt{k}\}$ and $J^*_t = \{\rho^*_t(i) \mid i \in J_t\}$. Using these notations, the regret is rewritten as
\[
R(T) = \sum_{t \in [T]} \Big( \sum_{i \in I^*_t \setminus J^*_t} \mu_t(i) - \sum_{i \in I_t \setminus J_t} \mu_t(i) \Big) + \sum_{t \in [T]} \Big( \sum_{i \in J^*_t} \mu_t(i) - \sum_{i \in J_t} \mu_t(i) \Big).
\]
We bound these terms separately. First, we consider the former term. By the definition of the bijection $\rho^*_t$, we have $\hat{r}_t(i) \ge \hat{r}_t(\rho^*_t(i))$ for all $i \in I_t$. Using Lemma 5.3, we have $\hat{r}_t(i) \ge \mu_t(\rho^*_t(i))$ and $\hat{r}_t(i) \le \mu_t(i) + 2c_t(i)$ for all $i \in I_t$, where $c_t(i) = \beta_t(\delta) \|x_t(i)\|_{V_{t-1}^{-1}}$. Therefore, we obtain
\[
\sum_{i \in I^*_t \setminus J^*_t} \mu_t(i) - \sum_{i \in I_t \setminus J_t} \mu_t(i) \le \sum_{i \in I_t \setminus J_t} 2c_t(i)
\]
and
\[
\sum_{i \in I^*_t \setminus J^*_t} \mu_t(i) - \sum_{i \in I_t \setminus J_t} \mu_t(i) \le \sum_{i \in I_t \setminus J_t} \frac{\big(\mu_t(\rho^*_t(i)) - \mu_t(i)\big)^2}{\Delta} \le \sum_{i \in I_t \setminus J_t} \frac{4c_t(i)^2}{\Delta}.
\]
Since $\beta_T(\delta) \ge \beta_t(\delta)$ for all $t \in [T]$, using Lemma A.3, we have
\[
\sum_{t \in [T]} \sum_{i \in I_t \setminus J_t} c_t(i) \le \beta_T(\delta) \sum_{t \in [T]} \sum_{i \in I_t \setminus J_t} \|x_t(i)\|_{V_{t-1}^{-1}} \le \beta_T(\delta) \sqrt{kT \sum_{t \in [T]} \sum_{i \in I_t \setminus J_t} \|x_t(i)\|_{V_{t-1}^{-1}}^2} = O\big( R\sqrt{dkT\iota} \big)
\]
and $\sum_{t \in [T]} \sum_{i \in I_t \setminus J_t} c_t(i)^2/\Delta = O(R^2 d\iota/\Delta)$.

Next, we bound the latter term. From Lemma A.4, we have $\sum_{t \in [T]} |J_t| = \tilde{O}(dk)$. Since $|\mu_t(i)| \le B$, we have
\[
\sum_{t \in [T]} \Big( \sum_{i \in J^*_t} \mu_t(i) - \sum_{i \in J_t} \mu_t(i) \Big) \le 2B \sum_{t \in [T]} |J_t| = O(Bdk\nu).
\]

Algorithm 3 Thompson sampling
Input: $\lambda > 0$ and $\{v_t\}_{t \in [T]}$ s.t. $v_t > 0$ for all $t \in [T]$.
1: $V_0 \leftarrow \lambda I$ and $b_0 \leftarrow 0$.
2: for $t = 1, 2, \ldots, T$ do
3:   Observe $\{x_t(i)\}_{i \in [N]}$ and $S_t$.
4:   $\hat{\theta}_t \leftarrow V_{t-1}^{-1} b_{t-1}$.
5:   Sample $\tilde{\theta}_t$ from $\mathcal{N}\big(\hat{\theta}_t, v_t^2 V_{t-1}^{-1}\big)$.
6:   for $i \in [N]$ do
7:     $\tilde{r}_t(i) \leftarrow \tilde{\theta}_t^\top x_t(i)$.
8:   end for
9:   Play a set of arms $I_t \in \arg\max_{I \in S_t} \sum_{i \in I} \tilde{r}_t(i)$ and observe rewards $\{r_t(i)\}_{i \in I_t}$.
10:  $V_t \leftarrow V_{t-1} + \sum_{i \in I_t} x_t(i) x_t(i)^\top$ and $b_t \leftarrow b_{t-1} + \sum_{i \in I_t} r_t(i) x_t(i)$.
11: end for

H. Thompson Sampling for Matroid Constraint

We show that the Thompson sampling in Algorithm 3 achieves a near-optimal regret bound for the CCS problem. Note that, similarly to the C2UCB algorithm, in the CCS problem where the number of arms chosen in the $t$-th round depends on $t$, the Thompson sampling achieves the same regret bound if the number of arms chosen in a round is bounded from above by $k$.

Theorem H.1. Let $\kappa = \min(\log(N), d)$, $\nu = \log(kT)$, and $\iota = \min(\log(NkT^2/\delta), d\log(kT^2/\delta)) \log(dT/\delta)$. Assume that $S_t$ is a set of bases of a matroid for all $t \in [T]$. Then, if $v_t = \beta_t(\delta/(4T))$ and $\lambda = L/(dM)$, the Thompson sampling has the following regret bound with probability $1 - \delta$:
\[
R(T) = O\big( R\sqrt{dkT\nu\iota} + Bdk\nu \big).
\]

Proof. Let $\rho^*_t : I_t \to I^*_t$ be the bijection defined by Lemma 5.7 with $\{\tilde{r}_t(i)\}_{i \in [N]}$. Let $J_t = \{i \in I_t \mid \|x_t(i)\|_{V_{t-1}^{-1}} > 1/\sqrt{k}\}$ and $J^*_t = \{\rho^*_t(i) \mid i \in J_t\}$. Using these notations, the regret is rewritten as
\[
R(T) = \sum_{t \in [T]} \Big( \sum_{i \in I^*_t \setminus J^*_t} \mu_t(i) - \sum_{i \in I_t \setminus J_t} \mu_t(i) \Big) + \sum_{t \in [T]} \Big( \sum_{i \in J^*_t} \mu_t(i) - \sum_{i \in J_t} \mu_t(i) \Big). \quad (15)
\]
We bound these terms separately. First, we bound the latter term. By the same discussion as in the proof of Theorem G.1, we have
\[
\sum_{t \in [T]} \Big( \sum_{i \in J^*_t} \mu_t(i) - \sum_{i \in J_t} \mu_t(i) \Big) \le 2B \sum_{t \in [T]} |J_t| = O(Bdk\nu).
\]
Next, we consider the former term. If $\mu_t(i) \ge \mu_t(\rho^*_t(i))$, the term $\mu_t(\rho^*_t(i)) - \mu_t(i)$ does not contribute to the regret. Thus, we can assume
\[
\mu_t(i) \le \mu_t(\rho^*_t(i)) \quad \text{for all } i \in I_t \setminus J_t \quad (16)
\]
without loss of generality. We follow a similar line of proof to that of the Thompson sampling for the contextual linear bandit problem by Abeille & Lazaric (2017).
As in Abeille & Lazaric (2017), we define the following events for all $t \in [T]$:
\[
\hat{E}_t = \big\{ \forall s \le t,\; \|\hat{\theta}_s - \theta\|_{V_{s-1}} \le \beta_s(\delta') \big\} \quad \text{and} \quad \tilde{E}_t = \big\{ \forall s \le t,\; \|\tilde{\theta}_s - \hat{\theta}_s\|_{V_{s-1}} \le \gamma_s(\delta') \big\},
\]
where $\delta' = \delta/(4T)$ and $\gamma_t(\delta) = \sqrt{2d \log(2d/\delta)}\, \beta_t(\delta)$. From an argument similar to Lemma 1 by Abeille & Lazaric (2017), we have $\Pr(\hat{E}_T \cap \tilde{E}_T) \ge 1 - \delta/2$. In the remainder of this proof, we consider the case under the event $\hat{E}_T \cap \tilde{E}_T$. Then, we can decompose the right-hand side of (15) as follows:
\[
\sum_{t \in [T]} \Big( \sum_{i \in I^*_t \setminus J^*_t} \mu_t(i) - \sum_{i \in I_t \setminus J_t} \mu_t(i) \Big)
= \sum_{t \in [T]} \sum_{i \in I_t \setminus J_t} \big( \theta^\top x_t(\rho^*_t(i)) - \tilde{\theta}_t^\top x_t(i) \big)
+ \sum_{t \in [T]} \sum_{i \in I_t \setminus J_t} \big( \tilde{\theta}_t^\top x_t(i) - \theta^\top x_t(i) \big),
\]
where the first sum on the right-hand side is (17) and the second is (18). We can bound (18) as
\[
\sum_{t \in [T]} \sum_{i \in I_t \setminus J_t} \big( \tilde{\theta}_t^\top x_t(i) - \theta^\top x_t(i) \big) \le \sum_{t \in [T]} \sum_{i \in I_t \setminus J_t} \big( \gamma_t(\delta') + \beta_t(\delta') \big) \|x_t(i)\|_{V_{t-1}^{-1}} = O\big( R\sqrt{dkT\nu\iota} \big).
\]
To bound (17), we introduce convex functions $J_{t,i}(u) = \max_{j \in C_{t,i}} u^\top x_t(j)$ for some $C_{t,i} \subseteq [N]$ such that $J_{t,i}(\tilde{\theta}_t) = \tilde{\theta}_t^\top x_t(i)$ and $J_{t,i}(\theta) = \theta^\top x_t(\rho^*_t(i))$. If we construct such convex functions, then, by following the same line as the proof by Abeille & Lazaric (2017), we can complete the proof. In fact, we have
\[
\sum_{t \in [T]} \sum_{i \in I_t \setminus J_t} \big( \theta^\top x_t(\rho^*_t(i)) - \tilde{\theta}_t^\top x_t(i) \big) = O\big( R\sqrt{dkT\nu\iota} \big)
\]
with probability at least $1 - \delta/2$.

We now construct $J_{t,i}(u)$. By the construction of $\rho^*_t$, we have $\tilde{\theta}_t^\top x_t(i) \ge \tilde{\theta}_t^\top x_t(\rho^*_t(i))$. On the other hand, we have $\theta^\top x_t(i) \le \theta^\top x_t(\rho^*_t(i))$ by (16). Therefore, if we define $C_{t,i} = \{i, \rho^*_t(i)\}$, we have $J_{t,i}(\tilde{\theta}_t) = \tilde{\theta}_t^\top x_t(i)$ and $J_{t,i}(\theta) = \theta^\top x_t(\rho^*_t(i))$, which completes the proof.
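For reference, a minimal Python sketch of the Thompson sampling in Algorithm 3 is given below, again under the uniform-matroid feasible sets used in the experiments; the names are illustrative, and this is only a sketch of the sampling procedure, not the experimental code.

import numpy as np

def thompson_sampling(contexts, rewards, k, vs, lam=1.0, seed=0):
    # Sketch of Algorithm 3 for S_t = {I : |I| = k}.
    # contexts: list of (N, d) arrays; rewards: list of (N,) arrays
    # (only the chosen entries are used as feedback); vs: scale parameters v_t.
    rng = np.random.default_rng(seed)
    d = contexts[0].shape[1]
    V = lam * np.eye(d)
    b = np.zeros(d)
    chosen = []
    for t, (X, r) in enumerate(zip(contexts, rewards)):
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b                                                    # line 4
        theta_tilde = rng.multivariate_normal(theta_hat, (vs[t] ** 2) * V_inv)   # line 5
        scores = X @ theta_tilde                                                 # \tilde{r}_t(i) (line 7)
        I_t = np.argsort(scores)[-k:]                                            # greedy = top-k (line 9)
        V += X[I_t].T @ X[I_t]
        b += X[I_t].T @ r[I_t]
        chosen.append(I_t)
    return chosen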