# probably_anytimesafe_stochastic_combinatorial_semibandits__92a239e2.pdf Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits Yunlong Hou 1 Vincent Y. F. Tan 1 2 3 Zixin Zhong 4 Motivated by concerns about making online decisions that incur undue amount of risk at each time step, in this paper, we formulate the probably anytime-safe stochastic combinatorial semibandits problem. In this problem, the agent is given the option to select a subset of size at most K from a set of L ground items. Each item is associated to a certain mean reward as well as a variance that represents its risk. To mitigate the risk that the agent incurs, we require that with probability at least 1 δ, over the entire horizon of time T, each of the choices that the agent makes should contain items whose sum of variances does not exceed a certain variance budget. We call this probably anytime-safe constraint. Under this constraint, we design and analyze an algorithm PASCOMBUCB that minimizes the regret over the horizon of time T. By developing accompanying information-theoretic lower bounds, we show that under both the problem-dependent and problem-independent paradigms, PASCOMBUCB is almost asymptotically optimal. Experiments are conducted to corroborate our theoretical findings. Our problem setup, the proposed PASCOMBUCB algorithm, and novel analyses are applicable to domains such as recommendation systems and transportation in which an agent is allowed to choose multiple items at a single time step and wishes to control the risk over the whole time horizon. 1Department of Mathematics, National University of Singapore, Singapore 2Department of Electrical and Computer Engineering, National University of Singapore, Singapore 3Institute of Operations Research and Analytics, National University of Singapore, Singapore 4Department of Computing Science, University of Alberta, Canada. Correspondence to: Zixin Zhong . Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). 1. Introduction Audrey, a burgeoning social media influencer, makes profits by posting advertisements (ads) under her account. The advertiser pays her only if an ad is clicked. Having taken a class in online optimization, Audrey aims to leverage the theory of bandit algorithms to design an explorationexploitation strategy to ensure that the expected number of clicks of the ads she has posted is maximized. Since the platform is space-limited, Audrey can only post no more than K out of L available ads everyday. Some of these ads, however, include an innocuous-looking lottery or voucher that asks the viewer of the social media platform to provide personal information that may lead to fraud or information leakage. If a user clicks it and becomes a victim of fraud, this may damage Audrey s reputation. Audrey thus has to be circumspect in which and how many ads she posts. On the one hand, Audrey wants to post as many ads that she believes to have high click-through rates as possible; the expected reward she obtains is then the sum of expected rewards of the individual ads. On the other hand, she should balance this with the total risk of the ads that are posted over a period of time; similarly, the risk of a set of ads posted is modeled as the sum of the risks of the individual ads. 
How should Audrey plan the posts of her ads over a period of time to learn their individual expected rewards and risks to ensure that her total expected reward is maximized and, at the same time, with high probability, the risk incurred at any point in time in her exploration-exploitation strategy is bounded by some fixed permissible threshold? In addition to influencers like Audrey, online platforms that make profits by advertising such as You Tube and Tik Tok also encounter similar problems. We are therefore motivated to formulate the probably anytime-safe stochastic combinatorial semi-bandits (PASSCSB) problem which is a regret minimization problem with an anytime safety constraint. More precisely, we aim to design and analyze the performance of an algorithm that, with high probability, ensures that the risk (as measured by the variance) at any time step is below a given threshold and whose regret is minimized. Literature review. There is a large body of works that take risk into account while conducting the exploration and/or exploitation of the unknown reward distributions in the stochastic multi-armed bandits (MABs) literature. Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits Under the risk-constrained pure exploration framework, Hou et al. (2023) and David et al. (2018) attempted to identify the optimal arm within those low-risk (based on their variances or α-quantiles) arms with probability at least 1 δ. Under the risk-aware regret minimization setup, Sani et al. (2012), Vakili & Zhao (2016) and Zhu & Tan (2020) consider the mean-variance as the measure to be minimized over a fixed time horizon. Cassel et al. (2018) provided a general and systematic instruction to analyzing risk-aware MABs, i.e., the risk was incorporated in the Empirical Distribution Performance Measure and the U-UCB algorithm is adopted to perform proxy regret minimization . While these risk-aware algorithms reduce the overall risk during the exploration and exploitation process, the risk is not strictly enforced to be below a prescribed threshold; rather the risk measure is penalized within the objective function, similarly to a Lagrangian. Another setup similar to the riskaware setup is the constrained bandits regret minimization. Mahdavi et al. (2012) required that the number of times the constraint can only be violated is at most sublinear in the horizon T. Kagrecha et al. (2023) proposed a CVa R constraint and performed exploration on the feasible arm, followed by exploration among the feasible arm set. Unlike our formulation, these algorithm are permitted to sample risky arms during exploration. A more stringent constraint can be found in the literature on conservative bandits (Wu et al., 2016), which requires the cumulative return at any time step to be above a constant fraction of the return resulting from repeatedly sampling the base arm. Kazerouni et al. (2017) extended this setup to conservative contextual linear bandits and this was further improved by Garcelon et al. (2020). A similar problem is bandits with knapsacks (Badanidiyuru et al., 2018), which imposes a budget on the cumulative consumed resources and the algorithm stops when the budget is depleted. The most stringent constraint can be found in the safe bandits problem. Khezeli & Bitar (2020) and Moradipari et al. (2020) presented the SEGE, SCLUCB, and SCLTS algorithms to tackle this problem. 
This problem demands that the expected reward of the pulled arm at each time step be greater than a prescribed threshold with high probability, i.e., the stagewise safety constraint . The authors assumed the convexity (continuity) of the arm set and performed exploration around the explored safe arms, starting from a baseline safe arm. This continuity of the (super) arm set does not hold under the combinatorial semi-bandits setup. More comparisons are presented in App. E. For the (unconstrained) combinatorial semi-bandits (CSB) setup, Chen et al. (2013) presented a UCB-type algorithm COMUCB1 to balance the trade-off between exploration and exploitation. Kveton et al. (2015) improved the analysis of COMUCB1 and achieved a tight upper bound (within a specific set of instances). Kveton et al. (2014) introduced matroid structure to CSB and leveraged the matroid structure to design and analyze a greedy algorithm OMM. The risk-aware CSB problem is less studied by the community. Ayyagari & Dukkipati (2021) utilized CVa R as the riskaware measure within the CSB problem, where the risk constraint was not explicitly specified. We observe that the existing literature mentioned above are not directly applicable to Audrey, while our setting (described formally below) dovetails neatly with her problem. Audrey can utilize our algorithm to sequentially and adaptively select different sets of ads everyday and almost always (i.e., with high probability) avoids sets of ads with unacceptably high risks. Beyond any specific applications, we believe that this problem is of fundamental theoretical importance in the broad context of regret minimization in combinatorial multi-armed bandits. Main Contributions. Our first contribution lies at the formulation of a novel PASSCSB problem. In the PASSCSB problem, there are L items with different reward distributions. At each time step, a random reward is generated from each item s distribution. Based on the previous observations, the learning agent selects a solution at each time step. A solution consists of at most K items. The expected return (variance) of a solution is the summation of the reward (variance) of its constituents. Given T N, the agent aims to maximize the cumulative return over T time steps and ensure that with probability 1 δ the variance of all selected solutions are below a given threshold. The key challenge of regret minimization under the PASSCSB lies in handling two distinct tasks we seek optimality in the mean and safeness in the variance of each chosen solution. Our second contribution is the design and analysis of the Probably Anytime-Safe Combinatorial UCB (or PASCOMBUCB) algorithm. Thirdly, we also derive a problem-dependent upper bound on the regret of PASCOMBUCB, which involves a hardness parameter H( (Λ)). We see that H( (Λ)) characterizes the effectiveness of ascertaining the safety of potential solutions in the regret. To assess the optimality of PASCOMBUCB, we prove an accompanying problem-dependent lower bound on the regret of any variance-constrained consistent algorithm. The upper and lower problem-dependent bounds match in almost all the parameters (except in K). Additionally, we show that if δT decays exponentially fast in T, the problem-dependent regret cannot be logarithmic in T. We further present a problem-independent upper bound on the regret of PASCOMBUCB and a lower bound for any algorithm. Just as the problem-dependent bounds, these bounds also match in almost all the parameters. 
Lastly, experiments are conducted to illustrate the empirical Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits performance and corroborate our theoretical findings. In summary, this paper is the first to explore the regret minimization problem in the combinatorial bandits with an anytime constraint on the variance. When δ 1 and σ2 is large (so that the optimal safe solution is the one with the highest mean regardless of safety considerations), our problem reduces to the standard combinatorial semi-bandits (Kveton et al., 2015), and the regret incurred by the safety constraint vanishes, resulting in the same upper bound as the unconstrained case. Furthermore, the framework and analysis of PASCOMBUCB can be extended to other risk measures as long as there are appropriate concentration bounds, e.g., Bhat & Prashanth (2019) or Chang & Tan (2022) enables us to use CVa R or certain continuous functions as risk measures within the generic PASCOMBUCB framework. 2. Problem Setup For m N, let [m] := {1, 2, . . . , m}. An instance of a variance-constrained stochastic combinatorial semi-bandit is a tuple Λ = (E, AK, ν, σ2). We describe the four elements of Λ in the following. Firstly, the finite set E = [L] is known as the ground set in which each i E is known as an item. Secondly, the family AK {S 2E : |S| K} is a collection of subsets of E with cardinality at most K. Each element S AK is known as a solution and AK satisfies the condition that all subsets of S AK remain solutions, i.e., AK is downward-closed. Thirdly, the vector of probability distributions ν = (ν1, ν2, . . . , νL) contains σ2sub-Gaussian distributions {νi}i E with means {µi}i E and variances {σ2 i }i E. The final element of an instance σ2 > 0 denotes the permissible upper bound on the variance. To avoid trivialities, we assume that σ2 > σ2 and K 2. The return of item i E is the random variable Wi with distribution νi. The (stochastic) return of a solution S AK is P i S Wi where Wi νi. The expected return and variance of S AK are i S µi and σ2 S := X respectively. We further assume that every instance Λ satisfies σ2 S = σ2 for all S AK and each distribution νi is supported in the interval [0, 1]. Define S := {S AK : σ2 S < σ2} to be the safe set which contains all the safe solutions. Let the complement of S be the unsafe set Sc. Denote the optimal safe solution as S := arg max{µS : S S} with return µ . For simplicity, we assume that S is unique. Denote the suboptimal set B := {S AK : µS < µ } and the risky set R := {S AK : µS µ , S = S }. For a solution S, let the mean gap S := µ µS and the variance gap v S := |σ2 S σ2|. An instance Λ, time horizon T N and confidence param- eter δ (0, 1) are specified. An agent, who knows E, AK and σ2 but not the vector of probability distributions ν, interacts adaptively with the instance over T time steps as follows. At time step t [T], the agent uses a stochastic function πt that selects a solution St AK based on the observation history Ht 1 := ((Ss, {Wi(s)}i Ss))s [t 1]. In other words, St = πt(Ht 1) is a stochastic function of the history Ht 1. The agent receives the random return P i St Wi(t), where {W(s) = {Wi(s)}i E}s [T ] are i.i.d. according to ν across time. The weights of the selected items {Wi(t) : i St} are observed by the agent at each time t [T]. The collection of stochastic functions π = {πt}t [T ] is known as the agent s policy. 
The goal of the agent is to minimize the expected cumulative regret (or simply regret) Reg(T) over the horizon T, subject to a certain risk constraint. More precisely, the regret suffered by a policy π employed by the agent is defined as Regπ(T) := Eπ i S Wi(t) X The policy π should satisfy the condition that all the solutions chosen {Sπ t }t [T ] AK are safe with probability at least 1 δ, i.e., Pπ t [T], Sπ t S 1 δ. (1) This is referred to as the probably anytime-safe constraint. In the problem-dependent lower bounds, we will refer to a certain class of good policies that operate as the time horizon T and the probability of being safe in the sense of (1) tends to 1. This is formalized in the following. Definition 2.1. Fix an instance ν and a vanishing sequence {δT } T =1 (0, 1). A policy π = {πt} t=1 is said to be a {δT } T =1-variance-constrained consistent algorithm if Regπ(T) = o(T a) for all a > 0 and Pπ t [T], Sπ t S 1 δT for all T N. We often omit the superscripts π in Regπ, Sπ t (or Aπ t and Aπ t,r in PASCOMBUCB) and the subscripts π in the probabilities and expectations if there is no risk of confusion. 3. Our Algorithm: PASCOMBUCB Our algorithm Probably Anytime-Safe Combinatorial UCB (or PASCOMBUCB), presented in Algorithm 1, is designed to satisfy the probably anytime-safe constraint. In particular, we apply (and analyze) the GREEDY-SPLIT subroutine in Line 11; this subroutine has not been involved in an algorithm designed for standard combinatorial semi-bandits such as COMBUCB1 (Chen et al., 2013). Statistics. Since each item i E is σ2-sub-Gaussian, any solution that contains at most q := σ2 σ2 items is safe with probability (w.p.) 1. We call such a solution absolutely Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits Algorithm 1 PASCOMBUCB 1: Input: An instance Λ (with unknown ν), the horizon T and the confidence parameter δ (0, 1). 2: Set phase counter p = 1 and time step counter t = 1. 3: while i E such that Ti(p 1) < 2 do 4: Pull Ap =arg max S:|S| q |{i S : Ti(p 1)<2}|. 5: p p + 1, t t + 1. 6: end while 7: Update the sample mean, sample variance and confidence bounds according to (4). 8: Update the empirically safe set Sp and possibly safe set Sp according to (5) and (6) respectively. 9: while t < T do 10: Identify a solution Ap =arg max A Sp 1 U µ A(p 1). 11: Invoke GREEDY-SPLIT to split the solution Ap into np sub-solutions {Ap,1, . . . , Ap,np} Sp 1. 12: Set np min{np, T t}. 13: Choose solution {Ap,1, . . . , Ap,np}. 14: Update the statistics of all solutions based on (4). 15: Update the empirical sets based on (5) and (6). 16: Set t = t + np and p = p + 1, 17: end while safe. Algorithm 1 (PASCOMBUCB) is conducted in phases, where each phase consists of multiple time steps and each item can be pulled at most once during each phase. Thus we adopt a different notation A to denote the solution in our algorithm. Define Ti(p) := Pp s=1 1{i Ap} as the number of times item i is pulled up to and including phase p. Denote the sample mean and sample variance of item i at phase p respectively as ˆµi(p) := 1 Ti(p) s=1 Wi(s) 1{i As}, and ˆσ2 i (p) := 1 Ti(p) s=1 (Wi(s) ˆµi(p))2 1{i As}. The bound based on the Law of Iterated Logarithms (LIL) is used to construct the confidence radii. For a fixed ϵ (0, 1), define lil(t, ρ) := (1 + ϵ) 1+ϵ 2t ln ln((1+ϵ)t) ρ 1/2 and denote the confidence radius for the mean as α(t) := lil(t, ωµ), (2) where ωµ is a parameter to be chosen. 
The confidence radii for the variance are asymmetric about the empirical variance and are parameterized by ωv and ω v that may not necessarily be the same. They are defined as βu(t) := 3 lil(t, ωv) and βl(t) := 3 lil(t, ω v). (3) We denote the upper and lower confidence bounds (UCB and LCB) for the mean of item i as U µ i (p) := ˆµi(p) + α(Ti(p)) and Lµ i (p) := ˆµi(p) α(Ti(p)) respectively. The UCB and LCB for the variance of item i are defined as U v i (p) := min{ˆσ2 i (p) + βu(Ti(p)), σ2} and Lv i (p) := max{ˆσ2 i (p) βl(Ti(p)), 0} respectively. With the sample mean, sample variance, and confidence bounds for the items, we define the following statistics for all solution S AK: i S ˆµi(p), ˆσ2 S(p) = X i S ˆσ2 i (p), U µ S (p) = X i S U µ i (p), Lµ S(p) = X i S Lµ i (p), (4) U v S(p) = X i S U v i (p), Lv S(p) = X i s Lv i (p). Denote the empirically safe set as Sp := {S AK : U v S(p) < σ2} (5) and the possibly safe set as Sp := {S AK : Lv S(p) < σ2}. (6) The solutions in St and St are called empirically safe and possibly safe solutions respectively. Dynamics. In the initialization stage (lines 3 to 6), PASCOMBUCB greedily pulls the absolutely safe solutions. When each item has been pulled at least twice, this stage is terminated. After initialization, during phase p, PASCOMBUCB firstly identifies a solution Ap = arg max A Sp U µ A(p 1) via an optimization oracle (Line 10). It then calls a subroutine GREEDY-SPLIT to greedily partition the solution Ap into empirically safe sub-solutions (Line 11, see Figure 1 for illustration). Subsequently, these solutions are chosen and the stochastic rewards from the corresponding items are observed (Line 13). Lastly, the empirical estimates, the confidence bounds, and the empirical sets are updated (Lines 14 and 15). Illustration. Figures 2 and 3 illustrate the regret accumulated during phase p and over the whole T horizon respectively. As shown in Figure 2, the regret accumulated during phase p can be decomposed into two parts r=1 (µ µAp,r) = Ap + µ (np 1) where Ap is the (phase-wise) instantaneous regret due to suboptimality and µ (np 1) is the instantaneous regret due to safeness-checking; the latter term results from Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits Algorithm 2 GREEDY-SPLIT 1: Input: A solution Ap and the upper confidence bound on the variance U v(p 1) at phase p 1. 2: Set np = 1, s = 1 and Ap,1 = . 3: Index the items in Ap by i1, . . . , i|Ap|. 4: while s |Ap| do 5: if U v Ap,np(p 1) + U v is(p 1) σ2 then 6: Set Ap,np Ap,np {is}. 7: else 8: np np + 1 and Ap,np = {is}. 9: end if 10: s s + 1. 11: end while 12: return {Ap,1, . . . , Ap,np}. Figure 1. A diagram of a split to a solution Ap containing 5 items. the safeness constraint. At the beginning, since the upper confidence bounds of the variances of all solutions are large, each solution will be split into up to 2Q sub-solutions, where Q := K q , and hence the regret due to safeness checking can be large. As the algorithm progresses, we obtain more observations of items and get more confident about their variances (U v i (p) decreases). Hence, during some later phase, it suffices to split some solutions into fewer sub-solutions and the regret due to safeness-checking reduces. Furthermore, when most items are sampled sufficiently many times, the unsafe solutions are excluded from the possibly safe set Sp, and the only contribution to the regret is via the suboptimality of the solution Ap. Remark 3.1. 
The two parameters ωv and ω v determine the confidence radii of variances and do not necessarily have to be the same. The confidence parameter ω v is solely a parameter of PASCOMBUCB; its choice does not rely on the confidence parameter δ and only affects Lv S(p), the lower confidence bound of the variance, which determines when we ascertain a solution to be unsafe. The choice of ωv depends on δ and it influences U v S(p), the upper confidence bound of the variance, which guides PASCOMBUCB to split the solution to satisfy the probably anytime-safe constraint. Instantaneous regret due to suboptimality Instantaneous regret due to safeness-checking Figure 2. Solution Ap is split into np = 3 sub-solutions, the instantaneous regret at phase p can be divided into the instantaneous regret due to suboptimality and the instantaneous regret due to safeness-checking. Instantaneous Regret Instantaneous regret due to suboptimality Instantaneous regret due to safeness-checking Figure 3. An illustration of the instantaneous regret yielded by PASCOMBUCB. As the variances of the items are more determined, less regret due to safeness-checking is generated. 4. Problem-dependent Bounds For simplicity, when a time horizon T and a confidence parameter δ = δT are given, we set the confidence parameters ωµ = ω v = 1 T 2 and ωv = δT We introduce various suboptimality gaps that contribute to the regret due to the suboptimality. for i E \ S , let the minimum safe-suboptimal gap be i,S B,min := min S i,S S B S; for i E, let the minimum unsafe-suboptimal gap be i,Sc B,min := min S i, S Sc B S; and let the tension parameter between the mean gap S and variance gap v S be ci := max S i, S Sc B S max{ S, v S/3} We also define following safeness gaps that induce the conservative sampling strategy to guarantee the probably anytime-safe constraint. For i E, and Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits for the risky set R, define the minimum unsafeness gap v i,R := min S i,S R v S. for the safe and suboptimal set S B, let Ψi,S B := max S i, S S B min ln T 2 S , 9 ln(T/δT ) which characterizes the order of the number of times that item i needs to be sampled in order to identify the suboptimality of all safe and suboptimal solutions A i while satisfying the safeness constraint. We further define a variant of Ψi,S B as Ψ i,S B := max S i,S S B min ln T 2 S , 9 ln(1/δT ) which will be used to characterize the lower bound. for the unsafe and suboptimal set Sc B, let Φi,Sc B := max S i, S Sc B min ln T 2 S , 9 ln T which characterizes the hardness of identifying the unsafeness of suboptimality of all unsafe and suboptimal solutions that contain item i. Define ξ(ω) := 2+ϵ ϵ ω ln(1+ϵ) 1+ϵ, where ϵ (0, 1) is fixed. 4.1. Problem-dependent Upper Bound Theorem 4.1 (Problem-dependent upper bound). Let Λ = (E, AK, ν, σ2) be an instance and let {δT } T =1 o(1) be a sequence that satisfies ln(1/δT ) = o(T b) for all b > 0 (i.e., {δT } is not exponentially decaying). Then, PASCOMBUCB is a {δT } T =1-variance-constrained consistent algorithm. More precisely, given a time budget T, the probably anytimesafe constraint is satisfied and the regret of PASCOMBUCB Reg(T) is upper bounded by min {Tµ , Reg1(T) + Reg2(T)} + Reg3(T), Reg1(T) = O X K ln T i,S B,min + X ci K ln T i,Sc B,min Reg2(T) = 2µ H ( (Λ)) , Reg3(T) = 2µ (L + 1) where (Λ) = { v S } { v i,R, Ψi,S B, Φi,Sc B}i E and H ( (Λ)) := H(1, Λ) is defined in (26) in App. B.4. Remark 4.2. 
If the gaps in (Λ) are sufficiently small and δT = T λ for a fixed λ > 0, H ( (Λ)) = O (λ + 1)K2 ln T ( v S )2 + K X ln T ( v i,R)2 + max S i, S S B min ln T 2 S , (λ + 1) ln T + Φi,Sc B . See (26) for more details of this calculation. The first term Tµ in the regret bound provides a na ıve upper bound for the expected regret conditional on the variance constraint holds. The order of the regret (o(T a) for all a > 0) implies the regret will be asymptotically bounded by the second term when the time budget T is sufficiently large. The second term is comprised of two parts the regret due to suboptimality Reg1(T) and the regret due to safeness-checking Reg2(T). The intuition for the regret due to suboptimality Reg1(T) is that Each item in any safe and suboptimal solution will be sampled O( K ln T 2 i,S B,min ) times to ascertain the suboptimal- ity of all safe and suboptimal solutions to which this item belongs to. Each item in an unsafe and suboptimal solution S will be sampled O K ln T max{ S, v S/3}2 times to ascertain either the suboptimality or the unsafeness of S. As this should be done for all the unsafe and suboptimal solutions, we need to take the maximum of the above time complexity. More precisely, when ci = 1, suboptimality identification of the unsafe and suboptimal solutions to which item i belongs dominates the regret; and when ci < 1, the ascertaining of the unsafeness dominates the regret. The intuition for the regret due to safeness checking Reg2(T) is that H ( (Λ)) provides an upper bound for the number of time steps needed for guaranteeing the safeness of all solutions. PASCOMBUCB achieves this in a judicious manner since it does not check the safeness of all the solutions at the start, followed by exploration and exploitation of the possibly high-return safe solutions. Instead, it takes advantage of the fact that when a (safe or unsafe) suboptimal solution is ascertained to be suboptimal, its safeness can be disregarded, as reflected in the terms Ψi,S B and Ψi,Sc B. In addition, it will not sample an unsafe solution if it is identified as unsafe w.p. at least 1 2ξ(ω v). The last term Reg3(T) corresponds to the regret due to failure of the good event and at the initialization stage. A proof sketch is presented in Section 6. 4.2. Problem-dependent Lower Bound Theorem 4.3 (Problem-dependent lower bound). Let {δT } T =1 o(1) be a sequence that satisfies ln(1/δT ) = o(T b) for all b > 0. There exists an instance Λ such that for any {δT }T N-variance-constrained consistent algorithm π, the regret is lower bounded by ln T i,S B,min K Ω K ln(1/δT ) Ψ i,S B + ln T ( v i,R)2 + Φi,Sc B . The proof is presented at App. B.5. With Theorem 4.3, the Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits problem-dependent upper bound is tight for polynomially decaying {δT }T N. Corollary 4.4 (Tightness of problem-dependent bounds). Let δT = T λ with a fixed λ > 0, the regret ln T i,S B,min + µ K2 H ( (Λ)) K ln T i,S B,min + µ H ( (Λ)) where H ( (Λ)) is defined in Remark 4.2. The upper bound above is achieved by PASCOMBUCB. Under different rates of decay of {δT }T N (see App. 
D for the cases where ln(1/δT ) = ω(ln T) and o(ln T)), the upper bound of the regret due to suboptimality Reg1(T) (the first term in the total regret) and the upper bound of the regret due to safeness-checking Reg2(T) (the latter term) match their corresponding lower bounds up to factors of K and K2 respectively; this gap is acceptable as K (e.g., number of ads displayed) is usually small relative to L (total number of ads). More discussions are postponed to App. E. One may naturally wonder whether we can tolerate a much more stringent probably anytime-safe constraint. The following theorem (with b = 1) indicates no algorithm is {δT }T Nvariance-constrained consistent if δT decays exponentially fast in T. Detailed proofs are postponed to App. C. Theorem 4.5 (Impossibility result). Let {δT } T =1 o(1) be a sequence that satisfies that there exists b (0, 1] such that ln(1/δT ) = Ω(T b). For any instance Λ, the regret of any algorithm is lower bounded by Ω(T b). 5. Problem-independent Bounds We can derive a problem-independent upper bound on the regret of PASCOMBUCB from the problem-dependent one in Theorem 4.1 with some delicate calculations (see App. B.5). Theorem 5.1 (Problem-independent upper bound). Let {δT } T =1 o(1) be a sequence that satisfies ln(1/δT ) = o(T b) for all b > 0. If T > L, for any instance Λ with variance gaps lower bounded by v min S AK v S, the regret of PASCOMBUCB is upper bounded by KLT ln T + LK2 Theorem 5.2 (Problem-independent lower bound). Let the minimum variance gap be v := min S AK v S. When K3 L2, we have KLT + min n L ( v)2 ln 1 Remark 5.3. The assumption that the variance gaps of all solutions are lower bounded by v is needed to achieve a non-vacuous problem-independent bound. Given any algorithm and time budget T, the variance gap of S can be arbitrarily small if v is not bounded away from zero, so the min in Theorem 5.2 will be dominated by the linear term T, and hence, no algorithm can attain sublinear regret. Corollary 5.4 (Tightness of problem-independent bounds). Let K3 L2, and {δT } T =1 o(1) satisfies ln(1/δT ) = o(T b) for all b > 0. We have KLT + L ( v)2 ln 1 KLT ln T + LK2 The upper bound is achieved by PASCOMBUCB. We observe that the gap between the upper and lower bounds is manifested on ln T and K2. The presence of ln T is not unexpected as it is also involved in the gap between the bounds on the regret for the (unconstrained) combinatorial bandits (Kveton et al., 2015). Besides, the term K2 is induced by the design of PASCOMBUCB. Additional discussions are provided in App. E. 6. Proof Sketch of the Problem-Dependent Upper Bound (Theorem 4.1) Assume that PASCOMBUCB has processed T phases with T time steps, we have P[T T] = 1 since each phase is composed by multiple time steps. Denote the expected regret of PASCOMBUCB with p phases as E[R(p)]. The expected regret of PASCOMBUCB after T time steps is E[R(T )] := E T X r=1 (µ µAp,r) . In the proof of Theorem 4.1, we first show a regret decomposition lemma (Lemma 6.1) that separates the total regret into the regret due to suboptimality E[R1(T )], the regret due to safeness-checking E[R2(T )] and the regret due to the failure of the good event and the initialization. Then we upper bound R1(T ) and R2(T ) separately. To elucidate the dependence of the regret on the confidence parameters ωµ, ωv and ω v, we retain these notations henceforth. Detailed proofs are presented in App. B. 
For p [T], i E, define the good events that the sample mean and the sample variance are near their ground truths: Eµ i,Ti(p) := {ˆµi(p) α(Ti(p)) µi ˆµi(p) + α(Ti(p))} and Ev i,Ti(p)(ρ) := {ˆσ2 i (p) 3 lil(Ti(p), ρ) σ2 i ˆσ2 i (p)+ 3 lil(Ti(p), ρ)} and Ei,Ti(p) := Eµ i,Ti(p) Ev i,Ti(p)(ωv) Ev i,Ti(p)(ω v) p [T ] Ei,Ti(p 1). Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits Instantaneous Regret Instantaneous regret due to safeness-checking Upper bound Figure 4. We assume the algorithm will sample those solutions with large U v A(p), i.e., those phases in which more sub-solutions are sampled are moved forward (the dark red ones). Based on this, an upper bound can be derived (the thick black lines). For r [Q 1], define Up(r) := {U v Ap(p 1) > r σ2}. When event Up(r) occurs at phase p, it indicates at least r + 1 sub-solutions are needed in order to sample the items in Ap and guarantee the safeness constraint. Lemma 6.1. Assume that PASCOMBUCB has processed T phases with T time steps, the expected regret of PASCOMBUCB can be decomposed into three parts as follows E[R(T )] E[R1(T )|E] + E[R2(T )|E] + R3(T) where R1(T ) := p=1 1{Ap B} Ap R2(T ) := µ T X r=1 1{Up(r)} R3(T) := 2µ L 1 + T ξ(ωµ) + 2ξ(ωv) + 2ξ(ω v In Lemma 6.1, the first term R1(T ) is the (high-probability) regret due to suboptimality, in the sense that only the mean gaps of the suboptimal solutions contribute to R1(T). The second term R2(T ) is called the (high-probability) regret due to safeness-checking, since it depends on the variance gaps and goes to 0 if σ2 is sufficiently large. The last term R3(T) contains the regret from the initialization stage and the regret results from the failure of the good event E. The regret due to suboptimality can be bounded in terms of the minimum safe/unsafe-suboptimal gaps as follows. Lemma 6.2. Conditioned on event E, the regret due to suboptimality R1(T ) can be bounded by K i,S B,min ln 1 ci K i,Sc B,min ln 1 The regret due to safeness-checking involves more critical parameters of the instance and we encode them in T r and H(r , Λ) for r [Q] (see Figure 5); these terms are defined formally in (25) and (26) respectively. Instantaneous Regret Upper bound Figure 5. An illustration of the upper bound of R2(T ) for phase T Q 2 T < T Q 3. When r = 1, 2µ H(1, Λ) is the area below the thick line, i.e., the upper bound for R2(T ) for any T . Lemma 6.3. On the event E, if T [T r , T r 1) then R2(T ) 2µ [T (r 1) + H(r , Λ)] 2µ H(1, Λ) To upper bound R2(T ), we assume the algorithm samples solutions with large U v A(p) in Sp, which will then be split into several sub-solutions (see Figure 4). Furthermore, for r = Q 1, Q 2, . . . , 1, we derive an upper bound for the number of phases in which event Up(r ) (Up(r + 1))c occurs (at most 2r + 1 sub-solutions are being pulled in these phases). To be more specific (see Figure 5), for r = Q 1, we compute the maximum number of phases T Q 1 in which at most 2Q 1 sub-solutions are sampled. Then for r = Q 2, we compute the maximum number of phases T Q 2 T Q 1 in which at most 2Q 3 sub-solutions are sampled. We do this until the time budget runs out. As T increases, r decreases and H(r , Λ) increases. When r = 1, i.e. T T 1, H(1, Λ) is an upper bound for the total number of sub-solutions being pulled (up to a constant) for the safeness-checking or the price of satisfying the probably anytime-safe constraint. The upper bound for the regret due to safeness-checking is the instance-dependent constant 2µ H(1, Λ) when T T 1. 
More discussions are postponed to Step 3 in the proof in App. B.4. 7. Experiments In this section, we ran 3 sets of experiments to illustrate the empirical performance of PASCOMBUCB and to corroborate its theoretical guarantees. As COMBUCB1 (Kveton et al., 2015; Khezeli & Bitar, 2020; Amani et al., 2019) has tight regret guarantees, we adopt it as the benchmark in the unconstrained case. Codes are accessible at https://github.com/Y-Hou/PASSCSB.git. Experimental Design: We design two instances where the rewards are Beta distributed with means and variances as in Table 1. There are L = 10 base arms and the admissible solution set AK contains all subsets of [L] with cardinality no greater than K = 3 (so |AK| = 175). Since the arm distributions are supported on [0, 1], the sub-Gaussian parameter σ2 = 0.25. The confidence parameter δ = 0.05. Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits 0 2 4 6 8 10 Cumulative regret (a) Experiment 1 0 1 2 3 4 5 Cumulative reward 0.00% 0.00% (b) Experiment 2 0 20 40 60 80 100 0 Additional regret Additional Regret Linear fit (c) Experiment 3 Figure 6. Results of the Experiments: (a) Experiment 1: Cumulative regret v.s. Time horizon; (b) Experiment 2: Cumulative reward v.s. Time horizon with the percentages of violations besides the data points; (c) Experiment 3: Additional regret v.s. 1/( v S)2. Item index 1 2 3 4 5 to 10 Means 0.5 0.45 0.4 0.35 0.3 Variances (Set 1) 0.24 0.24 0.04 0.01 0.01 Variances (Set 2) 0.01 0.01 0.01 0.01 0.01 Table 1. Two sets of items with equal means for each item. In Experiment 1, we quantify the additional regret due to the safeness checking and evaluate the performance of PASCOMBUCB with Set 1 under the unconstrained case. We run (1) PASCOMBUCB with σ = 0.6 which needs to check the safeness of the solutions; (2) PASCOMBUCB with σ = 0.751 which can be regarded as a variant of PASCOMBUCB without the safeness constraint, since our algorithm is aware of the safeness of all solutions; (3) COMBUCB1 which is a baseline algorithm. In Experiment 2, we illustrate the effectiveness of PASCOMBUCB in satisfying the safety constraint. Furthermore, we show that if an algorithm ignores the safety constraint, it will violate the safety constraint Ω(T) times if there exists a risky solution. We run PASCOMBUCB and COMBUCB1 with Set 1 under the constrained case where σ = 0.4, the optimal safe solution is {1, 3, 4}, and the optimal solution under the unconstrained case {1, 2, 3} is unsafe (risky). In Experiment 3, we empirically verify the dependence of the additional regret on the hardness parameter H( (Λ)) in (26) using Set 2. We fix the time horizon T = 2 106 and vary the threshold on the variance from 0.14 to 0.72 (i.e., σ2 = 0.14 1.2k for k = 0, 1, . . . , 9). As any solution that is comprised of 3 items has variance 0.03, we have v S = σ2 0.03. We compare the additional regret with respect to 1/( v S)2, which is proportional to H( (Λ)) under this setup according to (26). Experimental Results: For Experiment 1, we present the results in Figure 6(a). We first observe when σ2 = 0.751, the regret incurred by PASCOMBUCB is similar to that by COMBUCB1 for all T considered, which suggests that PASCOMBUCB is comparable to COMBUCB1 under the unconstrained case, and hence in the following experiments we refer the difference between the regret of PASCOMBUCB and the regret of COMBUCB1 as the additional regret . Secondly, when σ2 = 0.6, the regret of PASCOMBUCB increases rapidly at the beginning and plateaus when T > 4 105. 
This corroborates the design of PASCOMBUCB : (i) at the beginning, PASCOMBUCB pulls solutions conservatively to meet the anytime-safe constraint w.h.p.; (ii) after a number of time steps (T > 4 105), the safeness of the optimal (safe) solution can be ascertained, it then exploits the optimal solution aggressively and eventually matches the performance of COMBUCB1. For Experiment 2, we plot the percentage of times each algorithm violates the safeness constraint σ2 At < σ2 as well as the cumulative rewards in Figure 6(b). The reward of PASCOMBUCB increases slowly at the start and then more rapidly when T > 1.5 106, when the safeness of the optimal safe solution has been ascertained. However, while the reward of COMBUCB1 increases linearly (as it pulls the risky solution {1, 2, 3} Ω(T) times), it violates the safeness constraint σ2 St < σ2 at almost all times. This implies that the safety constraint is almost always violated by COMBUCB1 (Ω(T) times) whereas PASCOMBUCB can meet the probably anytime-safe requirement. For Experiment 3, the results are in Figure 6(c). As suggested by Theorem 4.1, the regret due to safeness checking is proportional to H( (Λ)). Figure 6(c) indicates that empirically, the additional regret scales linearly in 1/( v S)2, which corroborates our theoretical results. Additional discussions on the tightness results, the problem formulation and comparisons with other literature, as well as future research directions, are presented in App. E. Acknowledgements The authors are supported by Singapore Ministry of Education (MOE) grants (Grant Numbers: A-0009042-01-00, A-8000189-01-00, A-8000980-00-00, A-8000423-00-00) and funding from CIFAR through Amii and NSERC. Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits Amani, S., Alizadeh, M., and Thrampoulidis, C. Linear stochastic bandits under safety constraints. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, volume 32, pp. 9256 9266, 2019. Ayyagari, S. and Dukkipati, A. Risk-aware algorithms for combinatorial semi-bandits, 2021. Badanidiyuru, A., Kleinberg, R., and Slivkins, A. Bandits with knapsacks. Journal of the ACM, 65(3), 2018. Bhat, S. P. and Prashanth, L. A. Concentration of risk measures: a wasserstein distance approach. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, volume 32, pp. 11762 11771. Curran Associates, Inc., 2019. Cassel, A., Mannor, S., and Zeevi, A. A general approach to multi-armed bandits under risk criteria. In Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pp. 1295 1306. PMLR, 2018. Chang, J. Q. L. and Tan, V. Y. F. A unifying theory of Thompson sampling for continuous risk-averse bandits. In Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI), 2022. Chen, W., Wang, Y., and Yuan, Y. Combinatorial multiarmed bandit: General framework and applications. In Proceedings of the 30th International Conference on Machine Learning, volume 28, pp. 151 159. PMLR, 2013. David, Y., Sz or enyi, B., Ghavamzadeh, M., Mannor, S., and Shimkin, N. PAC bandits with risk constraints. In ISAIM, 2018. Garcelon, E., Ghavamzadeh, M., Lazaric, A., and Pirotta, M. Improved algorithms for conservative exploration in bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 3962 3969, 2020. Hou, Y., Tan, V. Y. F., and Zhong, Z. Almost optimal variance-constrained best arm identification. 
IEEE Transactions on Information Theory, 2023. Jamieson, K., Malloy, M., Nowak, R., and Bubeck, S. lil UCB: An optimal exploration algorithm for multiarmed bandits. In Proceedings of the 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pp. 423 439, Barcelona, Spain, 2014. PMLR. Kagrecha, A., Nair, J., and Jagannathan, K. Constrained regret minimization for multi-criterion multi-armed bandits. Machine Learning, 2023. Kaufmann, E., Capp e, O., and Garivier, A. On the complexity of best-arm identification in multi-armed bandit models. Journal of Machine Learning Research, 17(1): 1 42, 2016. Kazerouni, A., Ghavamzadeh, M., Abbasi-Yadkori, Y., and Van Roy, B. Conservative contextual linear bandits. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 3913 3922. Curran Associates Inc., 2017. Khezeli, K. and Bitar, E. Safe linear stochastic bandits. Proceedings of the AAAI Conference on Artificial Intelligence, 34(06):10202 10209, 2020. Kveton, B., Wen, Z., Ashkan, A., Eydgahi, H., and Eriksson, B. Matroid bandits: Fast combinatorial optimization with learning. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, pp. 420 429, 2014. Kveton, B., Wen, Z., Ashkan, A., and Szepesvari, C. Tight Regret Bounds for Stochastic Combinatorial Semi Bandits. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, volume 38, pp. 535 543. PMLR, 2015. Mahdavi, M., Jin, R., and Yang, T. Trading regret for efficiency: Online convex optimization with long term constraints. Journal of Machine Learning Research, 13 (1):2503 2528, 2012. Moradipari, A., Thrampoulidis, C., and Alizadeh, M. Stagewise conservative linear bandits. In Proceedings of the 34th International Conference on Neural Information Processing Systems, volume 33, pp. 11191 11201. Curran Associates, Inc., 2020. Sani, A., Lazaric, A., and Munos, R. Risk-aversion in multiarmed bandits. In Proceedings of the 25th International Conference on Neural Information Processing Systems, pp. 3275 3283. Curran Associates Inc., 2012. Vakili, S. and Zhao, Q. Risk-averse multi-armed bandit problems under mean-variance measure. IEEE Journal of Selected Topics in Signal Processing, 10(6):1093 1111, 2016. Wu, Y., Shariff, R., Lattimore, T., and Szepesvari, C. Conservative bandits. In Proceedings of the 33rd International Conference on Machine Learning, volume 48, pp. 1254 1262. PMLR, 2016. Zhong, Z., Cheung, W. C., and Tan, V. Y. F. Thompson sampling algorithms for cascading bandits. Journal of Machine Learning Research, 22(218):1 66, 2021. Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits Zhu, Q. and Tan, V. Y. F. Thompson sampling algorithms for mean-variance bandits. In Proceedings of the 37th International Conference on Machine Learning, pp. 11599 11608. PMLR, 2020. Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits The contents of the appendices are organized as follows: In App. A, we list 3 useful lemmas concerning the LIL concentration bound. In App. B, we present detailed proofs of the upper bounds. App. B.1: preliminary results for the proof of the upper bound; App. B.2: the proof of the decomposition lemma Lemma 6.1; App. B.3: the proof of Lemma 6.2 (the regret due to suboptimality); App. B.4: the proof of Lemma 6.3 (the regret due to safeness-checking); App. B.5: the proofs of Theorem 4.1 (problem-dependent upper bound) and Theorem 5.1 (problem-independent upper bound). In App. 
C, we present detailed proofs of the lower bounds. App. C.1: preliminary results for the proof of the lower bound and the proof of the impossibility result Theorem 4.5. App. C.2: the proof of Theorem 4.3 (problem-dependent lower bound); App. C.3: the proof of Theorem 5.2 (problem-independent lower bound); In App. D, we present a corollary characterizing the tightness of the upper bound in Theorem 4.1. In App. E, we provide additional discussions on the tightness results, the problem formulation and comparisons with other literature, as well as future research directions. A. Auxiliary results Lemma A.1 (Lemma 3 in (Jamieson et al., 2014)). Let {Xi} i=1 be a sequence of i.i.d. centered sub-Gaussian random variables with scale parameter σ. Fix any ϵ (0, 1) and δ (0, ln(1 + ϵ)/e). Then one has s=1 Xs (1+ ϵ) 2σ2 (1+ϵ) t ln ln ((1+ϵ)t) where ξ(δ) := 2+ϵ ϵ δ ln(1+ϵ) 1+ϵ. Lemma A.2. For t 1, ϵ (0, 1), ω (0, 1] and u > 0, let γ := (1+ϵ)(1+ ϵ)2 2 , c := (u s)2 γ = 2(u s)2 (1+ϵ)(1+ ϵ)2 and m := γ u2 s2 2 ln 1 ω + ln ln+ 1 s2 + ln 2γ(1+ϵ) u2 . If t > m, it holds that s > lil(t, ω) = (1 + ϵ) 2t ln ln ((1+ϵ)t) Proof of Lemma A.2. Note that fact that u s (1 + ϵ) 2t ln ln ((1+ϵ)t) c = 2(u s)2 (1 + ϵ)(1 + ϵ)2 1 t ln ln ((1+ϵ)t) According to the computations in Jamieson et al. (2014) equation (1), i.e., t ln ln((1 + ϵ)t) c ln 2 ln((1 + ϵ)/(c ω)) for t 1, ϵ (0, 1), c > 0, ω (0, 1]. We take c = c, thus c ln 2 ln((1 + ϵ)/(cω)) ω + ln ln γ(1 + ϵ) u2 ω + ln 1 ω + ln γ(1 + ϵ) u2 ω + ln ln+ 1 s2 ω + ln ln+ 1 s2 + ln 2γ(1 + ϵ) Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits where we adopt ln(x + y) x + ln ln+ y, x, y R+ in (a). Therefore, if t > m, we must have u s > (1 + ϵ) 2t ln ln ((1+ϵ)t) Lemma A.3. With the choice of the confidence radii in (2) and (3), for all i E, we have P [ p N : |ˆµi(p) µi| α(Ti(p))] 1 2ξ(ωµ) (7) P p N : |ˆσ2 i (p) σ2 i | βu(Ti(p)) 1 4ξ(ωv) (8) P p N : |ˆσ2 i (p) σ2 i | βl(Ti(p)) 1 4ξ(ω v) (9) Proof. Note the fact that any distribution supported on [0, 1] is 1/4-sub-Gaussian. By a direct application of Lemma A.1 to the sample mean ˆµi(p) and the sample second moment ˆ M2,i(p) := 1 Ti(p) Pp s=1 Wi(s)21{i As} of arm i [L], (7) can be derived and P [ p N : |µi ˆµi(p)| lil(Ti(p), ω v)] 1 2ξ(ω v), and P h p N : | ˆ M2,i(p) (µ2 i + σ2 i )| lil(Ti(p), ω v) i 1 2ξ(ω v) Since the rewards are in [0, 1], |µ2 i ˆµ2 i (p)| = |µi + ˆµi(p)| |µi ˆµi(p)| 2 lil(Ti(p), ω v). Using this and the triangle inequality, we obtain for every p 1, |ˆσ2 i (p) σ2 i | = |µ2 i ˆµ2 i (p)| + |(µ2 i + σ2 i ) ˆ M2,i(p)| 2 lil(Ti(p), ωv) + lil(Ti(p), ωv) = βu(Ti(p)). Therefore, (8) is proved. (9) can be similarly obtained. B. Proof of the Upper Bound B.1. Proof scheme of the problem-dependent upper bound In this subsection, we provide technical lemmas that can upper bound the components in R1(T ) and R2(T ). Note that at phase p, the identified solution Ap belongs to one of the 4 disjoint sets: (1) Ap = S ; (2) S B; (3) R and (4) Sc B, i.e. 1 = 1 {Ap = S } + 1 {Ap S B} + 1 {Ap R} + 1 {Ap Sc B} and 1 {Ap B} = 1 {Ap S B} + 1 {Ap Sc B}. Define two events (the F events) that connect the instance and the confidence radii Fµ p := Ap 2 X i Ap\S α(Ti(p 1)) Fp(x, ρ) := x 2 X i Ap lil(Ti(p 1), ρ) where x is a constant and ω is a confidence parameter. When Ap B, it indicates solution Ap has not been sampled sufficiently many times and its suboptimality has not been ascertained. When Ap Sc, it implies the unsafeness of Ap has not been recognized. We formalize this in the following lemma. Lemma B.1. 
Conditional on the event E, given any p [T], we have Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits 1 {Ap S B} 1 Fµ p ; 1 {Ap R} 1 Fp v Ap 3 , ω v If Ap Sc B, 1 {Ap Sc B} 1 Fµ p , Fp v Ap 3 , ω v Proof of Lemma B.1. By the design of PASCOMBUCB , Ap Sp 1. (1) We firstly prove that S Sp 1, S S. On the event E, we have Lv S(p 1) = X i A max{ˆσ2 i (p 1) βl(Ti(p 1)), 0} i A max{σ2 i , 0} = σ2 S < σ2 Thus, S Sp 1, and in particular, S Sp 1. (2) If Ap B, according to the sampling strategy in Line 10 of PASCOMBUCB and S Sp 1, we have U µ S (p 1) U µ Ap(p 1) which indicates P i S \Ap U µ i (p 1) P i Ap\S U µ i (p 1). Thus, X i S \Ap µi X i S \Ap U µ i (p 1) i Ap\S U µ i (p 1) i Ap\S µi + 2αi(Ti(p 1)) i Ap\S α(Ti(p 1)) (10) (3) If Ap Sc, according to the sampling strategy, we have Ap Sp 1 which indicates Lv Ap(p 1) = P i Ap Lv i (p 1) < σ2. Thus, σ2 > ˆσ2 Ap(p 1) X i Ap βl(Ti(p 1)) i Ap βl(Ti(p 1)) i Ap\S βl(Ti(p 1)) (11) Note that if Ap S B, according to (10), 1 {Ap S B} 1 Fµ p . If Ap R Sc, by (11) 1 {Ap R} 1 Fp v Ap 3 , ω v Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits If Ap Sc B, by (11) and (10) 1 {Ap Sc B} 1 Fµ p , Fp v Ap 3 , ω v At phase p, we define two sequences of mutually-exclusive events {Gµ j,p}j N and {Gj,p(x, ω}j N (the G events) which can further bound the number of times Fµ p and Fv p (x, ω) occur respectively. These events are indexed by two strictly-decreasing sequences of constants: a1 > a2 > . . . > ak > . . . 1 > b1 > b2 > . . . > bk > . . . where limj aj = limj bj = 0. For simplicity, we set aj = 4 9j 2 , bj = 1 4j , j N and denote the constant C = P j N aj bj = 259.2. For x R+ and ω (0, ln(1 + ϵ)/e), define mj(x, ω) := aj γK2 ω + ln ln+ 1 x2 + D and mj(x, ω) := otherwise, where (1) γ = (1+ϵ)(1+ ϵ)2 2 and ϵ is the constant in the confidence bounds (3), (2) ln ln+(x) = ln ln x if x e and it equals to 0 otherwise, (3) D = ln 324K2(1 + ϵ)2(1 + ϵ)2 . Denote Gµ j,p := i Ap \ S : Ti(p 1) mj( Ap, ωµ) and Gj,p(x, ω) := {i Ap : Ti(p 1) mj(x, ω)} as the sets of items that were not chosen sufficiently often. For j N, the events at phase p are sequentially defined as Gµ j,p := n at least bj K items in Ap \ S were chosen at most mj( Ap, ωµ) times o \ k [j 1] Gµ k,p = n Gµ j,p bj K o \ k [j 1] Gµ k,p Gj,p(x, ω) := n at least bj K items in Ap were chosen at most mj(x, ω) times o \ k [j 1] Gk,p(x, ω) = n |Gj,p(x, ω)| bj K o \ k [j 1] Gk,p(x, ω) Lemma B.2. With our choice of {aj}j N and {bj}j N, if Fµ p occurs, Gµ j,p occurs for some j, i.e. Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits if Fp(x, ω) occurs, Gj,p(x, ω) occurs for some j, i.e. 1 {Fp(x, ω)} 1 j N Gj,p(x, ω) Proof of Lemma B.2. We prove (12) in the following and the other statement can be proved by the same procedures. To ease the notations, we omit the parameters x, ω and p in Fp(x, ω), Gj,p(x, ω) and mj(x, ω), since they are fixed when given Fp(x, ω). The event Gj can be rewritten as Gj = n |Gj| bj K o \ k [j 1] Gc k = n |Gj| bj K o \ n |Gk| < bk K o The statement is proved by contradiction. We assume that when F (Fp(x, ω)) occurs, none of event Gj occurs. Hence, n |Gj| < bj K o [ n |Gk| bk K o n |Gj| < bj K o Let Gj := Ap \ Gj and define G0 = Ap. According to the definition of Gj, we have Gj Gj 1 and Gj 1 Gj, j N. Because limj mj = 0, there exists j0 such that Gj = Ap, j j0. Therefore, we can write Ap by the telescoping sum, i.e., Ap = j N Gj \ Gj 1 . Fp(x, ω) indicates i Ap lil(Ti(p 1), ω) (13) i Gj\ Gj 1 lil(Ti(p 1), ω) Note that for i Gj \ Gj 1 = Gj 1 \ Gj, we have Ti(p 1) (mj, mj 1]. 
By Lemma A.2 with parameters t = Ti(p 1), s = x, u = q 1 aj K2 and note a1 > aj, we have 1 aj K2 x > lil(Ti(p 1), ω). Note our choice of aj and bj satisfy bj 1 bj aj 1 Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits Thus, (13) can be further bounded by i Gj\ Gj 1 lil(Ti(p 1), ω) j N | Gj \ Gj 1| (bj 1 bj)K aj K x which constitutes a contradiction. Thus when F occurs, there must exists j N such that Gj occurs. When none of the Gµ j,p (resp. Gj,p(x, ω)) occurs, Fµ p (resp. Fp(x, ω)) must not occur, which indicates all of the items in Ap have been sampled sufficiently many times such that the suboptimality (resp. unsafeness) of Ap is identified, thus Ap will not been sampled in future phases. As there will be multiple F events happening, we provide the following useful lemma that merges all F events. Lemma B.3. Given two confidence parameters ω1 ω2 (0, ln(1 + ϵ)/e) If Fµ p occurs, then Fp( Ap, ωµ) occurs. If both events Fp(x, ω1) and Fp(y, ω2) occur, then event Fp ln 1 ω1 ln 1 ω2 y}, ω1 Proof of Lemma B.3. If Fµ p occurs, we have i Ap\S α(Ti(p 1)) 2 X i Ap α(Ti(p 1)) = 2 X i Ap lil(Ti(p 1), ωµ) Thus, Fp( Ap, ωµ) occurs. For the second statement, notice the fact that lil(Ti(p 1), ω1) lil(Ti(p 1), ω2) = (1 + ϵ) 1+ϵ 2Ti(p 1) ln ln((1+ϵ)Ti(p 1)) (1 + ϵ) 1+ϵ 2t ln ln((1+ϵ)Ti(p 1)) v u u tln 1 ω1 + ln ln((1 + ϵ)Ti(p 1)) ω2 + ln ln((1 + ϵ)Ti(p 1)) v u u tln 1 where (a) utilizes the trick that a+c b , a, b, c R+ and a b. When event Fp(y, ω2) occurs, i Ap lil(Ti(p 1), ω2) 2 X 1 ω lil(Ti(p 1), ω1) i Ap lil(Ti(p 1), ω1) = Fp(ρy, ω1) Thus if both events Fp(x, ω1) and Fp(y, ω2) occur, we must have that event Fp (max{x, ρy}, ω1)) = ln 1 ω1 ln 1 ω2 y}, ω1) occurs. Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits B.2. Upper bound decomposition Lemma 6.1. Assume that PASCOMBUCB has processed T phases with T time steps, the expected regret of PASCOMBUCB can be decomposed into three parts as follows E[R(T )] E[R1(T )|E] + E[R2(T )|E] + R3(T) where R1(T ) := p=1 1{Ap B} Ap R2(T ) := µ T X r=1 1{Up(r)} R3(T) := 2µ L 1 + T ξ(ωµ) + 2ξ(ωv) + 2ξ(ω v Proof of Lemma 6.1. The expected regret can be decomposed as: E [R(T )] = E r=1 (µ µAp,r) r=1 (µ µAp,r)1 {E} r=1 (µ µAp,r)1 {Ec} r=1 (µ µAp,r) E r=1 (µ µAp,r) Ec In the initialization stage, it will take at most 2L time steps, since each pulled solution Ap contains at least one item i with Ti(p) < 2. Thus the regret is at most 2L µ . The expected regret when the good events fail can be upper bounded by r=1 (µ µAp,r) Ec L 2(ξ(ωµ) + 2ξ(ωv) + 2ξ(ω v)) 2µ TL (ξ(ωµ) + 2ξ(ωv) + 2ξ(ω v)) where P[Ec] can be bounded using Lemma A.3. Conditional on the good event E, the high-probability regret can be upper bounded by r=1 (µ µAp,r) (14) Ap + µ (np 1) Ap + µ Q 1 X 2 1{U v Ap(p 1) > r σ2} 2 # Ap + µ Q 1 X r=1 2 1{U v Ap(p 1) > r σ2} Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits where (a) makes use of Lemma B.4 and the fact if U v Ap(p 1) ((m 1) σ2, m σ2], then m = PQ 1 r=0 1{U v Ap(p 1) > r σ2} = PQ 1 r=0 1{Up(r)}. 
Note the fact that only the suboptimal solutions yields positive mean gap and that the negative mean gap of the risky solutions can be upper bounded by 0, thus the above equation can be further divided into two parts Ap + µ Q 1 X r=1 2 1{U v Ap(p 1) > r σ2} p=1 1{Ap B} Ap + p=1 µ Q 1 X r=1 2 1{U v Ap(p 1) > r σ2} =: R1(T ) + R2(T ) In conclusion, by summarizing the regret from the initialization stage, the regret due to failure of the good event and the high-probability regret, the expected regret can be bounded by E[R(T )] E[R1(T )|E]P[E] + E[R2(T )|E]P[E] + 2µ TL (ξ(ωµ) + 2ξ(ωv) + 2ξ(ω v)) + 2µ L E[R1(T )|E] + E[R2(T )|E] + 2µ TL (ξ(ωµ) + 2ξ(ωv) + 2ξ(ω v)) + 2µ L = E[R1(T )|E] + E[R2(T )|E] + R3(T) B.3. Regret due to suboptimality Lemma 6.2. Conditioned on event E, the regret due to suboptimality R1(T ) can be bounded by K i,S B,min ln 1 ci K i,Sc B,min ln 1 Proof of Lemma 6.2. Notice the fact that the suboptimal solution Ap can be safe, i.e. Ap S B, or unsafe, i.e., Ap Sc B, we upper bound the regret under these the two scenarios separately. Case 1: Ap S B By the definition of the event Gµ j,p, we have |Gµ j,p| = i Ap \ S : Ti(p 1) mj( Ap, ωµ) bj K which indicates 1 Ap S B, Gµ j,p 1 bj K i Ap\S 1 Ap S B, Ti(p 1) mj( Ap, ωµ) . Given an item i E \ S , assume it is included vi solutions in S B, we index them according to the decreasing order of their mean gaps, i.e. {Ai,k}k [vi] with Ai,1 . . . Ai,vi. Therefore, p=1 1 {Ap S B} Ap 1 {E} p=1 1 Ap S B, Fµ p Ap j N 1 Ap S B, Gµ j,p Ap i Ap\S 1 Ap S B, Ti(p 1) mj( Ap, ωµ) Ap Probably Anytime-Safe Stochastic Combinatorial Semi-Bandits k [vi] 1 Ai,k = Ap, i Ap, Ti(p 1) mj( Ai,k, ωµ) Ai,k bj K 1 Ai,k = Ap, i Ai,k, Ti(p 1) aj γK2 ωµ + ln ln+ 1 2 Ai,k + D bj K 1 Ai,k = Ap, i Ai,k, Ti(p 1) aj γK2 ωµ + ln ln+ 1 2 Ai,vi + D ωµ + ln ln+ 1 2 Ai,vi + D 1 Ai,vi + k=2 Ai,k+1 1 2 Ai,k+1 1 2 Ai,k ωµ + ln ln+ 1 2 Ai,vi + D where (a) and b make use of Lemma B.1 and Lemma B.2, (c) is obtained by relaxing ln ln+ 1 Ai,k to ln ln+ 1 Ai,vi , (d) is obtained by solving the optimization problem. Case 2: Ap Sc B For the case where ωµ ω v, denote ω := r ln 1 ω v ln 1 ωµ . Given an item i E, assume it is included vi solutions in Sc B, we index them according to the decreasing order of their gaps A := max n ω A, v A 3 o , i.e. {Ai,k}k [vi] with Ai,1 . . . Ai,vi. Denote ci := maxk [vi] Ai,k 2 , ki = arg maxk [vi] Ai,k and di = mink [vi] Ai,k, i.e., the minimum mean gap. 
$$\sum_{p=1}^{T'}\mathbf{1}\{A_p\in S_B^c\}\,\Delta_{A_p}\,\mathbf{1}\{\mathcal{E}\} \overset{(a)}{\le} \sum_{p=1}^{T'}\mathbf{1}\Big\{A_p\in S_B^c,\ F^\mu_p,\ F_p\Big(\frac{\Delta^v_{A_p}}{3},\tilde\omega^v\Big)\Big\} \,\Delta_{A_p} \overset{(b)}{\le} \sum_{p=1}^{T'}\mathbf{1}\Big\{A_p\in S_B^c,\ F_p\Big(\max\Big\{\bar\omega\Delta_{A_p},\frac{\Delta^v_{A_p}}{3}\Big\},\tilde\omega^v\Big)\Big\}\,\Delta_{A_p}$$
$$\le \sum_{p=1}^{T'}\sum_{j\in\mathbb{N}}\mathbf{1}\Big\{A_p\in S_B^c,\ G_{j,p}\Big(\max\Big\{\bar\omega\Delta_{A_p},\frac{\Delta^v_{A_p}}{3}\Big\},\tilde\omega^v\Big)\Big\}\,\Delta_{A_p} \overset{(c)}{\le} \sum_{j\in\mathbb{N}}\frac{1}{b_jK}\sum_{i\in E}\sum_{k\in[v_i]}\sum_{p=1}^{T'}\mathbf{1}\Big\{A_{i,k}=A_p,\ i\in A_{i,k},\ T_i(p-1)\le a_j\gamma K^2\Big(\frac{\ln\frac{1}{\tilde\omega^v}+\ln\ln^+\frac{1}{\bar\Delta^2_{A_{i,k}}}}{\bar\Delta^2_{A_{i,k}}}+D\Big)\Big\}\,\Delta_{A_{i,k}}$$
$$\overset{(d)}{\le} \sum_{i\in E}C\gamma K\Big(\ln\frac{1}{\tilde\omega^v}+\ln\ln^+\frac{1}{\bar\Delta^2_{A_{i,v_i}}}+D\Big)\Big[\frac{c_i}{d_i} + \int_{d_i}^{\Delta_{A_{i,k_i}}}\frac{c_i}{u^2}\,\mathrm{d}u\Big] \overset{(e)}{\le} \sum_{i\in E}\frac{2c_i\,C\gamma K}{\bar\Delta_{A_{i,v_i}}}\Big(\ln\frac{1}{\tilde\omega^v}+\ln\ln^+\frac{1}{\bar\Delta^2_{A_{i,v_i}}}+D\Big),\tag{15}$$
where (a) comes from Lemma B.1, (b) results from Lemma B.3, (c) is due to Lemma B.2, (d) is obtained by relaxing $\ln\ln^+\frac{1}{\bar\Delta^2_{A_{i,k}}}$ to $\ln\ln^+\frac{1}{\bar\Delta^2_{A_{i,v_i}}}$, and (e) is achieved by solving the optimization problem in (15).

In conclusion, for $i\in E\setminus S^*$, denote $\Delta_{i,S_B,\min} := \min_{S\ni i,\,S\in S_B}\Delta_S$. For $i\in E$, denote
$$\Delta_{i,S_B^c,\min} := \min_{S\ni i,\,S\in S_B^c}\Delta_S,\qquad \bar\Delta_{i,S_B^c} := \min_{S\ni i,\,S\in S_B^c}\max\Big\{\bar\omega\Delta_S,\frac{\Delta^v_S}{3}\Big\},\qquad c_i := \max_{S\ni i,\,S\in S_B^c}\frac{\Delta_S}{\max\{\bar\omega\Delta_S,\ \Delta^v_S/3\}}.$$
The regret due to suboptimality can be upper bounded by
$$R_1(T') \le \sum_{i\in E\setminus S^*}\frac{2C\gamma K}{\Delta_{i,S_B,\min}}\Big(\ln\frac{1}{\omega^\mu}+\ln\ln^+\frac{1}{\Delta^2_{i,S_B,\min}}+D\Big) + \sum_{i\in E}\frac{2c_i\,C\gamma K}{\bar\Delta_{i,S_B^c}}\Big(\ln\frac{1}{\tilde\omega^v}+\ln\ln^+\frac{1}{(\bar\Delta_{i,S_B^c})^2}+D\Big).$$

B.4. Regret due to safeness-checking

We first introduce two more technical lemmas in order to upper bound the regret. We characterize the number of sub-solutions $n_p$ in Algorithm 2 by the following lemma.

Lemma B.4. At any phase $p$, we have $\mathbb{P}[n_p \le Q] = 1$. Furthermore, if $U^v_{A_p}(p-1) \in ((m-1)\bar\sigma^2, m\bar\sigma^2]$ for some $m \in \mathbb{N}$, then $m \le n_p \le 2m-1$.

Proof of Lemma B.4. Recall that an absolutely safe solution $S$ is safe w.p. $1$ and $S \in \mathcal{S}_{p-1}$; thus $|A_{p,r}| \ge q$ for all $r \in [n_p - 1]$. If $n_p > Q$, i.e., $n_p \ge Q+1$, then
$$\sum_{r=1}^{n_p}|A_{p,r}| > \sum_{r=1}^{Q}|A_{p,r}| \ge Q\,q \ge K,$$
which constitutes a contradiction since $\sum_{r=1}^{n_p}|A_{p,r}| = |A_p| \le K$. Therefore, we have $\mathbb{P}[n_p \le Q] = 1$. If $U^v_{A_p}(p-1) \in ((m-1)\bar\sigma^2, m\bar\sigma^2]$ for some $m\in\mathbb{N}_+$, we sequentially define $\{j_1, j_2, \dots\} \subseteq [|A_p|]$ as follows: denote $j_0 := 0$, and let $j_1$ be the integer such that $\sum_{s=1}^{j_1} U^v_{i_s}(p-1) \le \bar\sigma^2$ and $\sum_{s=1}^{j_1+1} U^v_{i_s}(p-1) > \bar\sigma^2$. Then let $j_2 > j_1$ be the integer such that $\sum_{s=j_1+1}^{j_2} U^v_{i_s}(p-1) \le \bar\sigma^2$ and $\sum_{s=j_1+1}^{j_2+1} U^v_{i_s}(p-1) > \bar\sigma^2$, and so on. The last integer $j_k = |A_p|$ satisfies $\sum_{s=j_{k-1}+1}^{j_k} U^v_{i_s}(p-1) \le \bar\sigma^2$. If $k \ge m+1$, we must have
$$U^v_{A_p}(p-1) = \sum_{l=1}^{k}\sum_{s=j_{l-1}+1}^{j_l} U^v_{i_s}(p-1) \ge \sum_{l=1}^{m}\sum_{s=j_{l-1}+1}^{j_l+1} U^v_{i_s}(p-1) > m\bar\sigma^2,$$
which contradicts $U^v_{A_p}(p-1) \in ((m-1)\bar\sigma^2, m\bar\sigma^2]$. Hence, $k \le m$. Then we construct the sub-solutions by $A_{p,l} = \{i_{j_{l-1}+1}, \dots, i_{j_l-1}\}$ for $l \in [k-1]$ and $A_{p,k} = \{i_{j_{k-1}+1}, \dots, i_{j_k}\}$. There are $k-1$ items $\{i_{j_l} : l \in [k-1]\}$ left, which compose at most $k-1$ additional sub-solutions. In conclusion, we need at most $2m-1$ sub-solutions, i.e., $n_p \le 2m-1$. We also need at least $m$ sub-solutions, since each sub-solution has a variance UCB of at most $\bar\sigma^2$ while $U^v_{A_p}(p-1) > (m-1)\bar\sigma^2$. So $m \le n_p \le 2m-1$.

Remark B.5. The indexing of the items in Line 3 of GREEDY-SPLIT can be done arbitrarily, i.e., it does not require any specific order of the items. As such, GREEDY-SPLIT is an efficient greedy algorithm. We note that finding the optimal order that leads to the minimum number of sub-solutions $n_p$ is a combinatorial problem that is generally hard to solve. Since the variance upper confidence of any solution $S$ satisfies $U^v_S(p) \le Q\bar\sigma^2$, $n_p$ can be at most $2Q-1$.
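The splitting procedure in the proof of Lemma B.4 is easy to mirror in code. Below is a minimal Python sketch of the greedy split (our own illustration, not the paper's implementation), assuming each single item's variance UCB is at most $\bar\sigma^2$ so that each leftover boundary item can form its own sub-solution:

```python
def greedy_split(item_ucbs, sigma2_bar):
    """Greedily pack items into sub-solutions whose variance UCBs sum to
    at most sigma2_bar; the boundary item that would overflow a block is
    set aside and later forms its own singleton sub-solution."""
    blocks, current, current_sum, leftovers = [], [], 0.0, []
    for item, u in item_ucbs:
        if current_sum + u <= sigma2_bar:
            current.append(item)
            current_sum += u
        else:
            if current:
                blocks.append(current)
            leftovers.append(item)            # overflowing boundary item
            current, current_sum = [], 0.0
    if current:
        blocks.append(current)
    blocks.extend([[item] for item in leftovers])
    return blocks

# Total UCB 1.2 lies in (2*0.5, 3*0.5], i.e. m = 3, so at most 2m-1 = 5 blocks.
items = [("i1", 0.3), ("i2", 0.3), ("i3", 0.3), ("i4", 0.3)]
print(greedy_split(items, sigma2_bar=0.5))    # [['i1'], ['i3'], ['i2'], ['i4']]
```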
Lemma 6.1 implies that the key to upper bounding the regret due to safeness-checking is to upper bound $\sum_{r=1}^{Q-1}\mathbf{1}\{U_p(r)\}$ over the horizon $T$. From the definition $U_p(r) := \{U^v_{A_p}(p-1) > r\bar\sigma^2\}$, for $r_1, r_2 \in [Q-1]$ with $r_1 > r_2$, event $U_p(r_1)$ implies event $U_p(r_2)$. Thus, in order to upper bound $\sum_{r=1}^{Q-1}\mathbf{1}\{U_p(r)\}$, it suffices to upper bound $\mathbf{1}\{U_p(r)\}$ for each $r \in [Q-1]$. To be more specific, given $l \in [Q-1]$, in order to compute the maximum number of times $\sum_{r=1}^{Q-1}\mathbf{1}\{U_p(r)\} \ge l$, we only need to compute the maximum number of times event $U_p(l)$ occurs. In the following lemma, we give a necessary condition (in terms of the events $F_p(x,\omega)$) for event $U_p(r)$.

Lemma B.6. On the event $\mathcal{E}$,
$$\mathbf{1}\{U_p(r)\} \le \mathbf{1}\Big\{F_p\Big(\frac{(r-1)\bar\sigma^2+\Delta^v_{A_p}}{3},\,\omega^v\Big)\Big\}\ \text{for}\ A_p\in\mathcal{S},\qquad \mathbf{1}\{U_p(r)\} \le \mathbf{1}\Big\{F_p\Big(\frac{(r-1)\bar\sigma^2-\Delta^v_{A_p}}{3},\,\omega^v\Big)\Big\}\ \text{for}\ A_p\in\mathcal{S}^c.$$

Proof of Lemma B.6. The proof is straightforward:
$$U^v_{A_p}(p-1) > r\bar\sigma^2 \overset{(a)}{\Longrightarrow} \hat\sigma^2_{A_p}(p-1) + \sum_{i\in A_p}\beta^u(T_i(p-1)) > r\bar\sigma^2 \overset{(b)}{\Longrightarrow} \sigma^2_{A_p} + 2\sum_{i\in A_p}\beta^u(T_i(p-1)) > r\bar\sigma^2 \Longrightarrow \sum_{i\in A_p} 3\,\mathrm{lil}(T_i(p-1), \omega^v) > (r-1)\bar\sigma^2 + (\bar\sigma^2 - \sigma^2_{A_p}),$$
where (a) is due to the definition (4) of the confidence bounds for a solution, and (b) utilizes the event $\bigcap_{i\in A_p} E_{i, T_i(p-1)}$. For $A_p \in \mathcal{S}$, the last event is equivalent to $F_p\big(\frac{(r-1)\bar\sigma^2+\Delta^v_{A_p}}{3}, \omega^v\big)$; for $A_p \in \mathcal{S}^c$, it is equivalent to $F_p\big(\frac{(r-1)\bar\sigma^2-\Delta^v_{A_p}}{3}, \omega^v\big)$.

The above lemma and Lemma B.1 upper bound the components of $R_2(T')$ by the $F$ events. We are now ready to bound the regret due to safeness-checking.

Lemma 6.3. On the event $\mathcal{E}$, if $T' \in [\bar T_{r^*}, \bar T_{r^*-1})$, then $R_2(T') \le 2\mu^*\big[T'(r^*-1) + H(r^*, \Lambda)\big] \le 2\mu^* H(1, \Lambda)$.

Proof of Lemma 6.3. From Lemma 6.1, the high-probability regret due to safeness-checking is
$$R_2(T') = 2\mu^*\sum_{r=1}^{Q-1}\sum_{p=1}^{T'}\mathbf{1}\{U_p(r)\}.$$
In the following, given $r \in [Q-1]$, we upper bound $\sum_{p=1}^{T'}\mathbf{1}\{U_p(r)\}$ conditioned on event $\mathcal{E}$. When $U^v_{A_p}(p-1) > r\bar\sigma^2$ holds, at most $2r+1$ sub-solutions are chosen at phase $p$, i.e., $n_p \le 2r+1$, according to Lemma B.4. Therefore, we are also deriving an upper bound on the number of phases in which at most $2r+1$ sub-solutions are sampled. The proof is organized as follows. In Step 1, we decompose the event $U_p(r)$ into four events according to where $A_p$ lies: (1) $A_p = S^*$; (2) $A_p \in S_B$; (3) $A_p \in R$; and (4) $A_p \in S_B^c$; we upper bound the regret in each case. In Step 2, we apply Lemma B.1 to upper bound the number of times a solution $A$ can be selected via the events $F^\mu_p$ and $F_p(x, \omega)$. Because multiple $F^\mu_p$ and $F_p(x,\omega)$ events appear in the indicator functions, Lemma B.3 is adopted to merge them into one event. After that, Lemma B.2 is utilized to bridge the number of times a solution is identified to the number of times an item is sampled. At the end of this step, we conclude the number of times $\mathbf{1}\{U_p(r)\}$ occurs in each of the four cases. In Step 3, we upper bound $\sum_{r=1}^{Q-1}\sum_{p=1}^{T'}\mathbf{1}\{U_p(r)\}$ based on the results of Step 2.

Step 1: We decompose $\sum_{p=1}^{T'}\mathbf{1}\{U_p(r)\}$ conditioned on event $\mathcal{E}$ into four parts:
$$\sum_{p=1}^{T'}\mathbf{1}\{U_p(r)\} \le \sum_{p=1}^{T'}\big[\mathbf{1}\{A_p = S^*\} + \mathbf{1}\{A_p\in S_B\} + \mathbf{1}\{A_p\in R\} + \mathbf{1}\{A_p\in S_B^c\}\big]\,\mathbf{1}\{U^v_{A_p}(p-1) > r\bar\sigma^2\}.$$

Step 2: In each of the scenarios, we first upper bound the regret by the $F$ events, which are then further bounded in terms of the $G$ events.

Case 1: $A_p = S^*$.
$$\sum_{p=1}^{T'}\mathbf{1}\{A_p=S^*\}\mathbf{1}\{U^v_{A_p}(p-1)>r\bar\sigma^2\} \overset{(a)}{\le} \sum_{p=1}^{T'}\mathbf{1}\{A_p=S^*\}\mathbf{1}\Big\{F_p\Big(\frac{(r-1)\bar\sigma^2+\Delta^v_{S^*}}{3},\omega^v\Big)\Big\} \le \sum_{p=1}^{T'}\sum_{j\in\mathbb{N}}\mathbf{1}\{A_p=S^*\}\mathbf{1}\Big\{G_{j,p}\Big(\frac{(r-1)\bar\sigma^2+\Delta^v_{S^*}}{3},\omega^v\Big)\Big\}$$
$$\overset{(b)}{\le} \sum_{j\in\mathbb{N}}\frac{1}{b_jK}\sum_{i\in S^*}\sum_{p=1}^{T'}\mathbf{1}\Big\{T_i(p-1)\le m_j\Big(\frac{(r-1)\bar\sigma^2+\Delta^v_{S^*}}{3},\omega^v\Big)\Big\} \overset{(c)}{\le} \sum_{i\in S^*}\frac{9C\gamma K}{((r-1)\bar\sigma^2+\Delta^v_{S^*})^2}\Big(\ln\frac{1}{\omega^v}+\ln\ln^+\frac{1}{(\Delta^v_{S^*})^2}+D\Big),$$
where (a) utilizes Lemma B.6, (b) makes use of Lemma B.2, and (c) follows from the definition of $G_{j,p}$. For simplicity, we denote
$$g_{S^*}(r, \Delta^v_{S^*}) := \frac{9C\gamma K}{((r-1)\bar\sigma^2+\Delta^v_{S^*})^2}\Big(\ln\frac{1}{\omega^v}+\ln\ln^+\frac{1}{(\Delta^v_{S^*})^2}+D\Big).$$
Therefore,
$$\sum_{p=1}^{T'}\mathbf{1}\{A_p=S^*\}\,\mathbf{1}\{U^v_{A_p}(p-1)>r\bar\sigma^2\} \le \sum_{i\in S^*} g_{S^*}(r, \Delta^v_{S^*}).$$

Case 2: $A_p \in S_B$. In this case there is a comparison between $\omega^\mu$ and $\omega^v$; we thus denote $\bar\omega := \sqrt{\ln(1/\omega^\mu)\big/\ln(1/\omega^v)}$. For $i\in E$, denote
$$\bar\Delta_{i,S_B} := \min_{S\ni i,\,S\in S_B}\max\Big\{\Delta_S\sqrt{\ln(1/\omega^\mu)},\ \frac{\Delta^v_S}{3}\sqrt{\ln(1/\omega^v)}\Big\},$$
which is achieved by some solution $S_{i,S_B}$, and assume $\bar\Delta_{i,S_B} \in \big(r_i\,\bar\sigma^2\sqrt{\ln(1/\omega^v)},\ (r_i+1)\,\bar\sigma^2\sqrt{\ln(1/\omega^v)}\big]$ for some $r_i\in\mathbb{N}$.

Scenario 1: $\omega^\mu \ge \omega^v$, i.e., $\bar\omega \le 1$. We have
$$\sum_{p=1}^{T'}\mathbf{1}\{A_p\in S_B\}\mathbf{1}\{U^v_{A_p}(p-1)>r\bar\sigma^2\} \overset{(a)}{\le} \sum_{p=1}^{T'}\mathbf{1}\Big\{A_p\in S_B,\ F_p(\Delta_{A_p},\omega^\mu),\ F_p\Big(\frac{(r-1)\bar\sigma^2+\Delta^v_{A_p}}{3},\omega^v\Big)\Big\} \overset{(b)}{\le} \sum_{p=1}^{T'}\mathbf{1}\Big\{A_p\in S_B,\ F_p\Big(\max\Big\{\Delta_{A_p},\ \bar\omega\,\frac{(r-1)\bar\sigma^2+\Delta^v_{A_p}}{3}\Big\},\omega^\mu\Big)\Big\}\tag{16}$$
$$\overset{(c)}{\le} \sum_{p=1}^{T'}\sum_{j\in\mathbb{N}}\mathbf{1}\{A_p\in S_B\}\mathbf{1}\Big\{G_{j,p}\Big(\max\Big\{\Delta_{A_p},\ \bar\omega\,\frac{(r-1)\bar\sigma^2+\Delta^v_{A_p}}{3}\Big\},\omega^\mu\Big)\Big\} \overset{(d)}{\le} \sum_{j\in\mathbb{N}}\frac{1}{b_jK}\sum_{i\in E}\sum_{p=1}^{T'}\mathbf{1}\Big\{A_p\in S_B,\ i\in A_p,\ T_i(p-1)\le m_j\Big(\max\Big\{\Delta_{A_p},\ \bar\omega\,\frac{(r-1)\bar\sigma^2+\Delta^v_{A_p}}{3}\Big\},\omega^\mu\Big)\Big\},$$
where (a) utilizes Lemma B.6 and Lemma B.1, (b) and (c) make use of Lemma B.3, and (d) is due to Lemma B.2 and the definition of $G_{j,p}$.
(1) If $r \ge r_i + 2$, then for any $S\in S_B$ that contains item $i$, $\max\{\Delta_S, \bar\omega((r-1)\bar\sigma^2+\Delta^v_S)/3\} \ge \bar\omega(r-1)\bar\sigma^2/3$; hence the count above is bounded by
$$\frac{9C\gamma K}{((r-1)\bar\sigma^2)^2\ln(1/\omega^v)}\Big(2\ln\frac{1}{\omega^\mu}+\ln\ln^+\frac{\ln(1/\omega^\mu)}{\bar\Delta^2_{i,S_B}}+D\Big)\quad\text{per item } i.$$
(2) If $r \le r_i + 1$, then for any $S\in S_B$ that contains item $i$, $\max\{\Delta_S, \bar\omega((r-1)\bar\sigma^2+\Delta^v_S)/3\} \ge \bar\Delta_{i,S_B}/\sqrt{\ln(1/\omega^\mu)}$ (achieved by the solution $S_{i,S_B}$), which yields the per-item bound
$$\frac{C\gamma K}{\bar\Delta^2_{i,S_B}}\Big(2\ln\frac{1}{\omega^\mu}+\ln\ln^+\frac{\ln(1/\omega^\mu)}{\bar\Delta^2_{i,S_B}}+D\Big).$$
Therefore, defining $g_{S_B,1}(r,\bar\Delta_{i,S_B})$ as the piecewise function taking the value in (1) when $r \ge r_i+2$ and the value in (2) otherwise (it coincides with $g_{S_B}$ in (21) below with $\omega^{\mu v}$ replaced by $\omega^\mu$), (16) can be upper bounded by $\sum_{i\in E} g_{S_B,1}(r,\bar\Delta_{i,S_B})$.

Scenario 2: $\omega^\mu \le \omega^v$, i.e., $\bar\omega \ge 1$. We have, analogously,
$$\sum_{p=1}^{T'}\mathbf{1}\{A_p\in S_B\}\mathbf{1}\{U^v_{A_p}(p-1)>r\bar\sigma^2\} \le \sum_{p=1}^{T'}\mathbf{1}\Big\{A_p\in S_B,\ F_p\Big(\max\Big\{\frac{\Delta_{A_p}}{\bar\omega},\ \frac{(r-1)\bar\sigma^2+\Delta^v_{A_p}}{3}\Big\},\omega^v\Big)\Big\} \le \sum_{j\in\mathbb{N}}\frac{1}{b_jK}\sum_{i\in E}\sum_{p=1}^{T'}\mathbf{1}\Big\{A_p\in S_B,\ i\in A_p,\ T_i(p-1)\le m_j\Big(\max\Big\{\frac{\Delta_{A_p}}{\bar\omega},\ \frac{(r-1)\bar\sigma^2+\Delta^v_{A_p}}{3}\Big\},\omega^v\Big)\Big\},\tag{17}$$
where the same steps (a)–(d) apply with the merged confidence parameter $\omega^v$.
(1) If $r \ge r_i + 2$, the maximum is at least $(r-1)\bar\sigma^2/3$, giving the per-item bound $\frac{9C\gamma K}{((r-1)\bar\sigma^2)^2\ln(1/\omega^v)}\big(2\ln\frac{1}{\omega^v}+\ln\ln^+\frac{\ln(1/\omega^v)}{\bar\Delta^2_{i,S_B}}+D\big)$.
(2) If $r \le r_i + 1$, the maximum is at least $\bar\Delta_{i,S_B}/\sqrt{\ln(1/\omega^v)}$, giving the per-item bound $\frac{C\gamma K}{\bar\Delta^2_{i,S_B}}\big(2\ln\frac{1}{\omega^v}+\ln\ln^+\frac{\ln(1/\omega^v)}{\bar\Delta^2_{i,S_B}}+D\big)$.
Therefore, defining $g_{S_B,2}(r,\bar\Delta_{i,S_B})$ as the corresponding piecewise function (i.e., $g_{S_B}$ in (21) with $\omega^{\mu v}$ replaced by $\omega^v$), (17) can be upper bounded by $\sum_{i\in E} g_{S_B,2}(r,\bar\Delta_{i,S_B})$.
In conclusion, for $i\in E$, we denote $\bar\Delta_{i,S_B} := \min_{S\ni i,\,S\in S_B}\max\{\Delta_S\sqrt{\ln(1/\omega^\mu)},\ \frac{\Delta^v_S}{3}\sqrt{\ln(1/\omega^v)}\}$ and $\omega^{\mu v} := \max\{\omega^\mu, \omega^v\}$, and define
$$g_{S_B}(r,\bar\Delta_{i,S_B}) := \begin{cases}\dfrac{9C\gamma K}{((r-1)\bar\sigma^2)^2\ln(1/\omega^v)}\Big(2\ln\dfrac{1}{\omega^{\mu v}}+\ln\ln^+\dfrac{\ln(1/\omega^{\mu v})}{\bar\Delta^2_{i,S_B}}+D\Big), & r \ge \Big\lfloor\dfrac{\bar\Delta_{i,S_B}}{\bar\sigma^2\sqrt{\ln(1/\omega^v)}}\Big\rfloor + 2,\\[6pt] \dfrac{C\gamma K}{\bar\Delta^2_{i,S_B}}\Big(2\ln\dfrac{1}{\omega^{\mu v}}+\ln\ln^+\dfrac{\ln(1/\omega^{\mu v})}{\bar\Delta^2_{i,S_B}}+D\Big), & \text{otherwise};\end{cases}$$
then $\sum_{p=1}^{T'}\mathbf{1}\{A_p\in S_B\}\mathbf{1}\{U^v_{A_p}(p-1)>r\bar\sigma^2\} \le \sum_{i\in E} g_{S_B}(r,\bar\Delta_{i,S_B})$.

Case 3: $A_p \in R$. In this case there is a comparison between $\omega^v$ and $\tilde\omega^v$; we thus denote $\bar\omega := \sqrt{\ln(1/\tilde\omega^v)\big/\ln(1/\omega^v)}$ and $\omega_{\mathrm{sum}} := \sqrt{\ln(1/\omega^v)}+\sqrt{\ln(1/\tilde\omega^v)}$. For $i\in E$, denote $\Delta^v_{i,R} := \min_{S\ni i,\,S\in R}\Delta^v_S$, and assume $\Delta^v_{i,R} \in \big(r_i\frac{\bar\omega}{1+\bar\omega}\bar\sigma^2,\ (r_i+1)\frac{\bar\omega}{1+\bar\omega}\bar\sigma^2\big]$ for some $r_i\in\mathbb{N}$.

Scenario 1: $\omega^v \ge \tilde\omega^v$, i.e., $\bar\omega \ge 1$. We have
$$\sum_{p=1}^{T'}\mathbf{1}\{A_p\in R\}\mathbf{1}\{U^v_{A_p}(p-1)>r\bar\sigma^2\} \overset{(a)}{\le} \sum_{p=1}^{T'}\mathbf{1}\Big\{A_p\in R,\ F_p\Big(\frac{\Delta^v_{A_p}}{3},\tilde\omega^v\Big),\ F_p\Big(\frac{(r-1)\bar\sigma^2-\Delta^v_{A_p}}{3},\omega^v\Big)\Big\} \overset{(b)}{\le} \sum_{p=1}^{T'}\mathbf{1}\Big\{A_p\in R,\ F_p\Big(\max\Big\{\frac{\Delta^v_{A_p}}{3},\ \bar\omega\,\frac{(r-1)\bar\sigma^2-\Delta^v_{A_p}}{3}\Big\},\tilde\omega^v\Big)\Big\}\tag{18}$$
$$\overset{(c)}{\le} \sum_{j\in\mathbb{N}}\frac{1}{b_jK}\sum_{i\in E}\sum_{p=1}^{T'}\mathbf{1}\Big\{A_p\in R,\ i\in A_p,\ T_i(p-1)\le m_j\Big(\max\Big\{\frac{\Delta^v_{A_p}}{3},\ \bar\omega\,\frac{(r-1)\bar\sigma^2-\Delta^v_{A_p}}{3}\Big\},\tilde\omega^v\Big)\Big\},$$
where (a) utilizes Lemma B.6 and Lemma B.1, (b) makes use of Lemma B.3, and (c) is due to Lemma B.2 and the definition of $G_{j,p}$.
(1) If $r \ge r_i + 2$, then for any $S\in R$ that contains item $i$,
$$\max\Big\{\frac{\Delta^v_S}{3},\ \bar\omega\,\frac{(r-1)\bar\sigma^2-\Delta^v_S}{3}\Big\} \ge \frac{\bar\omega}{1+\bar\omega}\cdot\frac{(r-1)\bar\sigma^2}{3},$$
where the inequality uses the fact that for $x, y \in \mathbb{R}_+$ and $z\in[0,1]$, $\max\{x,y\} \ge zx + (1-z)y$; we take $x = \frac{\Delta^v_{A_p}}{3}$, $y = \bar\omega\,\frac{(r-1)\bar\sigma^2 - \Delta^v_{A_p}}{3}$, and $z = \frac{\bar\omega}{1+\bar\omega}$, so that the $\Delta^v$ terms cancel. This yields the per-item bound $\frac{9C\gamma K\,\omega_{\mathrm{sum}}^2}{\ln(1/\tilde\omega^v)\,((r-1)\bar\sigma^2)^2}\big(2\ln\frac{1}{\tilde\omega^v}+\ln\ln^+\frac{\ln(1/\tilde\omega^v)}{(\Delta^v_{i,R})^2}+D\big)$.
(2) If $r \le r_i + 1$, the maximum is at least $\Delta^v_{i,R}/3$, yielding the per-item bound $\frac{9C\gamma K}{(\Delta^v_{i,R})^2}\big(2\ln\frac{1}{\tilde\omega^v}+\ln\ln^+\frac{\ln(1/\tilde\omega^v)}{(\Delta^v_{i,R})^2}+D\big)$.
Therefore, defining $g_{R,1}(r,\Delta^v_{i,R})$ as the corresponding piecewise function with changing point $\big\lfloor\frac{\omega_{\mathrm{sum}}\Delta^v_{i,R}}{\sqrt{\ln(1/\tilde\omega^v)}\,\bar\sigma^2}\big\rfloor + 2$ (i.e., $g_R$ in (22) below with $\omega^{vv}$ replaced by $\tilde\omega^v$), (18) can be upper bounded by $\sum_{i\in E} g_{R,1}(r, \Delta^v_{i,R})$.

Scenario 2: $\omega^v \le \tilde\omega^v$, i.e., $\bar\omega \le 1$. We have
$$\sum_{p=1}^{T'}\mathbf{1}\{A_p\in R\}\mathbf{1}\{U^v_{A_p}(p-1)>r\bar\sigma^2\} \le \sum_{p=1}^{T'}\mathbf{1}\Big\{A_p\in R,\ F_p\Big(\max\Big\{\bar\omega\,\frac{\Delta^v_{A_p}}{3},\ \frac{(r-1)\bar\sigma^2-\Delta^v_{A_p}}{3}\Big\},\omega^v\Big)\Big\} \le \sum_{j\in\mathbb{N}}\frac{1}{b_jK}\sum_{i\in E}\sum_{p=1}^{T'}\mathbf{1}\Big\{A_p\in R,\ i\in A_p,\ T_i(p-1)\le m_j\Big(\max\Big\{\bar\omega\,\frac{\Delta^v_{A_p}}{3},\ \frac{(r-1)\bar\sigma^2-\Delta^v_{A_p}}{3}\Big\},\omega^v\Big)\Big\},\tag{19}$$
by the same steps (a)–(c) with the merged confidence parameter $\omega^v$.
(1) If $r \ge r_i + 2$, then for any $S\in R$ that contains item $i$,
$$\max\Big\{\bar\omega\,\frac{\Delta^v_S}{3},\ \frac{(r-1)\bar\sigma^2-\Delta^v_S}{3}\Big\} \ge \frac{1}{1+\bar\omega}\cdot\frac{(r-1)\bar\sigma^2}{3},$$
where the inequality again uses the fact that for $x, y \in \mathbb{R}_+$ and $z\in[0,1]$, $\max\{x,y\} \ge zx+(1-z)y$; we take $x = \bar\omega\,\frac{\Delta^v_{A_p}}{3}$, $y = \frac{(r-1)\bar\sigma^2-\Delta^v_{A_p}}{3}$, and $z = \frac{\bar\omega}{1+\bar\omega}$.
Thus, the count above admits the per-item bound $\frac{9C\gamma K\,\omega_{\mathrm{sum}}^2}{\ln(1/\tilde\omega^v)\,((r-1)\bar\sigma^2)^2}\big(2\ln\frac{1}{\omega^v}+\ln\ln^+\frac{\ln(1/\omega^v)}{(\Delta^v_{i,R})^2}+D\big)$.
(2) If $r \le r_i + 1$, the maximum is at least $\bar\omega\,\Delta^v_{i,R}/3$, yielding the per-item bound $\frac{9C\gamma K}{(\Delta^v_{i,R})^2}\big(2\ln\frac{1}{\omega^v}+\ln\ln^+\frac{\ln(1/\omega^v)}{(\Delta^v_{i,R})^2}+D\big)$.
Therefore, defining $g_{R,2}(r,\Delta^v_{i,R})$ as the corresponding piecewise function with the same changing point (i.e., $g_R$ in (22) below with $\omega^{vv}$ replaced by $\omega^v$), (19) can be upper bounded by $\sum_{i\in E} g_{R,2}(r, \Delta^v_{i,R})$.

In conclusion, for $i\in E$, we denote $\Delta^v_{i,R} := \min_{S\ni i,\,S\in R}\Delta^v_S$, $\omega^{vv} := \max\{\omega^v, \tilde\omega^v\}$, $\omega_{\mathrm{sum}} := \sqrt{\ln(1/\omega^v)}+\sqrt{\ln(1/\tilde\omega^v)}$, and define
$$g_R(r,\Delta^v_{i,R}) := \begin{cases}\dfrac{9C\gamma K\,\omega_{\mathrm{sum}}^2}{\ln(1/\tilde\omega^v)\,((r-1)\bar\sigma^2)^2}\Big(2\ln\dfrac{1}{\omega^{vv}}+\ln\ln^+\dfrac{\ln(1/\omega^{vv})}{(\Delta^v_{i,R})^2}+D\Big), & r \ge \Big\lfloor\dfrac{\omega_{\mathrm{sum}}\Delta^v_{i,R}}{\sqrt{\ln(1/\tilde\omega^v)}\,\bar\sigma^2}\Big\rfloor+2,\\[6pt] \dfrac{9C\gamma K}{(\Delta^v_{i,R})^2}\Big(2\ln\dfrac{1}{\omega^{vv}}+\ln\ln^+\dfrac{\ln(1/\omega^{vv})}{(\Delta^v_{i,R})^2}+D\Big), & \text{otherwise};\end{cases}$$
then $\sum_{p=1}^{T'}\mathbf{1}\{A_p\in R\}\mathbf{1}\{U^v_{A_p}(p-1)>r\bar\sigma^2\} \le \sum_{i\in E} g_R(r, \Delta^v_{i,R})$.

Case 4: $A_p \in S_B^c$. In this case there is a comparison among $\omega^\mu$, $\omega^v$, and $\tilde\omega^v$; we thus denote $\omega_{\max} := \max\{\omega^\mu, \omega^v, \tilde\omega^v\}$ and
$$\bar\omega_1 := \sqrt{\frac{\ln(1/\omega_{\max})}{\ln(1/\omega^\mu)}},\qquad \bar\omega_2 := \sqrt{\frac{\ln(1/\omega_{\max})}{\ln(1/\omega^v)}},\qquad \bar\omega_3 := \sqrt{\frac{\ln(1/\omega_{\max})}{\ln(1/\tilde\omega^v)}}.$$
For $S\in S_B^c$, we denote $\bar{\bar\Delta}_S := \max\big\{\bar\omega_1\Delta_S,\ \bar\omega_3\,\frac{\Delta^v_S}{3}\big\}$. For $i\in E$, denote
$$\bar{\bar\Delta}_{i,S_B^c} := \frac{1}{\sqrt{\ln(1/\omega_{\max})}}\min_{S\ni i,\,S\in S_B^c}\max\Big\{\Delta_S\sqrt{\ln(1/\omega^\mu)},\ \frac{\Delta^v_S}{3}\sqrt{\ln(1/\tilde\omega^v)}\Big\} = \min_{S\ni i,\,S\in S_B^c}\bar{\bar\Delta}_S,$$
and assume $\bar{\bar\Delta}_{i,S_B^c} \in \big(r_i\frac{\bar\omega_2\bar\omega_3}{\bar\omega_2+\bar\omega_3}\frac{\bar\sigma^2}{3},\ (r_i+1)\frac{\bar\omega_2\bar\omega_3}{\bar\omega_2+\bar\omega_3}\frac{\bar\sigma^2}{3}\big]$ for some $r_i\in\mathbb{N}$. We have
$$\sum_{p=1}^{T'}\mathbf{1}\{A_p\in S_B^c\}\mathbf{1}\{U^v_{A_p}(p-1)>r\bar\sigma^2\}\mathbf{1}\{\mathcal{E}\} \le \sum_{p=1}^{T'}\mathbf{1}\Big\{A_p\in S_B^c,\ F_p(\Delta_{A_p},\omega^\mu),\ F_p\Big(\frac{\Delta^v_{A_p}}{3},\tilde\omega^v\Big),\ F_p\Big(\frac{(r-1)\bar\sigma^2-\Delta^v_{A_p}}{3},\omega^v\Big)\Big\}$$
$$\le \sum_{p=1}^{T'}\mathbf{1}\Big\{A_p\in S_B^c,\ F_p\Big(\max\Big\{\bar{\bar\Delta}_{A_p},\ \bar\omega_2\,\frac{(r-1)\bar\sigma^2-\Delta^v_{A_p}}{3}\Big\},\omega_{\max}\Big)\Big\} \le \sum_{j\in\mathbb{N}}\frac{1}{b_jK}\sum_{i\in E}\sum_{p=1}^{T'}\mathbf{1}\Big\{A_p\in S_B^c,\ i\in A_p,\ T_i(p-1)\le m_j\Big(\max\Big\{\bar{\bar\Delta}_{A_p},\ \bar\omega_2\,\frac{(r-1)\bar\sigma^2-\Delta^v_{A_p}}{3}\Big\},\omega_{\max}\Big)\Big\},$$
using Lemma B.6, Lemma B.1, Lemma B.3, Lemma B.2, and the definition of $G_{j,p}$ as in the previous cases.
(1) If $r \ge r_i + 2$, then for any $S\in S_B^c$ that contains item $i$,
$$\max\Big\{\bar{\bar\Delta}_S,\ \bar\omega_2\,\frac{(r-1)\bar\sigma^2-\Delta^v_S}{3}\Big\} \ge \frac{\bar\omega_2\bar\omega_3}{\bar\omega_2+\bar\omega_3}\cdot\frac{(r-1)\bar\sigma^2}{3},$$
where the inequality uses the fact that for $x, y \in \mathbb{R}_+$ and $z\in[0,1]$, $\max\{x,y\} \ge zx+(1-z)y$; we take $x = \bar{\bar\Delta}_{A_p}$, $y = \bar\omega_2\,\frac{(r-1)\bar\sigma^2-\Delta^v_{A_p}}{3}$, and $z = \frac{\bar\omega_2}{\bar\omega_2+\bar\omega_3}$.
Similar to the computations for the case $A_p\in R$, we have the per-item bound
$$\frac{C\gamma K\,(\bar\omega_2+\bar\omega_3)^2}{\bar\omega_2^2\bar\omega_3^2}\cdot\frac{9}{((r-1)\bar\sigma^2)^2}\Big(2\ln\frac{1}{\omega_{\max}}+\ln\ln^+\frac{\ln(1/\omega_{\max})}{(\bar{\bar\Delta}_{i,S_B^c})^2}+D\Big).$$
(2) If $r \le r_i + 1$, then for any $S\in S_B^c$ that contains item $i$, $\max\{\bar{\bar\Delta}_S,\ \bar\omega_2\,\frac{(r-1)\bar\sigma^2-\Delta^v_S}{3}\} \ge \bar{\bar\Delta}_{i,S_B^c}$; similar to the computations for the case $A_p\in R$, we obtain the per-item bound
$$\frac{C\gamma K}{(\bar{\bar\Delta}_{i,S_B^c})^2}\Big(2\ln\frac{1}{\omega_{\max}}+\ln\ln^+\frac{\ln(1/\omega_{\max})}{(\bar{\bar\Delta}_{i,S_B^c})^2}+D\Big).$$

Conclusion of Step 2. For $S^*$: denote
$$g_{S^*}(r, \Delta^v_{S^*}) := \frac{9C\gamma K}{((r-1)\bar\sigma^2+\Delta^v_{S^*})^2}\Big(\ln\frac{1}{\omega^v}+\ln\ln^+\frac{1}{(\Delta^v_{S^*})^2}+D\Big);\tag{20}$$
then $\sum_{p=1}^{T'}\mathbf{1}\{A_p=S^*\}\mathbf{1}\{U^v_{A_p}(p-1)>r\bar\sigma^2\} \le \sum_{i\in S^*} g_{S^*}(r, \Delta^v_{S^*})$.

For $S_B$: for $i\in E$, with $\bar\Delta_{i,S_B} := \min_{S\ni i,\,S\in S_B}\max\{\Delta_S\sqrt{\ln(1/\omega^\mu)},\ \frac{\Delta^v_S}{3}\sqrt{\ln(1/\omega^v)}\}$ and $\omega^{\mu v} := \max\{\omega^\mu,\omega^v\}$, let $g_{S_B}(r,\bar\Delta_{i,S_B})$ be the piecewise function defined above (21); then $\sum_{p=1}^{T'}\mathbf{1}\{A_p\in S_B\}\mathbf{1}\{U^v_{A_p}(p-1)>r\bar\sigma^2\} \le \sum_{i\in E} g_{S_B}(r,\bar\Delta_{i,S_B})$.

For $R$: for $i\in E$, with $\Delta^v_{i,R} := \min_{S\ni i,\,S\in R}\Delta^v_S$, $\omega^{vv} := \max\{\omega^v,\tilde\omega^v\}$, and $\omega_{\mathrm{sum}} := \sqrt{\ln(1/\omega^v)}+\sqrt{\ln(1/\tilde\omega^v)}$, let $g_R(r,\Delta^v_{i,R})$ be the piecewise function defined above (22); then $\sum_{p=1}^{T'}\mathbf{1}\{A_p\in R\}\mathbf{1}\{U^v_{A_p}(p-1)>r\bar\sigma^2\} \le \sum_{i\in E} g_R(r,\Delta^v_{i,R})$.

For $S_B^c$: for $i\in E$, with $\bar{\bar\Delta}_{i,S_B^c}$ and $\omega_{\max} := \max\{\omega^\mu, \omega^v, \tilde\omega^v\}$ as above, define
$$g_{S_B^c}(r,\bar{\bar\Delta}_{i,S_B^c}) := \begin{cases}\dfrac{9C\gamma K\,(\bar\omega_2+\bar\omega_3)^2}{\bar\omega_2^2\bar\omega_3^2\,((r-1)\bar\sigma^2)^2}\Big(2\ln\dfrac{1}{\omega_{\max}}+\ln\ln^+\dfrac{\ln(1/\omega_{\max})}{(\bar{\bar\Delta}_{i,S_B^c})^2}+D\Big), & r \ge r_i+2,\\[6pt] \dfrac{C\gamma K}{(\bar{\bar\Delta}_{i,S_B^c})^2}\Big(2\ln\dfrac{1}{\omega_{\max}}+\ln\ln^+\dfrac{\ln(1/\omega_{\max})}{(\bar{\bar\Delta}_{i,S_B^c})^2}+D\Big), & \text{otherwise};\end{cases}\tag{23}$$
then $\sum_{p=1}^{T'}\mathbf{1}\{A_p\in S_B^c\}\mathbf{1}\{U^v_{A_p}(p-1)>r\bar\sigma^2\} \le \sum_{i\in E} g_{S_B^c}(r,\bar{\bar\Delta}_{i,S_B^c})$.

According to the results of Step 2, given $r\in[Q-1]$, the event $U_p(r)$ can happen in at most
$$\sum_{p=1}^{T'}\mathbf{1}\{U_p(r)\} \le \sum_{i\in S^*} g_{S^*}(r,\Delta^v_{S^*}) + \sum_{i\in E} g_{S_B}(r,\bar\Delta_{i,S_B}) + \sum_{i\in E} g_R(r,\Delta^v_{i,R}) + \sum_{i\in E} g_{S_B^c}(r,\bar{\bar\Delta}_{i,S_B^c})\tag{24}$$
phases, where the $g$ functions are defined in (20), (21), (22), and (23). By Lemma B.4, (1) event $U_p(r)$ implies $r+1 \le n_p$, so (24) also bounds the number of phases in which at least $r+1$ sub-solutions are pulled; (2) event $U_p(r)\cap U_p(r+1)^c$ implies $r+1 \le n_p \le 2r+1$, i.e., at least $r+1$ and at most $2r+1$ sub-solutions are pulled at such a phase. For $r \in [Q-1]$, we denote
$$\bar T_r := \sum_{i\in S^*} g_{S^*}(r,\Delta^v_{S^*}) + \sum_{i\in E} g_{S_B}(r,\bar\Delta_{i,S_B}) + \sum_{i\in E} g_R(r,\Delta^v_{i,R}) + \sum_{i\in E} g_{S_B^c}(r,\bar{\bar\Delta}_{i,S_B^c}),\tag{25}$$
and set $\bar T_Q := 0$ and $\bar T_0 := \infty$.
Note that the $g$ functions increase as $r$ decreases, so if there exists $r^* \in [Q]$ such that $T' \in [\bar T_{r^*}, \bar T_{r^*-1})$, then
$$\sum_{r=1}^{Q-1}\sum_{p=1}^{T'}\mathbf{1}\{U_p(r)\} \le \sum_{r=1}^{r^*-1}\sum_{p=1}^{T'}\mathbf{1}\{U_p(r)\} + \sum_{r=r^*}^{Q-1}\sum_{p=1}^{T'}\mathbf{1}\{U_p(r)\} \le T'(r^*-1) + \sum_{r=r^*}^{Q-1}\Big[\sum_{i\in S^*} g_{S^*}(r,\Delta^v_{S^*}) + \sum_{i\in E} g_{S_B}(r,\bar\Delta_{i,S_B}) + \sum_{i\in E} g_R(r,\Delta^v_{i,R}) + \sum_{i\in E} g_{S_B^c}(r,\bar{\bar\Delta}_{i,S_B^c})\Big] = T'(r^*-1) + H(r^*, \Lambda),$$
where
$$H(r^*,\Lambda) := \sum_{i\in S^*} h_{S^*}(r^*,\Delta^v_{S^*}) + \sum_{i\in E} h_{S_B}(r^*,\bar\Delta_{i,S_B}) + \sum_{i\in E} h_R(r^*,\Delta^v_{i,R}) + \sum_{i\in E} h_{S_B^c}(r^*,\bar{\bar\Delta}_{i,S_B^c})\tag{26}$$
and the $h$ functions are defined at the end of the proof. This indicates that when $T' \in [\bar T_{r^*}, \bar T_{r^*-1})$, the upper bound on the high-probability regret due to safeness-checking $R_2(T')$ (hence on the total regret) grows linearly in $T'$ with slope $2\mu^*(r^*-1)$; in particular, it grows linearly when $T' \le \bar T_1$ and remains constant when $T' > \bar T_1$. We can also compute an upper bound on the number of solutions pulled during the $T'$ phases when $T' \in [\bar T_{r^*}, \bar T_{r^*-1})$:
$$\bar T_{Q-1}(2Q-1) + (\bar T_{Q-2}-\bar T_{Q-1})(2Q-3) + \dots + (\bar T_{r^*}-\bar T_{r^*+1})(2r^*+1) + (T'-\bar T_{r^*})(2r^*-1)$$
$$= (2r^*-1)T' + 2\big[\bar T_{Q-1}(Q-r^*) + (\bar T_{Q-2}-\bar T_{Q-1})(Q-1-r^*) + \dots + (\bar T_{r^*}-\bar T_{r^*+1})\big] \le (2r^*-1)T' + 2H(r^*,\Lambda).$$
Thus, the number of pulled solutions exceeds $(2r^*-1)T'$ by at most $2H(1,\Lambda)$. (Footnote: the upper bound on the regret is $2\mu^*[T'(r^*-1) + H(r^*,\Lambda)]$ and the upper bound on the number of solutions is $(2r^*-1)T' + 2H(r^*,\Lambda)$, which indicates that we can roughly use $\min\{T\mu^*,\ 2\mu^* H(r^*,\Lambda)\}$ to bound the regret due to safeness-checking within $T$ time steps.)

In conclusion, the regret due to safeness-checking can be upper bounded by
$$R_2(T') = 2\mu^*\sum_{r=1}^{Q-1}\sum_{p=1}^{T'}\mathbf{1}\{U_p(r)\} \le 2\mu^* T'(r^*-1) + 2\mu^*\Big[\sum_{i\in S^*} h_{S^*}(r^*,\Delta^v_{S^*}) + \sum_{i\in E} h_{S_B}(r^*,\bar\Delta_{i,S_B}) + \sum_{i\in E} h_R(r^*,\Delta^v_{i,R}) + \sum_{i\in E} h_{S_B^c}(r^*,\bar{\bar\Delta}_{i,S_B^c})\Big] = 2\mu^* T'(r^*-1) + 2\mu^* H(r^*,\Lambda),$$
where $T' \in [\bar T_{r^*}, \bar T_{r^*-1})$ with $\bar T_{r^*}$ defined in (25), and, for $i\in E$, the gaps $\bar\Delta_{i,S_B}$, $\Delta^v_{i,R}$, and $\bar{\bar\Delta}_{i,S_B^c}$ are as defined in Step 2.

The $h$ functions. (For convenience, we restate the notation: $\omega^{\mu v} := \max\{\omega^\mu, \omega^v\}$, $\omega^{vv} := \max\{\omega^v, \tilde\omega^v\}$, $\omega_{\max} := \max\{\omega^\mu, \omega^v, \tilde\omega^v\}$, and $\omega_{\mathrm{sum}} := \sqrt{\ln(1/\omega^v)}+\sqrt{\ln(1/\tilde\omega^v)}$.)

For $S^*$: for each $i\in S^*$,
$$h_{S^*}(r^*, \Delta^v_{S^*}) := \sum_{r=r^*}^{Q-1} g_{S^*}(r, \Delta^v_{S^*}) \le \begin{cases}\dfrac{18\,C\gamma K}{(r^*-1)\,\bar\sigma^4}\Big(\ln\dfrac{1}{\omega^v}+\ln\ln^+\dfrac{1}{(\Delta^v_{S^*})^2}+D\Big), & r^* \ge 2,\\[6pt] \Big(\dfrac{9}{(\Delta^v_{S^*})^2}+\dfrac{18}{\bar\sigma^4}\Big)C\gamma K\Big(\ln\dfrac{1}{\omega^v}+\ln\ln^+\dfrac{1}{(\Delta^v_{S^*})^2}+D\Big), & r^* = 1.\end{cases}\tag{27}$$

For $S_B$: for each $i\in E$, there is a changing point $r_i + 2$ with $r_i = \big\lfloor\bar\Delta_{i,S_B}/(\bar\sigma^2\sqrt{\ln(1/\omega^v)})\big\rfloor$ in $g_{S_B}(r, \bar\Delta_{i,S_B})$; thus, writing for brevity $P_i := 2\ln\frac{1}{\omega^{\mu v}} + \ln\ln^+\frac{\ln(1/\omega^{\mu v})}{\bar\Delta^2_{i,S_B}} + D$,
$$h_{S_B}(r^*, \bar\Delta_{i,S_B}) := \sum_{r=r^*}^{Q-1} g_{S_B}(r, \bar\Delta_{i,S_B}) \le \begin{cases}(Q-r^*)\,\dfrac{C\gamma K}{\bar\Delta^2_{i,S_B}}\,P_i, & r_i + 2 > Q-1,\\[4pt] \dfrac{2\,C\gamma K}{(r^*-1)\,\bar\sigma^4\ln(1/\omega^v)}\,P_i, & r_i + 2 \le r^* \le Q-1,\\[4pt] (r_i + 2 - r^*)\,\dfrac{C\gamma K}{\bar\Delta^2_{i,S_B}}\,P_i + \dfrac{2\,C\gamma K}{(r_i+1)\,\bar\sigma^4\ln(1/\omega^v)}\,P_i, & \text{otherwise}.\end{cases}\tag{28}$$

For $R$: for each $i\in E$, there is a changing point $r_i + 2$ with $r_i = \big\lfloor\omega_{\mathrm{sum}}\Delta^v_{i,R}/(\sqrt{\ln(1/\tilde\omega^v)}\,\bar\sigma^2)\big\rfloor$; thus, writing $P'_i := 2\ln\frac{1}{\omega^{vv}} + \ln\ln^+\frac{\ln(1/\omega^{vv})}{(\Delta^v_{i,R})^2} + D$,
$$h_R(r^*, \Delta^v_{i,R}) := \sum_{r=r^*}^{Q-1} g_R(r, \Delta^v_{i,R}) \le \begin{cases}(Q-r^*)\,\dfrac{9\,C\gamma K}{(\Delta^v_{i,R})^2}\,P'_i, & r_i + 2 > Q-1,\\[4pt] \dfrac{2\,C\gamma K}{(r^*-1)}\Big(\dfrac{3\,\omega_{\mathrm{sum}}}{\sqrt{\ln(1/\tilde\omega^v)}\,\bar\sigma^2}\Big)^2 P'_i, & r_i + 2 \le r^* \le Q-1,\\[4pt] (r_i + 2 - r^*)\,\dfrac{9\,C\gamma K}{(\Delta^v_{i,R})^2}\,P'_i + \dfrac{2\,C\gamma K}{(r_i+1)}\Big(\dfrac{3\,\omega_{\mathrm{sum}}}{\sqrt{\ln(1/\tilde\omega^v)}\,\bar\sigma^2}\Big)^2 P'_i, & \text{otherwise}.\end{cases}\tag{29}$$

For $S_B^c$: for each $i\in E$, there is a changing point $r_i + 2$ with $r_i = \big\lfloor 3(\bar\omega_2+\bar\omega_3)\bar{\bar\Delta}_{i,S_B^c}/(\bar\omega_2\bar\omega_3\,\bar\sigma^2)\big\rfloor$; thus, writing $P''_i := 2\ln\frac{1}{\omega_{\max}} + \ln\ln^+\frac{\ln(1/\omega_{\max})}{(\bar{\bar\Delta}_{i,S_B^c})^2} + D$,
$$h_{S_B^c}(r^*, \bar{\bar\Delta}_{i,S_B^c}) := \sum_{r=r^*}^{Q-1} g_{S_B^c}(r, \bar{\bar\Delta}_{i,S_B^c}) \le \begin{cases}(Q-r^*)\,\dfrac{C\gamma K}{(\bar{\bar\Delta}_{i,S_B^c})^2}\,P''_i, & r_i + 2 > Q-1,\\[4pt] \dfrac{2\,C\gamma K}{(r^*-1)}\Big(\dfrac{3(\bar\omega_2+\bar\omega_3)}{\bar\omega_2\bar\omega_3\,\bar\sigma^2}\Big)^2 P''_i, & r_i + 2 \le r^* \le Q-1,\\[4pt] (r_i + 2 - r^*)\,\dfrac{C\gamma K}{(\bar{\bar\Delta}_{i,S_B^c})^2}\,P''_i + \dfrac{2\,C\gamma K}{(r_i+1)}\Big(\dfrac{3(\bar\omega_2+\bar\omega_3)}{\bar\omega_2\bar\omega_3\,\bar\sigma^2}\Big)^2 P''_i, & \text{otherwise}.\end{cases}\tag{30}$$

B.5. Proofs of Theorem 4.1 and Theorem 5.1

Theorem 4.1 (Problem-dependent upper bound). Let $\Lambda = (E, \mathcal{A}_K, \nu, \bar\sigma^2)$ be an instance and let $\{\delta_T\}_{T=1}^\infty \in o(1)$ be a sequence that satisfies $\ln(1/\delta_T) = o(T^b)$ for all $b > 0$ (i.e., $\{\delta_T\}$ is not exponentially decaying). Then PASCOMBUCB is a $\{\delta_T\}_{T=1}^\infty$-variance-constrained consistent algorithm. More precisely, given a time budget $T$, the probably anytime-safe constraint is satisfied and the regret of PASCOMBUCB is upper bounded by
$$\mathrm{Reg}(T) \le \min\{T\mu^*,\ \mathrm{Reg}_1(T) + \mathrm{Reg}_2(T)\} + \mathrm{Reg}_3(T),$$
$$\mathrm{Reg}_1(T) = O\Big(\sum_{i\in E\setminus S^*}\frac{K\ln T}{\Delta_{i,S_B,\min}} + \sum_{i\in E}\frac{c_i K\ln T}{\Delta_{i,S_B^c,\min}}\Big),\qquad \mathrm{Reg}_2(T) = 2\mu^* H(\nabla(\Lambda)),\qquad \mathrm{Reg}_3(T) = 2\mu^*(L+1),$$
where $\nabla(\Lambda) = \{\Delta^v_{S^*}\} \cup \{\Delta^v_{i,R}, \Psi_{i,S_B}, \Phi_{i,S_B^c}\}_{i\in E}$ and $H(\nabla(\Lambda)) := H(1, \Lambda)$ is defined in (26) in App. B.4.

Proof of Theorem 4.1. According to Lemma 6.1, Lemma 6.2, and Lemma 6.3, taking $\omega^\mu = \tilde\omega^v = 1/T^2$ and $\omega^v = \delta_T/T^2$, the expected regret over $T'$ phases can be upper bounded as
$$\mathbb{E}[R(T)] \le \mathbb{E}[R_1(T')\mid\mathcal{E}] + \mathbb{E}[R_2(T')\mid\mathcal{E}] + R_3(T)\tag{31}$$
$$\le O\Big(\sum_{i\in E\setminus S^*}\frac{K}{\Delta_{i,S_B,\min}}\ln\frac{1}{\omega^\mu} + \sum_{i\in E}\frac{c_iK}{\Delta_{i,S_B^c,\min}}\ln\frac{1}{\tilde\omega^v}\Big) + 2\mu^*\big[T'(r^*-1) + H(r^*,\Lambda)\big] + 2\mu^* L + 2\mu^* TL\big(\xi(\omega^\mu)+2\xi(\omega^v)+2\xi(\tilde\omega^v)\big)$$
$$\le O\Big(\sum_{i\in E\setminus S^*}\frac{K\ln T}{\Delta_{i,S_B,\min}} + \sum_{i\in E}\frac{c_iK\ln T}{\Delta_{i,S_B^c,\min}}\Big) + 2\mu^* H(1,\Lambda) + 2\mu^* L + 2\mu^* TL\big(3\xi(1/T^2) + 2\xi(\delta_T/T^2)\big) = \mathrm{Reg}_1(T) + \mathrm{Reg}_2(T) + \mathrm{Reg}_3(T).$$
On the other hand, the high-probability regret (14) can be naively bounded as $\hat R(T') \le T\mu^*$. Therefore, we have
$$\mathbb{E}[R(T)] \le \mathbb{E}[\hat R(T')\mid\mathcal{E}] + \mathrm{Reg}_3(T) \le T\mu^* + \mathrm{Reg}_3(T).\tag{32}$$
Together, (31) and (32) give the final upper bound with $T$ time steps.

Theorem 5.1 (Problem-independent upper bound). Let $\{\delta_T\}_{T=1}^\infty \in o(1)$ be a sequence that satisfies $\ln(1/\delta_T) = o(T^b)$ for all $b > 0$. If $T > L$, then for any instance $\Lambda$ with variance gaps lower bounded by $\Delta^v \le \min_{S\in\mathcal{A}_K}\Delta^v_S$, the regret of PASCOMBUCB is upper bounded by
$$O\Big(\sqrt{KLT\ln T} + \frac{LK^2}{(\Delta^v)^2}\ln T\Big).$$

Proof of Theorem 5.1. We first deal with the regret due to suboptimality $R_1(T')$. Let $\bar\Delta^\mu$ be a constant to be chosen. Then
$$\sum_{p=1}^{T'}\mathbf{1}\{A_p\in\mathcal{B}\}\Delta_{A_p} \le \sum_{p=1}^{T'}\mathbf{1}\{\Delta_{A_p}\ge\bar\Delta^\mu\}\mathbf{1}\{A_p\in\mathcal{B}\}\Delta_{A_p} + \sum_{p=1}^{T'}\mathbf{1}\{\Delta_{A_p}<\bar\Delta^\mu\}\mathbf{1}\{A_p\in\mathcal{B}\}\Delta_{A_p} \le \sum_{p=1}^{T'}\mathbf{1}\{\Delta_{A_p}\ge\bar\Delta^\mu\}\mathbf{1}\{A_p\in\mathcal{B}\}\Delta_{A_p} + T\bar\Delta^\mu,$$
where the second term makes use of the fact that $T' \le T$ w.p. $1$. The first term can be upper bounded by adopting the proof of Lemma 6.2 with the constraint that $\Delta_{A_p} \ge \bar\Delta^\mu$. Thus, for $i\in E\setminus S^*$, $\Delta_{i,S_B,\min} \ge \bar\Delta^\mu$.
For $i\in E$, $\Delta_{i,S_B^c,\min} \ge \bar\Delta^\mu$ and $\bar\Delta_{i,S_B^c} \ge \max\{\bar\omega\bar\Delta^\mu,\ \Delta^v/3\} = \max\{\bar\Delta^\mu,\ \Delta^v/3\} =: \bar\Delta^{\mu v}$, where $\bar\omega := \sqrt{\ln(1/\tilde\omega^v)/\ln(1/\omega^\mu)} = 1$. The regret due to suboptimality can be upper bounded by
$$\sum_{p=1}^{T'}\mathbf{1}\{\Delta_{A_p}\ge\bar\Delta^\mu\}\mathbf{1}\{A_p\in\mathcal{B}\}\Delta_{A_p} \le \sum_{i\in E\setminus S^*}\frac{2C\gamma K}{\Delta_{i,S_B,\min}}\Big(\ln\frac{1}{\omega^\mu}+\ln\ln^+\frac{1}{\Delta^2_{i,S_B,\min}}+D\Big) + \sum_{i\in E}\frac{2c_iC\gamma K}{\Delta_{i,S_B^c,\min}}\Big(\ln\frac{1}{\tilde\omega^v}+\ln\ln^+\frac{1}{(\bar\Delta_{i,S_B^c})^2}+D\Big) \le \frac{4C\gamma KL}{\bar\Delta^\mu}\Big(2\ln T+\ln\ln^+\frac{1}{(\bar\Delta^\mu)^2}+D\Big).$$
By taking $\bar\Delta^\mu = \sqrt{KL\ln(TL)/T}$, the regret due to suboptimality is bounded by
$$\sum_{p=1}^{T'}\mathbf{1}\{\Delta_{A_p}\ge\bar\Delta^\mu\}\mathbf{1}\{A_p\in\mathcal{B}\}\Delta_{A_p} + T\bar\Delta^\mu \le \sqrt{KLT\ln(TL)} + 8C\gamma\sqrt{KLT\ln T}\,\Big(1+\frac{\ln\ln^+ T}{\ln T}\Big) = O\big(\sqrt{KLT\ln(TL)}\big),$$
where we utilize $T \ge L$.

We then deal with the regret due to safeness-checking $R_2(T')$. According to Lemma 6.3, we only need to upper bound $H(1,\Lambda)$, i.e., the $h$ functions defined in (27), (28), (29), and (30).

For $h_{S^*}(1, \Delta^v_{S^*})$:
$$h_{S^*}(1, \Delta^v_{S^*}) \le h_{S^*}(1, \Delta^v) \le \Big(\frac{9}{(\Delta^v)^2}+\frac{18}{\bar\sigma^4}\Big)C\gamma K\Big(\ln\frac{1}{\omega^v}+\ln\ln^+\frac{1}{(\Delta^v)^2}+D\Big) = O\Big(\frac{K}{(\Delta^v)^2}\ln\frac{1}{\omega^v}\Big) = O\Big(\frac{K}{(\Delta^v)^2}\ln\frac{T}{\delta_T}\Big).$$

For $h_{S_B}(1, \bar\Delta_{i,S_B})$: define $\bar\Delta^{\mu v} := \max\{\bar\Delta^\mu\sqrt{\ln(1/\omega^\mu)},\ \Delta^v\}$; we have $\bar\Delta_{i,S_B} \ge \bar\Delta^{\mu v}$, and the corresponding changing point is $\big\lfloor\bar\Delta^{\mu v}/(\bar\sigma^2\sqrt{\ln(1/\omega^v)})\big\rfloor + 2$, which is of order $\sqrt{KL\ln(TL/\delta_T)}$. By (28),
$$h_{S_B}(1, \bar\Delta_{i,S_B}) \le h_{S_B}(1, \bar\Delta^{\mu v}) = O\Big((Q\wedge r_i)\,\frac{K}{(\bar\Delta^{\mu v})^2}\ln\frac{1}{\omega^{\mu v}}\Big).\tag{33}$$
In particular, in the asymptotic case where $T\to\infty$ and $\ln\frac{1}{\delta_T} = o(T^b)$ for all $b>0$ (this includes the scenario where $\delta$ is fixed with respect to $T$), we have $h_{S_B}(1,\bar\Delta^{\mu v}) = O\big(\frac{K}{(\bar\Delta^{\mu v})^2}\ln T\big) = O\big(\frac{K}{(\Delta^v)^2}\ln T\big)$.

For $h_R(1, \Delta^v_{i,R})$: we have $\Delta^v_{i,R} \ge \Delta^v$; furthermore, $\omega_{\mathrm{sum}} = \Theta\big(\sqrt{\ln(TL/\delta_T)}\big)$ and the changing point is $\big\lfloor\omega_{\mathrm{sum}}\Delta^v/(\sqrt{\ln(1/\tilde\omega^v)}\,\bar\sigma^2)\big\rfloor + 2$. By (29),
$$h_R(1, \Delta^v_{i,R}) \le h_R(1, \Delta^v) = O\Big((Q\wedge r_i)\,\frac{K}{(\Delta^v)^2}\ln\frac{1}{\omega^{vv}}\Big).\tag{34}$$
In particular, in the asymptotic case where $T\to\infty$ and $\ln\frac{1}{\delta_T} = o(T^b)$ for all $b>0$, we have $h_R(1, \Delta^v) = O\big(\frac{K}{(\Delta^v)^2}\ln T\big)$.

For $S_B^c$: with $\bar\Delta^{\mu v} := \max\{\bar\Delta^\mu\sqrt{\ln(1/\omega^\mu)},\ \Delta^v\}$, we have $\bar{\bar\Delta}_{i,S_B^c} \ge \bar\Delta^{\mu v}$ and the changing point is $\big\lfloor 3(\bar\omega_2+\bar\omega_3)\bar\Delta^{\mu v}/(\bar\omega_2\bar\omega_3\bar\sigma^2)\big\rfloor + 2$. By (30),
$$h_{S_B^c}(1, \bar{\bar\Delta}_{i,S_B^c}) \le h_{S_B^c}(1, \bar\Delta^{\mu v}) = O\Big((Q\wedge r_i)\,\frac{K}{(\bar\Delta^{\mu v})^2}\ln\frac{1}{\omega_{\max}}\Big).\tag{35}$$
In particular, in the asymptotic case where $T\to\infty$ and $\ln\frac{1}{\delta_T} = o(T^b)$ for all $b>0$, we have $h_{S_B^c}(1, \bar\Delta^{\mu v}) = O\big(\frac{K}{(\Delta^v)^2}\ln T\big)$.

Finally,
$$R_3(T) = 2\mu^* L + 2\mu^* TL\big(\xi(\omega^\mu)+2\xi(\omega^v)+2\xi(\tilde\omega^v)\big) \le 2KL + K\delta_T + 2KTL\cdot 4\,\frac{2+\epsilon}{\epsilon}\Big(\frac{1}{T^2\ln(1+\epsilon)}\Big)^{1+\epsilon} \le 2KL + K\delta_T + 4K\,\frac{2+\epsilon}{\epsilon}\Big(\frac{1}{\ln(1+\epsilon)}\Big)^{1+\epsilon}\cdot O(1),$$
where we utilize $T > L$, and the $O(\cdot)$ notation refers to the fact that the preceding term is bounded as a function of $T$. Note that $\mu^* \le K$, so $T\mu^* + K\delta_T \le TK + K\delta_T$.

In conclusion, according to Theorem 4.1, for any $T > L$, the problem-independent upper bound is the minimum of $TK + K\delta_T$ and
$$O\big(\sqrt{KLT\ln T}\big) + K\Big[\sum_{i\in S^*}h_{S^*}(1,\Delta^v) + \sum_{i\in E}h_{S_B}(1,\bar\Delta^{\mu v}) + \sum_{i\in E}h_R(1,\Delta^v) + \sum_{i\in E}h_{S_B^c}(1,\bar\Delta^{\mu v})\Big] \le O\big(\sqrt{KLT\ln T}\big) + O\Big(K^3 + KL\big[h_{S_B}(1,\bar\Delta^{\mu v}) + h_R(1,\Delta^v) + h_{S_B^c}(1,\bar\Delta^{\mu v})\big]\Big),$$
where the $h$ functions are bounded in (33), (34), and (35). In the asymptotic case where $T\to\infty$ and $\ln\frac{1}{\delta_T} = o(T^b)$ for all $b>0$ (this includes the scenario where $\delta$ is fixed with respect to $T$), the asymptotic problem-independent upper bound is
$$O\big(\sqrt{KLT\ln T}\big) + K\sum_{i\in E} O\Big(\frac{K}{(\Delta^v)^2}\ln T\Big) = O\big(\sqrt{KLT\ln T}\big) + O\Big(\frac{LK^2}{(\Delta^v)^2}\ln T\Big),$$
where we utilize $\ln\ln^+\frac{1}{(\Delta^v)^2} \le \ln T$ when $T$ is sufficiently large.
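The choice $\bar\Delta^\mu = \sqrt{KL\ln(TL)/T}$ in the proof above is the usual square-root balancing of the two suboptimality terms. A one-line check (our own restatement of the calculation, not an excerpt from the paper):

```latex
% Balance the small-gap term T\bar\Delta^\mu against the large-gap term
% (KL/\bar\Delta^\mu)\ln(TL):
T\bar\Delta^\mu = \frac{KL\ln(TL)}{\bar\Delta^\mu}
\;\Longleftrightarrow\;
\bar\Delta^\mu = \sqrt{\frac{KL\ln(TL)}{T}},
\qquad\text{giving}\quad T\bar\Delta^\mu = \sqrt{KLT\ln(TL)}.
```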
C. Proofs of the Lower Bounds

C.1. Preliminaries and the Impossibility Result

Let $\mathrm{KL}(\nu, \nu')$ denote the KL divergence between distributions $\nu$ and $\nu'$, and let $d(x, y) := x\ln\frac{x}{y} + (1-x)\ln\frac{1-x}{1-y}$ denote the Kullback–Leibler divergence between the Bernoulli distributions $\mathrm{Bern}(x)$ and $\mathrm{Bern}(y)$.

Lemma C.1 (Pinsker's and reverse Pinsker's inequality). Consider two probability mass functions $P_X, P_Y$ defined on the same discrete probability space $\mathcal{A} \subseteq [0,1]$. The following inequalities hold:
$$\big|\mathbb{E}_{X\sim P_X}[X] - \mathbb{E}_{Y\sim P_Y}[Y]\big| \le \delta(P_X, P_Y) \le \sqrt{\tfrac{1}{2}\mathrm{KL}(P_X, P_Y)} \le \sqrt{\tfrac{1}{\alpha_Y}}\,\delta(P_X, P_Y),$$
where $\delta(P_X, P_Y) := \sup_{A\subseteq\mathcal{A}}\big|\sum_{a\in A}P_X(a) - \sum_{a\in A}P_Y(a)\big| = \frac{1}{2}\sum_{a\in\mathcal{A}}|P_X(a) - P_Y(a)|$ is the total variation distance, and $\alpha_Y := \min_{a\in\mathcal{A}: P_Y(a)>0} P_Y(a)$.

Lemma C.2 (Lemma 1 in Kaufmann et al. (2016)). Assume the distributions under instance $\Lambda_1 = (E, \mathcal{A}_K, \nu^{(1)}, \bar\sigma^2)$ and instance $\Lambda_2 = (E, \mathcal{A}_K, \nu^{(2)}, \bar\sigma^2)$ are mutually absolutely continuous. Given a time budget $T$,
$$\sum_{i=1}^{L}\mathbb{E}_{\Lambda_1}[N_i(T)]\,\mathrm{KL}\big(\nu^{(1)}_i, \nu^{(2)}_i\big) \ge \sup_{\mathcal{E}\in\mathcal{H}^{\Lambda_1}_T} d\big(\mathbb{P}_{\Lambda_1}(\mathcal{E}), \mathbb{P}_{\Lambda_2}(\mathcal{E})\big),$$
where $N_i(t)$ denotes the number of time steps item $i$ is selected up to and including time step $t$, and $\mathcal{H}^{\Lambda_1}_T$ is the set of all possible events generated by instance $\Lambda_1$ and algorithm $\pi$ within $T$ time steps.

Lemma C.3. Let solution $S$ containing $|S| = m$ ($q < m \le K$) items be a safe solution under instance $\Lambda_1 = (E, \mathcal{A}_K, \nu^{(1)}, \bar\sigma^2)$. Each item in $S$ is i.i.d. with reward distribution $\nu_1$, mean $\mu_1$, and variance $\sigma_1^2 < \bar\sigma^2/m$. Define the events $E_{(t,1)} = \{S\ \text{is identified as safe after time step}\ t\}$, $E_{(t,2)} = \{S\ \text{is chosen at least once after time step}\ t\}$, and $E_{(t)} = E_{(t,1)}\cap E_{(t,2)}$. Assume there exists $\tau \le T$ such that $\mathbb{P}_{\Lambda_1}[E_{(\tau)}] \ge 1-\delta$ and $\mathbb{P}_{\Lambda_1}[E_{(\tau-1,1)}] < 1-\delta$. If such a $\tau$ exists, we have
$$\sum_{i\in S}\mathbb{E}_{\Lambda_1}[N_i(\tau)] \ge \sup_{\nu_2\in\mathcal{E}(\nu_1)}\frac{d(\delta, 1-\delta)}{\mathrm{KL}(\nu_1, \nu_2)}.$$
Furthermore,
$$\mathbb{E}_{\Lambda_1}[M(\tau)] \ge \sup_{\nu_2\in\mathcal{E}(\nu_1)}\frac{1}{|S|-1}\cdot\frac{d(\delta, 1-\delta)}{\mathrm{KL}(\nu_1, \nu_2)} := T(\nu^{(1)}),$$
where $M(t)$ is the number of times that a solution $S' \subsetneq S$ is sampled up to and including time step $t$, and $\mathcal{E}(\nu_1) = \{\nu_2 : \text{the variance associated with}\ \nu_2\ \text{is larger than}\ \bar\sigma^2/|S|\}$.

Proof. With $\sigma_2^2 > \bar\sigma^2/|S|$, we construct an alternative instance $\Lambda_2 = (E, \mathcal{A}_K, \nu_2, \bar\sigma^2)$ under which each item in $S$ has reward distribution $\nu_2$, mean $\mu_2$, and variance $\sigma_2^2$, while the distributions of the other items remain unchanged. Since $S$ is unsafe under instance $\Lambda_2$ and all the chosen solutions $\{S_t\}_{t=1}^T \subseteq \mathcal{A}_K$ must be safe with probability at least $1-\delta$, we have $\mathbb{P}_{\Lambda_2}[E_{(t)}] < \delta$ for all $t \le T$. We now apply Lemma C.2 to obtain
$$\sum_{i\in S}\mathbb{E}_{\Lambda_1}[N_i(\tau)]\,\mathrm{KL}(\nu_1, \nu_2) \ge d\big(\mathbb{P}_1(E^c_{(\tau)}), \mathbb{P}_2(E^c_{(\tau)})\big) \ge d(\delta, 1-\delta) \quad\Longrightarrow\quad \sum_{i\in S}\mathbb{E}_{\Lambda_1}[N_i(\tau)] \ge \frac{d(\delta, 1-\delta)}{\mathrm{KL}(\nu_1, \nu_2)}.$$
Since $\mathbb{P}_{\Lambda_1}[E_{(\tau-1,1)}] < 1-\delta$, at most $m-1$ items of $S$ can be selected at any one time step among the first $\tau$ time steps. Therefore, $\mathbb{E}_{\Lambda_1}[M(\tau)] \ge \frac{1}{m-1}\sum_{i\in S}\mathbb{E}_{\Lambda_1}[N_i(\tau)] \ge \frac{1}{m-1}\cdot\frac{d(\delta, 1-\delta)}{\mathrm{KL}(\nu_1, \nu_2)}$.
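The Bernoulli divergence $d(x,y)$ and the Pinsker bound of Lemma C.1 are easy to sanity-check numerically; the following short Python sketch (our own check, not the paper's code) also illustrates the sample-complexity scale $d(\delta, 1-\delta) \ge \ln\frac{1}{2.4\delta}$ used in Lemma C.3:

```python
import math

def bern_kl(x, y):
    """KL divergence d(x, y) between Bern(x) and Bern(y), for x, y in (0, 1)."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

# Pinsker for Bernoullis: |x - y| <= sqrt(d(x, y) / 2); check on a grid.
for x in [0.1, 0.3, 0.5]:
    for y in [0.2, 0.4, 0.6]:
        assert abs(x - y) <= math.sqrt(bern_kl(x, y) / 2) + 1e-12

# d(delta, 1 - delta) is comparable to ln(1/(2.4*delta)) for small delta.
delta = 1e-3
print(bern_kl(delta, 1 - delta), math.log(1 / (2.4 * delta)))  # ~6.89 vs ~6.03
```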
Theorem 4.5 (Impossibility result). Let $\{\delta_T\}_{T=1}^\infty \in o(1)$ be a sequence for which there exists $b\in(0,1]$ such that $\ln(1/\delta_T) = \Omega(T^b)$. For any instance $\Lambda$, the regret of any algorithm is lower bounded by $\Omega(T^b)$.

Proof. The proof is similar to that of Lemma C.3. We consider an alternative instance $\Lambda_2 = (E, \mathcal{A}_K, \nu^{(2)}, \bar\sigma^2)$ with the distributions of the items in the optimal safe solution $S^*$ changed such that (assuming the variances of all the items in $S^*$ are increased)
$$\sum_{i\in S^*}(\sigma^{(2)}_i)^2 > \bar\sigma^2,$$
i.e., under instance $\Lambda_2$, the solution $S^*$ is unsafe (and thus not optimal safe). The other items remain unchanged. By an argument similar to the proof of Lemma C.3, we have
$$\sum_{i\in S^*}\mathbb{E}_{\Lambda_1}[N_i(\tau)]\,\mathrm{KL}\big(\nu^{(1)}_i, \nu^{(2)}_i\big) \ge d(\delta, 1-\delta) \quad\Longrightarrow\quad \sum_{i\in S^*}\mathbb{E}_{\Lambda_1}[N_i(\tau)] \ge \frac{d(\delta, 1-\delta)}{\min_{i\in S^*}\mathrm{KL}(\nu^{(1)}_i, \nu^{(2)}_i)}.$$
So the safeness checking of $S^*$ takes $\Omega\big(\ln\frac{1}{2.4\delta_T}\big) = \Omega(T^b)$ time steps. Recall the probably anytime-safe constraint (1): $\mathbb{P}[\forall t\in[T],\ S_t\in\mathcal{S}] \ge 1-\delta_T$. This indicates that at time step $t$, for any solution $S$, if $\mathbb{P}[S\in\mathcal{S}\mid H_{t-1}] < 1-\delta_T$, then $S$ cannot be selected at this time step; otherwise, (1) is violated. Therefore, before the safeness of the optimal safe solution $S^*$ is ascertained, $S^*$ will not be sampled, and the instantaneous regret is lower bounded by $\min_{S\in S_B}\Delta_S$. In conclusion, the regret is at least $\ln\frac{1}{2.4\delta_T}\cdot\min_{S\in S_B}\Delta_S = \Omega(T^b)$.

We derive both the problem-dependent and problem-independent lower bounds on the $K$-path semi-bandit problem. The items in the ground set are divided into $L_0$ paths $P_1, \dots, P_{L_0}$, each of which contains $K$ unique items; path $P_j$ contains items $(j-1)K+1, \dots, jK$. Without loss of generality, we assume that $L_0 = L/K$ is an integer. A set $S$ is a solution if and only if $S \subseteq P_j$ for some $j$; in other words, the solution set is $\mathcal{A}_K = \{S : S\subseteq P_j,\ j = 1,\dots,L_0\}$. We let $N_i(t)$ denote the number of time steps item $i$ is selected up to and including time step $t$, $M_j(t)$ denote the number of time steps a safe subset of path $j$ is selected up to and including time step $t$, and $S_j(t)$ denote the set of time steps at which a safe subset of path $j$ is selected, i.e.,
$$N_i(t) = \sum_{s=1}^{t}\mathbf{1}\{i\in S_s\},\qquad M_j(t) = \sum_{s=1}^{t}\mathbf{1}\{S_s\subseteq P_j,\ S_s\ \text{is safe}\},\qquad S_j(t) = \{1\le s\le t : S_s\subseteq P_j,\ S_s\ \text{is safe}\}.$$
We let $\mathrm{Reg}_{[j]}(t)$ denote the regret accumulated over $S_j(t)$. Since all the chosen solutions are safe with probability at least $1-\delta$, we have $\mathrm{Reg}(T) \ge (1-\delta)\sum_{j=1}^{L_0}\mathrm{Reg}_{[j]}(T)$. In the following, we lower bound the regret accumulated over the time steps where safe solutions are chosen. We will construct several instances (to be specified in each proof) such that under instance $k$, i.e., $\Lambda_k = (E, \mathcal{A}_K, \nu^{(k)}, \bar\sigma^2)$, the stochastic rewards of the items in path $j$ ($1\le j\le L_0$) are i.i.d., i.e., $\nu^{(k)}_i = \nu^{(k)}_{P_j}$ for all $i\in P_j$. Under instance $k$, we define several other notations as follows: let $W_i(t)^{(k)}$ be the random reward of item $i$ at time step $t$, let $S^{(k)}_t$ be the pulled solution at time step $t$, and let $H^{(k)}_t = \{(S^{(k)}_s, \{W_i(s)^{(k)}\}_{i\in S^{(k)}_s})\}_{s=1}^t$ be the sequence of selected solutions and observed rewards up to and including time step $t$. For simplicity, we abbreviate $\mathbb{E}_{H^{(k)}_T}$ and $\mathbb{P}_{H^{(k)}_T}$ as $\mathbb{E}_k$ and $\mathbb{P}_k$, respectively.
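The $K$-path layout used throughout Appendix C is simple enough to spell out in a few lines of Python (illustrative values only):

```python
# A toy K-path semi-bandit layout: L0 = L/K disjoint paths of K items each;
# a set is a feasible solution iff it is contained in a single path.
K, L = 3, 12
L0 = L // K
paths = {j: list(range((j - 1) * K + 1, j * K + 1)) for j in range(1, L0 + 1)}
print(paths)  # {1: [1, 2, 3], 2: [4, 5, 6], 3: [7, 8, 9], 4: [10, 11, 12]}

def is_solution(S):
    """S is feasible iff S is a subset of some path P_j."""
    return any(set(S) <= set(items) for items in paths.values())

assert is_solution([1, 2]) and not is_solution([3, 4])
```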
C.2. Problem-dependent Lower Bound

In order to provide a better understanding of the analysis, we first derive a lower bound with Gaussian (unbounded) distributions in Theorem C.4. With the same technique, we derive a lower bound with bounded Bernoulli distributions in Theorem 4.3, which matches our problem setup.

Theorem C.4 (Problem-dependent lower bound for sub-Gaussian instances). Let $\{\delta_T\}_{T=1}^\infty \in o(1)$ be a sequence that satisfies $\ln(1/\delta_T) = o(T^b)$ for all $b > 0$. There exists an instance $\Lambda$ such that, for any $\{\delta_T\}_{T\in\mathbb{N}}$-variance-constrained consistent algorithm $\pi$, the regret is lower bounded by
$$\Omega\Big(\sum_{i\in E\setminus S^*}\frac{\ln T}{K\,\Delta_{i,S_B,\min}} + \frac{K\ln(1/\delta_T)}{(\Delta^v_{S^*})^2} + \sum_{i\in E}\Big[\Psi^*_{i,S_B} + \frac{\ln T}{(\Delta^v_{i,R})^2} + \Phi_{i,S_B^c}\Big]\Big).$$

Proof. Given a vanishing sequence $\{\delta_T\}_{T\in\mathbb{N}}$, we consider a fixed $\{\delta_T\}_{T\in\mathbb{N}}$-variance-constrained consistent algorithm $\pi$ on the $K$-path semi-bandit problem. For simplicity, the distributions of the items are assumed to be Gaussian in the proof, but the techniques carry over to the Bernoulli case; we provide the instance design and the corresponding bound at the end of the proof. Under instance $\Lambda_0$ (the base instance), the item distributions are
$$\nu^{(0)}_i = \mathcal{N}\Big(\mu,\ \frac{\bar\sigma^2-\epsilon^v}{K}\Big),\ i\in P_1;\quad \nu^{(0)}_i = \mathcal{N}\Big(\mu-\frac{\epsilon^\mu}{K},\ \frac{\bar\sigma^2-\epsilon^v}{K}\Big),\ i\in P_j,\ 2\le j\le L_1+1;\quad \nu^{(0)}_i = \mathcal{N}\Big(\mu+\frac{\epsilon^\mu}{K},\ \frac{\bar\sigma^2+\epsilon^v}{K}\Big),\ i\in P_j,\ L_1+2\le j\le L_2+1;\quad \nu^{(0)}_i = \mathcal{N}\Big(\mu-\frac{\epsilon^\mu}{K},\ \frac{\bar\sigma^2+\epsilon^v}{K}\Big),\ i\in P_j,\ L_2+2\le j\le L_0,$$
where $\epsilon^\mu$ and $\epsilon^v$ are small positive constants (e.g., $\epsilon^\mu = \mu/(2K)$, $\epsilon^v \le \bar\sigma^2/K^2$), and $L_1 = L_2-L_1 = L_0-L_2-1 = (L_0-1)/3$ (assumed to be an integer). In this case, path 1 is an optimal safe path, paths $2$ to $L_1+1$ are the safe and suboptimal paths, paths $L_1+2$ to $L_2+1$ are the risky paths, and paths $L_2+2$ to $L_0$ are the unsafe and suboptimal paths. In the following, we compute the minimum regret incurred on each of the paths.

Case 1: the optimal safe path $P_1$. In order to achieve $o(T^a)$ regret for all $a>0$, any algorithm has to identify the safeness of the optimal safe solution $P_1$ and sample $P_1$ $\Omega(T)$ times; otherwise, the regret is linear. According to Lemma C.3, the expected number of time steps needed for the safeness identification of $P_1$ is lower bounded by
$$\mathbb{E}_0[M_1(\tau)] \ge \sup_{\nu^{(1)}_{P_1}\in\mathcal{E}(\nu^{(0)}_{P_1})}\frac{1}{K-1}\cdot\frac{d(\delta_T, 1-\delta_T)}{\mathrm{KL}\big(\nu^{(0)}_{P_1}, \nu^{(1)}_{P_1}\big)} := T(\nu^{(0)}_{P_1}),$$
where $\mathcal{E}(\nu^{(0)}_{P_1}) = \{\nu^{(1)}_{P_1} : \text{the variance associated with}\ \nu^{(1)}_{P_1}\ \text{is larger than}\ \bar\sigma^2/K\}$. In particular, we let instance $\Lambda_1 = (E, \mathcal{A}_K, \nu^{(1)}, \bar\sigma^2)$ have
$$\nu^{(1)}_{P_1} = \mathcal{N}\Big(\mu,\ \frac{\bar\sigma^2+\epsilon^v_1}{K}\Big),\qquad \nu^{(1)}_i = \nu^{(0)}_i\ \text{for}\ i\notin P_1,$$
where $\epsilon^v_1$ is an arbitrarily small constant. Taking the supremum over $\epsilon^v_1 > 0$, we have
$$T(\nu^{(0)}_{P_1}) \ge \frac{1}{K-1}\cdot\frac{\ln\frac{1}{2.4\delta_T}}{\mathrm{KL}\big(\mathcal{N}(\mu, \frac{\bar\sigma^2-\epsilon^v}{K}), \mathcal{N}(\mu, \frac{\bar\sigma^2}{K})\big)} \ge \frac{1}{K-1}\cdot\frac{4K^2(\bar\sigma^2/K)^2}{(\epsilon^v)^2}\ln\frac{1}{2.4\delta_T}.$$
When a solution $S \subsetneq P_1$ is chosen before this time step, the instantaneous regret is lower bounded by $\mu$. The accumulated regret from $P_1$ is thus lower bounded by
$$\mathrm{Reg}_{[1]}(T) \ge \mu\, T(\nu^{(0)}_{P_1}) = \Omega\Big(\frac{K}{(\Delta^v_{P_1})^2}\ln\frac{1}{\delta_T}\Big).$$

Case 2: the safe and suboptimal paths. For any safe and suboptimal path $P_j$ ($2\le j\le L_1+1$), we define instance $\Lambda_j = (E, \mathcal{A}_K, \nu^{(j)}, \bar\sigma^2)$ with
$$\nu^{(j)}_{P_j} = \mathcal{N}\Big(\mu+\frac{\epsilon^\mu_j}{K},\ \frac{\bar\sigma^2-\epsilon^v}{K}\Big),\qquad \nu^{(j)}_i = \nu^{(0)}_i\ \text{for}\ i\notin P_j,$$
where $\epsilon^\mu_j < \mu$ is an arbitrarily small constant, so that $P_j$ is the optimal safe solution under instance $j$. Fix any item $i\in P_j$ and consider the event $E_j = \{N_i(T) \ge T/2\}$: under instance $0$, $E_j$ implies that the optimal safe solution $P_1$ is sampled fewer than $T/2$ times; under instance $j$, $E_j^c$ implies that the optimal safe solution $P_j$ is sampled fewer than $T/2$ times. Therefore,
$$\mathrm{Reg}_{\Lambda_0}(T) + \mathrm{Reg}_{\Lambda_j}(T) \ge \frac{T}{2}\epsilon^\mu\,\mathbb{P}_0[E_j] + \frac{T}{2}\epsilon^\mu_j\,\mathbb{P}_j[E_j^c] \ge \frac{T}{2}\min\{\epsilon^\mu, \epsilon^\mu_j\}\big(\mathbb{P}_0[E_j] + \mathbb{P}_j[E_j^c]\big).$$
According to Lemma C.1 and Lemma C.2, we have
$$\mathbb{P}_0[E_j] + \mathbb{P}_j[E_j^c] \ge \frac{1}{2}\exp\{-\mathrm{KL}(\mathbb{P}_0, \mathbb{P}_j)\},\qquad \mathrm{KL}(\mathbb{P}_0, \mathbb{P}_j) = \sum_{i=1}^{L}\mathbb{E}_0[N_i(T)]\,\mathrm{KL}\big(\nu^{(0)}_i, \nu^{(j)}_i\big) = \sum_{i\in P_j}\mathbb{E}_0[N_i(T)]\,\mathrm{KL}\big(\nu^{(0)}_i, \nu^{(j)}_i\big).$$
Hence,
$$\mathrm{Reg}_{\Lambda_0}(T) + \mathrm{Reg}_{\Lambda_j}(T) \ge \frac{T}{4}\min\{\epsilon^\mu, \epsilon^\mu_j\}\exp\{-\mathrm{KL}(\mathbb{P}_0, \mathbb{P}_j)\} \quad\Longrightarrow\quad \mathrm{KL}(\mathbb{P}_0, \mathbb{P}_j) \ge \ln\frac{T\min\{\epsilon^\mu, \epsilon^\mu_j\}/4}{\mathrm{Reg}_{\Lambda_0}(T) + \mathrm{Reg}_{\Lambda_j}(T)} \overset{(*)}{=} \Omega(\ln T),$$
$$\mathrm{KL}(\mathbb{P}_0, \mathbb{P}_j) = \sum_{i\in P_j}\mathbb{E}_0[N_i(T)]\,\mathrm{KL}\big(\nu^{(0)}_{P_j}, \nu^{(j)}_{P_j}\big) = \sum_{i\in P_j}\mathbb{E}_0[N_i(T)]\cdot\frac{\big(\frac{\epsilon^\mu+\epsilon^\mu_j}{K}\big)^2}{2\,\frac{\bar\sigma^2-\epsilon^v}{K}} \quad\overset{(**)}{\Longrightarrow}\quad \sum_{i\in P_j}\mathbb{E}_0[N_i(T)] \ge \ln T\cdot\frac{2K(\bar\sigma^2-\epsilon^v)}{(\epsilon^\mu)^2} =: T(\nu^{(0)}_{P_j}),$$
where in $(*)$ we let $T\to\infty$ and note that both $\mathrm{Reg}_{\Lambda_0}(T)$ and $\mathrm{Reg}_{\Lambda_j}(T)$ are of order $o(T^a)$ for all $a>0$, and in $(**)$ we take the supremum over $\epsilon^\mu_j > 0$. We also have to take the safeness constraint into consideration.
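The Gaussian KL computations in Cases 1 and 2 reduce to a standard closed form; a quick numerical check in Python (the closed form is standard, not taken from the paper) also verifies the small-perturbation scale $(\epsilon^v)^2/(4\,\mathrm{var}^2)$ driving the safeness-checking complexity:

```python
import math

def kl_gauss(mu1, v1, mu2, v2):
    """KL( N(mu1, v1) || N(mu2, v2) ) with variances v1, v2 > 0."""
    return 0.5 * (v1 / v2 + (mu1 - mu2) ** 2 / v2 - 1 - math.log(v1 / v2))

# Same-mean case: KL ~ (v1 - v2)^2 / (4 v2^2) when v1 is close to v2.
v2 = 0.25
for eps in [1e-2, 1e-3, 1e-4]:
    v1 = v2 - eps
    print(eps, kl_gauss(0.0, v1, 0.0, v2), eps ** 2 / (4 * v2 ** 2))
```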
According to Lemma C.3, in order to check the safeness of $P_j$, the items in $P_j$ have to be sampled
$$\sum_{i\in P_j}\mathbb{E}_{\Lambda_0}[N_i(\tau)] \ge \sup_{\nu'_{P_j}\in\mathcal{E}(\nu^{(0)}_{P_j})}\frac{d(\delta_T, 1-\delta_T)}{\mathrm{KL}\big(\nu^{(0)}_{P_j}, \nu'_{P_j}\big)} \ge \frac{4K^2(\bar\sigma^2/K)^2}{(\epsilon^v)^2}\ln\frac{1}{2.4\delta_T} =: T_{\mathrm{safe}}(\nu^{(0)}_{P_j})$$
times. If $T(\nu^{(0)}_{P_j}) \le T_{\mathrm{safe}}(\nu^{(0)}_{P_j})$, the suboptimality of $P_j$ is identified before its safeness. Moreover, whenever a solution $S\subsetneq P_j$ is sampled, $S$ can have at most $K-1$ items, so the regret is lower bounded by
$$\frac{T(\nu^{(0)}_{P_j})}{K-1}\Big(\epsilon^\mu + \frac{\mu}{K}\Big) \ge \frac{2K(\bar\sigma^2-\epsilon^v)}{(\epsilon^\mu)^2}\Big(\epsilon^\mu + \frac{\mu}{K}\Big)\frac{\ln T}{K} \quad\Longrightarrow\quad \mathrm{Reg}_{[j]}(T) = \Omega\Big(\frac{K(\Delta_{P_j}+\mu)}{\Delta^2_{P_j}}\ln T\Big).\tag{36}$$
If $T(\nu^{(0)}_{P_j}) \ge T_{\mathrm{safe}}(\nu^{(0)}_{P_j})$, the suboptimality of $P_j$ is identified after its safeness. Thus, whenever a solution $S\subsetneq P_j$ is sampled before the safeness checking is finished, $S$ can have at most $K-1$ items, so the regret is lower bounded by
$$\frac{T_{\mathrm{safe}}(\nu^{(0)}_{P_j})}{K-1}\Big(\epsilon^\mu + \frac{\mu}{K}\Big) + \big[T(\nu^{(0)}_{P_j}) - T_{\mathrm{safe}}(\nu^{(0)}_{P_j})\big]\epsilon^\mu \quad\Longrightarrow\quad \mathrm{Reg}_{[j]}(T) = \Omega\Big(\frac{K}{\Delta_{P_j}}\ln T + \Big(\Delta_{P_j} + \frac{\mu}{K}\Big)\frac{K}{(\Delta^v_{P_j})^2}\ln\frac{1}{\delta_T}\Big).\tag{37}$$
Based on (36) and (37), the regret is
$$\mathrm{Reg}_{[j]}(T) = \Omega\Big(\frac{K}{\Delta_{P_j}}\ln T + \min\Big\{\frac{K}{\Delta^2_{P_j}}\ln T,\ \frac{K}{(\Delta^v_{P_j})^2}\ln\frac{1}{\delta_T}\Big\}\Big).$$

Case 3: the risky paths. For any risky path $P_j$ ($L_1+2\le j\le L_2+1$), we define instance $\Lambda_j = (E, \mathcal{A}_K, \nu^{(j)}, \bar\sigma^2)$ with
$$\nu^{(j)}_{P_j} = \mathcal{N}\Big(\mu+\frac{\epsilon^\mu}{K},\ \frac{\bar\sigma^2-\epsilon^v_j}{K}\Big),\qquad \nu^{(j)}_i = \nu^{(0)}_i\ \text{for}\ i\notin P_j,$$
where $\epsilon^v_j$ is an arbitrarily small constant, so that $P_j$ is the optimal safe solution under instance $j$. Fix any item $i\in P_j$ and consider the event $E_j = \{N_i(T)\ge T/2\}$: under instance $0$, $E_j$ implies that the optimal safe solution $P_1$ is sampled fewer than $T/2$ times; under instance $j$, $E_j^c$ implies that the optimal safe solution $P_j$ is sampled fewer than $T/2$ times. Therefore,
$$\mathrm{Reg}_{\Lambda_0}(T) + \mathrm{Reg}_{\Lambda_j}(T) \ge \frac{T}{K}\epsilon^\mu\,\mathbb{P}_0[E_j] + \frac{T}{2}\epsilon^\mu\,\mathbb{P}_j[E_j^c] \ge \frac{T}{2}\cdot\frac{\epsilon^\mu}{K}\big(\mathbb{P}_0[E_j] + \mathbb{P}_j[E_j^c]\big).$$
According to Lemma C.1 and Lemma C.2, we have $\mathbb{P}_0[E_j] + \mathbb{P}_j[E_j^c] \ge \frac{1}{2}\exp\{-\mathrm{KL}(\mathbb{P}_0, \mathbb{P}_j)\}$ with $\mathrm{KL}(\mathbb{P}_0, \mathbb{P}_j) = \sum_{i\in P_j}\mathbb{E}_0[N_i(T)]\,\mathrm{KL}(\nu^{(0)}_i, \nu^{(j)}_i)$; letting $T\to\infty$ and noting that both regrets are of order $o(T^a)$ for all $a>0$, we obtain
$$\sum_{i\in P_j}\mathbb{E}_0[N_i(T)] \ge \frac{4K^2(\bar\sigma^2/K)^2}{(\epsilon^v)^2}\ln T =: T(\nu^{(0)}_{P_j}).$$
Note that $P_j$ is a risky path, so if any safe solution $S\subsetneq P_j$ is sampled, then $|S|\le K-1$. Thus $\mathbb{E}[M_j(T)] \ge T(\nu^{(0)}_{P_j})/(K-1)$, and the regret is lower bounded by
$$\mathrm{Reg}_{[j]}(T) \ge \frac{\mu}{K}\cdot\frac{1}{K-1}\cdot\frac{4K^2(\bar\sigma^2/K)^2}{(\epsilon^v)^2}\ln T = \Omega\Big(\frac{K}{(\Delta^v_{P_j})^2}\ln T\Big).$$

Case 4: the unsafe and suboptimal paths. For any unsafe and suboptimal path $P_j$ ($L_2+2\le j\le L_0$), we define instance $\Lambda_j = (E, \mathcal{A}_K, \nu^{(j)}, \bar\sigma^2)$ with
$$\nu^{(j)}_{P_j} = \mathcal{N}\Big(\mu+\frac{\epsilon^\mu_j}{K},\ \frac{\bar\sigma^2-\epsilon^v_j}{K}\Big),\qquad \nu^{(j)}_i = \nu^{(0)}_i\ \text{for}\ i\notin P_j,$$
where $\epsilon^\mu_j < \mu$ and $\epsilon^v_j < \bar\sigma^2/K$ are arbitrarily small constants, so that $P_j$ is the optimal safe solution under instance $j$. Fix any item $i\in P_j$ and consider the event $E_j = \{N_i(T)\ge T/2\}$: under instance $0$, $E_j$ implies that the optimal safe solution $P_1$ is sampled fewer than $T/2$ times; under instance $j$, $E_j^c$ implies that the optimal safe solution $P_j$ is sampled fewer than $T/2$ times.
Therefore,
$$\mathrm{Reg}_{\Lambda_0}(T) + \mathrm{Reg}_{\Lambda_j}(T) \ge \frac{T}{2}\epsilon^\mu\,\mathbb{P}_0[E_j] + \frac{T}{2}\epsilon^\mu_j\,\mathbb{P}_j[E_j^c] \ge \frac{T}{2}\min\{\epsilon^\mu, \epsilon^\mu_j\}\big(\mathbb{P}_0[E_j] + \mathbb{P}_j[E_j^c]\big).$$
According to Lemma C.1 and Lemma C.2, we have $\mathbb{P}_0[E_j] + \mathbb{P}_j[E_j^c] \ge \frac{1}{2}\exp\{-\mathrm{KL}(\mathbb{P}_0, \mathbb{P}_j)\}$ with $\mathrm{KL}(\mathbb{P}_0, \mathbb{P}_j) = \sum_{i\in P_j}\mathbb{E}_0[N_i(T)]\,\mathrm{KL}(\nu^{(0)}_i, \nu^{(j)}_i)$. Hence,
$$\mathrm{Reg}_{\Lambda_0}(T) + \mathrm{Reg}_{\Lambda_j}(T) \ge \frac{T}{4}\min\{\epsilon^\mu, \epsilon^\mu_j\}\exp\{-\mathrm{KL}(\mathbb{P}_0, \mathbb{P}_j)\} \quad\overset{(*)}{\Longrightarrow}\quad \sum_{i\in P_j}\mathbb{E}_0[N_i(T)] \ge \frac{\ln T}{\mathrm{KL}\big(\nu^{(0)}_{P_j}, \nu^{(j)}_{P_j}\big)} \ge \frac{\ln T}{\frac{(\epsilon^\mu_j+2\epsilon^\mu)^2}{2(\bar\sigma^2-\epsilon^v_j)/K} + \frac{(\epsilon^v)^2}{4K^2((\bar\sigma^2-\epsilon^v_j)/K)^2}} =: T(\nu^{(0)}_{P_j}),$$
where in $(*)$ we let $T\to\infty$, note that both $\mathrm{Reg}_{\Lambda_0}(T)$ and $\mathrm{Reg}_{\Lambda_j}(T)$ are of order $o(T^a)$ for all $a>0$, and take the supremum over $\epsilon^\mu_j > 0$ and $\epsilon^v_j > 0$. Note that $P_j$ is an unsafe path, so if any safe solution $S\subsetneq P_j$ is sampled, then $|S|\le K-1$. Thus $\mathbb{E}[M_j(T)] \ge T(\nu^{(0)}_{P_j})/(K-1)$, and the regret is lower bounded by
$$\mathrm{Reg}_{[j]}(T) \ge \frac{\mu}{K}\cdot\frac{T(\nu^{(0)}_{P_j})}{K-1} = \Omega\Big(\frac{K + K\epsilon^\mu}{\max\{\epsilon^\mu, \epsilon^v\}^2}\ln T\Big) = \Omega\Big(\min\Big\{\frac{K}{\Delta^2_{P_j}}\ln T,\ \frac{K}{(\Delta^v_{P_j})^2}\ln T\Big\}\Big).$$

In conclusion, the regret accumulated over these paths is lower bounded by
$$(1-\delta_T)\Big[\mathrm{Reg}_{[1]}(T) + \sum_{j=2}^{L_1+1}\mathrm{Reg}_{[j]}(T) + \sum_{j=L_1+2}^{L_2+1}\mathrm{Reg}_{[j]}(T) + \sum_{j=L_2+2}^{L_0}\mathrm{Reg}_{[j]}(T)\Big]$$
$$\ge \Omega\Big(\frac{K}{(\epsilon^v)^2}\ln\frac{1}{\delta_T}\Big) + \sum_{j=2}^{L_1+1}\Omega\Big(\frac{K}{\epsilon^\mu}\ln T + \min\Big\{\frac{K}{(\epsilon^\mu)^2}\ln T,\ \frac{K}{(\epsilon^v)^2}\ln\frac{1}{\delta_T}\Big\}\Big) + \sum_{j=L_1+2}^{L_2+1}\Omega\Big(\frac{K}{(\epsilon^v)^2}\ln T\Big) + \sum_{j=L_2+2}^{L_0}\Omega\Big(\min\Big\{\frac{K}{(\epsilon^\mu)^2}\ln T,\ \frac{K}{(\epsilon^v)^2}\ln T\Big\}\Big).$$
Recall that $L_1 = L_2-L_1 = L_0-L_2-1 = (L_0-1)/3$ and $\mu^* = K\mu$. For $S^*$, $\epsilon^v = \Delta^v_{S^*}$. For $S\in S_B$ and $i\in S$, we can check that if $S\subseteq P_j$ for some $2\le j\le L_1+1$, then $\Delta_{i,S_B,\min} = \epsilon^\mu$ and $\Psi^*_{i,S_B} \ge \min\big\{\frac{\ln T}{\Delta^2_S},\ \frac{9\ln(1/\delta_T)}{(\Delta^v_S)^2}\big\}$, where the equality holds when $S = P_j$ for some $2\le j\le L_1+1$. For $S\in R$ and $i\in S$, $\epsilon^v = \Delta^v_{i,R}$. For $S\in S_B^c$ and $i\in S$, we can easily check $S = P_j$ for some $L_2+2\le j\le L_0$; thus $\Delta_{i,S_B^c,\min} = \epsilon^\mu$ and $\Phi_{i,S_B^c} = \min\big\{\frac{\ln T}{\Delta^2_S},\ \frac{9\ln T}{(\Delta^v_S)^2}\big\}$. Hence,
$$\mathrm{Reg}(T) \ge \Omega\Big(\sum_{i\in E\setminus S^*}\frac{\ln T}{K\,\Delta_{i,S_B,\min}} + \frac{K\ln(1/\delta_T)}{(\Delta^v_{S^*})^2} + \sum_{i\in E}\Big[\Psi^*_{i,S_B} + \frac{\ln T}{(\Delta^v_{i,R})^2} + \Phi_{i,S_B^c}\Big]\Big).$$

Theorem 4.3 (Problem-dependent lower bound). Let $\{\delta_T\}_{T=1}^\infty \in o(1)$ be a sequence that satisfies $\ln(1/\delta_T) = o(T^b)$ for all $b > 0$. There exists an instance $\Lambda$ such that, for any $\{\delta_T\}_{T\in\mathbb{N}}$-variance-constrained consistent algorithm $\pi$, the regret is lower bounded by
$$\Omega\Big(\sum_{i\in E\setminus S^*}\frac{\ln T}{K\,\Delta_{i,S_B,\min}} + \frac{K\ln(1/\delta_T)}{(\Delta^v_{S^*})^2} + \sum_{i\in E}\Big[\Psi^*_{i,S_B} + \frac{\ln T}{(\Delta^v_{i,R})^2} + \Phi_{i,S_B^c}\Big]\Big).$$

Proof. We divide the whole ground set into several paths. We also define four sets $E_1, \dots, E_4$ such that $E_i\cap E_j = \emptyset$ for any $i\neq j$; $P_1 = E_1$ and $P_j\subseteq E_j$ for $j = 2, 3, 4$; and for any $j > 4$ there exists $k\neq 1$ such that $P_j\subseteq E_k$. For any path $P_j$, if $P_j\subseteq E_4$, then $|P_j| = K_2$; otherwise $|P_j| = K_1$. Path $P_j$ consists of the items
$$\sum_{l=1}^{j-1}|P_l|+1,\ \dots,\ \sum_{l=1}^{j-1}|P_l| + K_1\,\mathbf{1}\{P_j\nsubseteq E_4\} + K_2\,\mathbf{1}\{P_j\subseteq E_4\}.$$
The feasible solution set $\mathcal{A}_K$ consists of all subsets of each path. We let $\mathrm{Bern}(a)$ denote the Bernoulli distribution with parameter $a\in(0,1)$; note that the variance of $\mathrm{Bern}(a)$ is $a(1-a)$. We construct an instance with $\nu_i = \mathrm{Bern}(\mu_i)$. With $\epsilon_1, \epsilon_2, \epsilon_3 > 0$, we set $\mu_i = \mu$ for $i\in E_1$, $\mu_i = \mu-\epsilon_1$ for $i\in E_2$, $\mu_i = \mu+\epsilon_2$ for $i\in E_3$, and $\mu_i = \mu-\epsilon_3$ for $i\in E_4$. We let $2\le K_1 < K_2\le K$, $\mu < 1/2$, and
$$K_1\mu(1-\mu) < \bar\sigma^2\ \text{(paths in}\ E_1\ \text{and}\ E_2\ \text{are safe)},\quad K_1(\mu+\epsilon_2)(1-\mu-\epsilon_2) > \bar\sigma^2\ \text{(paths in}\ E_3\ \text{are unsafe)},\quad K_2(\mu-\epsilon_3)(1-\mu+\epsilon_3) > \bar\sigma^2\ \text{(paths in}\ E_4\ \text{are unsafe)},$$
$$K_2(\mu-\epsilon_3) < K_1\mu\ \text{(paths in}\ E_4\ \text{are suboptimal)},\quad (K_1-1)(\mu+\epsilon_2) < K_1\mu \iff (K_1-1)\epsilon_2 < \mu \iff \epsilon_2 < \frac{\mu}{K_1-1}\ \text{(paths in}\ E_3\ \text{are suboptimal or unsafe)}.$$
The conditions above indicate that: $P_1$ is the unique optimal safe set; if $P_j\subseteq E_2$, $P_j$ and its subsets are safe but suboptimal; if $P_j\subseteq E_3$, $P_j$ is risky and its proper subsets are suboptimal; if $P_j\subseteq E_4$, $P_j$ and its subsets are suboptimal, and $P_j$ is unsafe.
(A figure in the original depicts the ordering of the means $\mu_i$ across $E_1, \dots, E_4$.) With a proof similar to that of Theorem C.4, we have: the accumulated regret from the optimal path $P_1$ is lower bounded by
$$\mathrm{Reg}_{[1]}(T) \ge \Omega\Big(\frac{K_1}{(\Delta^v_{P_1})^2}\ln\frac{1}{\delta_T}\Big);$$
the accumulated regret from a safe and suboptimal path in $E_2$ is lower bounded by
$$\mathrm{Reg}_{[j]}(T) \ge \Omega\Big(\frac{K_1}{\Delta_{P_j}}\ln T + \min\Big\{\frac{K_1}{\Delta^2_{P_j}}\ln T,\ \frac{K_1}{(\Delta^v_{P_j})^2}\ln\frac{1}{\delta_T}\Big\}\Big);$$
the accumulated regret from a risky path in $E_3$ is lower bounded by
$$\mathrm{Reg}_{[j]}(T) \ge \Omega\Big(\frac{K_1}{(\Delta^v_{P_j})^2}\ln T\Big);$$
and the accumulated regret from an unsafe and suboptimal path in $E_4$ is lower bounded by
$$\mathrm{Reg}_{[j]}(T) \ge \Omega\Big(\min\Big\{\frac{K_2}{\Delta^2_{P_j}}\ln T,\ \frac{K_2}{(\Delta^v_{P_j})^2}\ln T\Big\}\Big).$$
Set $\epsilon^\mu = \min_j\{\Delta_{P_j}\}$ and $\epsilon^v = \min_j\{\Delta^v_{P_j}\}$, and set $K_1, K_2, \epsilon_1, \epsilon_2, \epsilon_3$ such that $K_1 \ge K_2/4$, $K_2 = K$, $\min_j\{\Delta_{P_j}\} < 2\epsilon^\mu$, and $\min_j\{\Delta^v_{P_j}\} < 2\epsilon^v$. In conclusion, the regret accumulated over these paths is lower bounded by
$$(1-\delta_T)\Big[\mathrm{Reg}_{[1]}(T) + \sum_{P_j\subseteq E_2}\mathrm{Reg}_{[j]}(T) + \sum_{P_j\subseteq E_3}\mathrm{Reg}_{[j]}(T) + \sum_{P_j\subseteq E_4}\mathrm{Reg}_{[j]}(T)\Big]$$
$$\ge \Omega\Big(\frac{K}{(\epsilon^v)^2}\ln\frac{1}{\delta_T}\Big) + |E_2|\,\Omega\Big(\frac{K}{\epsilon^\mu}\ln T + \min\Big\{\frac{K}{(\epsilon^\mu)^2}\ln T,\ \frac{K}{(\epsilon^v)^2}\ln\frac{1}{\delta_T}\Big\}\Big) + |E_3|\,\Omega\Big(\frac{K}{(\epsilon^v)^2}\ln T\Big) + |E_4|\,\Omega\Big(\min\Big\{\frac{K}{(\epsilon^\mu)^2}\ln T,\ \frac{K}{(\epsilon^v)^2}\ln T\Big\}\Big).$$
Recall $\mu^* = K_1\mu$. For $S^*$, $\epsilon^v = \Delta^v_{S^*}$. For $S\in S_B$ and $i\in S$, we can check that if $S\subseteq P_j$ with $P_j\subseteq E_2$, then $\Delta_{i,S_B,\min} = \epsilon^\mu$ and $\Psi^*_{i,S_B} \ge \min\big\{\frac{\ln T}{\Delta^2_S},\ \frac{9\ln(1/\delta_T)}{(\Delta^v_S)^2}\big\}$, where the equality holds when $S = P_j$ with $P_j\subseteq E_2$. For $S\in R$ and $i\in S$, $\epsilon^v = \Delta^v_{i,R}$. For $S\in S_B^c$ and $i\in S$, we can easily check $S = P_j$ for some $P_j\subseteq E_4$; thus $\Delta_{i,S_B^c,\min} = \epsilon^\mu$ and $\Phi_{i,S_B^c} = \min\big\{\frac{\ln T}{\Delta^2_S},\ \frac{9\ln T}{(\Delta^v_S)^2}\big\}$. Hence,
$$\mathrm{Reg}(T) \ge \Omega\Big(\sum_{i\in E\setminus S^*}\frac{\ln T}{K\,\Delta_{i,S_B,\min}} + \frac{K\ln(1/\delta_T)}{(\Delta^v_{S^*})^2} + \sum_{i\in E}\Big[\Psi^*_{i,S_B} + \frac{\ln T}{(\Delta^v_{i,R})^2} + \Phi_{i,S_B^c}\Big]\Big).$$

C.3. Problem-independent Lower Bound

Theorem 5.2 (Problem-independent lower bound). Let the minimum variance gap be $\Delta^v := \min_{S\in\mathcal{A}_K}\Delta^v_S$. When $K^3 \le L^2$, we have
$$\mathrm{Reg}(T) = \Omega\Big(\sqrt{KLT} + \min\Big\{\frac{L}{(\Delta^v)^2}\ln\frac{1}{\delta},\ T\Big\}\Big).$$

Proof. Since the rewards of the items are bounded in $[0,1]$, the variance of each item is at most $1/4$. Therefore, when $\bar\sigma^2 \in [K/4, \infty)$, any solution in $\mathcal{A}_K$ is safe, and a generic lower bound (Kveton et al., 2015) applies. When $\bar\sigma^2 \in (0, K/4)$, let $a := \frac{1-\sqrt{1-4\bar\sigma^2/K}}{2}$ and $\bar a := 1-a$. We consider instances whose items have Bernoulli reward distributions, and let $\mathrm{Bern}(a)$ denote the Bernoulli distribution with mean $a$. An item $i\in[L]$ with reward distribution $\mathrm{Bern}(\mu_i)$ has variance $\mu_i(1-\mu_i)$, which is smaller than $\bar\sigma^2/K$ if and only if $\mu_i\in(0,a)\cup(\bar a, 1)$. We will construct three instances such that under instance $k$ ($0\le k\le 2$), the stochastic reward of an item in path $j$ ($1\le j\le L_0$) is drawn from $\nu^{(k)}_j = \mathrm{Bern}(\mu^{(k)}_j)$, where $\mu^{(k)}_j$ will be specified in each case.

Under instance $0$, with $\mu_0 < a$, we let $\mu^{(0)}_j = \mu_0$ for all $j$, i.e., the reward distribution of each item is $\mathrm{Bern}(\mu_0)$. Since $\mu_{P_j} = K\mu_0$ and $\sigma^2_{P_j} = K\mu_0(1-\mu_0) < \bar\sigma^2$ for all $j$, each path is an identical safe and optimal solution. Since all paths are equivalent under instance $0$, we have $\mathbb{E}_0[M_j(T)] = T/L_0$ for all $j\in\{1,\dots,L_0\}$, where $\mathbb{E}_0$ denotes the expectation under instance $0$. We next construct instance $1$ such that $\mu^{(1)}_1 = \mu_1$ and $\mu^{(1)}_j = \mu_0$ for all $j\neq 1$, where $\mu_0 < \mu_1 < a$. (Footnote: in this proof, $\mu_1$ and $\mu_2$ are redefined to minimize clutter; their previous definitions are not used.) Hence, $P_1$ is the unique optimal safe solution under instance $1$, while all other solutions are safe but suboptimal. With an analysis similar to that of Lemma 6.4 in Zhong et al. (2021), we can show:

Lemma C.5. Let the reward distribution of item $i$ be $\nu^{(k)}_i$ under instance $k$ ($k = 1, 2$); then
$$\mathrm{KL}\big(H^{(1)}_T, H^{(2)}_T\big) = \sum_{i=1}^{L}\mathbb{E}_1[N_i(T)]\,\mathrm{KL}\big(\nu^{(1)}_i, \nu^{(2)}_i\big).$$
Hence, we have
$$\mathrm{KL}\big(H^{(0)}_T, H^{(1)}_T\big) = \sum_{i\in P_1}\mathbb{E}_0[N_i(T)]\,\mathrm{KL}\big(\nu^{(0)}_i, \nu^{(1)}_i\big) \overset{(a)}{\le} K\,\mathbb{E}_0[M_1(T)]\,d(\mu_0, \mu_1) = \frac{KT}{L_0}\,d(\mu_0, \mu_1),$$
where (a) follows from the fact that at most $K$ items are selected at one time step and the definition of the instances.
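The safety thresholds $a$ and $\bar a$ used in this construction are the two roots of $\mu(1-\mu) = \bar\sigma^2/K$; a quick Python check (illustrative values, not from the paper):

```python
import math

def bern_variance_roots(sigma2, K):
    """Roots of mu*(1-mu) = sigma2/K; Bern(mu) has variance below sigma2/K
    iff mu lies in (0, a) or (1 - a, 1). Assumes sigma2/K < 1/4."""
    r = math.sqrt(1 - 4 * sigma2 / K)
    a = (1 - r) / 2
    return a, 1 - a

a, a_bar = bern_variance_roots(sigma2=0.5, K=4)   # sigma2/K = 0.125
for mu in [a / 2, a, (a + a_bar) / 2, a_bar, 0.99]:
    print(round(mu, 3), mu * (1 - mu) < 0.125)    # True, False, False, False, True
```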
Next, we apply Pinsker's inequality to bound $\mathbb{E}_1[M_1(T)]$. Lemma C.1 indicates that
$$\big|\mathbb{E}_0[M_1(T)] - \mathbb{E}_1[M_1(T)]\big| \le T\sqrt{\tfrac{1}{2}\mathrm{KL}\big(H^{(0)}_T, H^{(1)}_T\big)} \quad\Longrightarrow\quad \mathbb{E}_1[M_1(T)] \ge \frac{T}{L_0} - T\sqrt{\frac{KT}{2L_0}\,d(\mu_0, \mu_1)}.$$
Moreover, since the paths $P_j$ for $j\neq 1$ are identical under instance $1$, we have
$$\mathbb{E}_1[M_j(T)] = \frac{1}{L_0-1}\big(T - \mathbb{E}_1[M_1(T)]\big) \le \frac{1}{L_0-1}\Big(T - \frac{T}{L_0} + T\sqrt{\frac{KT}{2L_0}\,d(\mu_0, \mu_1)}\Big) =: M.$$
To lower bound the regret incurred by $P_2$ under instance $1$, we need to take the effects of the safety constraint into consideration. Lemma C.3 indicates that:
if $M < T(\nu_0)$, then at each of the first $M$ time steps in $S_2(T)$, at most $K-1$ items are pulled and a regret of at least $[K(\mu_1-\mu_2)+\mu_2]M$ is incurred, i.e., $\mathrm{Reg}_{[2]} \ge [K(\mu_1-\mu_2)+\mu_2]M$;
if $M \ge T(\nu_0)$, then at each of the first $T(\nu_0)$ time steps in $S_2(T)$, at most $K-1$ items are pulled and a regret of at least $[K(\mu_1-\mu_2)+\mu_2]T(\nu_0)$ is incurred; besides, at the subsequent time steps in $S_2(T)$, a regret of at least $K(\mu_1-\mu_2)[M-T(\nu_0)]$ is incurred, i.e., $\mathrm{Reg}_{[2]} \ge [K(\mu_1-\mu_2)+\mu_2]T(\nu_0) + K(\mu_1-\mu_2)[M-T(\nu_0)] = K(\mu_1-\mu_2)M + \mu_2 T(\nu_0)$.
In short, we have $\mathrm{Reg}_{[2]} \ge K(\mu_1-\mu_2)M + \mu_2\min\{T(\nu_0), M\}$. We can lower bound $\mathrm{Reg}_{[j]}$ for all $j = 3, \dots, L_0$ with the same method. Besides, since $\mathbb{E}_1[M_1(T)] \ge \frac{T}{L_0} - T\sqrt{\frac{KT}{2L_0}d(\mu_0,\mu_1)}$ and at most $K-1$ items are selected at each of the first $T(\nu_1)$ time steps in $S_1(T)$, we have
$$\mathrm{Reg}_{[1]} \ge \mu_1\min\Big\{T(\nu_1),\ \frac{T}{L_0} - T\sqrt{\frac{KT}{2L_0}d(\mu_0,\mu_1)}\Big\}.$$
Therefore, under instance $1$, we have
$$\mathrm{Reg}(T) \ge \mu_1\min\Big\{\sup_{\nu'_1\in\mathcal{E}(\nu_1)}\frac{1}{K-1}\cdot\frac{d(\delta,1-\delta)}{\mathrm{KL}(\nu_1,\nu'_1)},\ \frac{T}{L_0} - T\sqrt{\frac{KT}{2L_0}d(\mu_0,\mu_1)}\Big\} + (L_0-1)\Big[K(\mu_1-\mu_2)M + \mu_2\min\Big\{\sup_{\nu'_0\in\mathcal{E}(\nu_0)}\frac{1}{K-1}\cdot\frac{d(\delta,1-\delta)}{\mathrm{KL}(\nu_0,\nu'_0)},\ M\Big\}\Big],$$
where $\mathcal{E}(\nu_0) = \{\nu'_0 : \text{the variance related to}\ \nu'_0\ \text{is larger than}\ \bar\sigma^2/K\}$ and $\mathcal{E}(\nu_1) = \{\nu'_1 : \text{the variance related to}\ \nu'_1\ \text{is larger than}\ \bar\sigma^2/K\}$. By Pinsker's inequality (see Lemma C.1), for $\mu_i \le 1/2$,
$$d(\mu_i, a) \le \frac{(\mu_i-a)^2}{a(1-\mu_i-a)^2}\cdot(1-\mu_i-a)^2\cdot\frac{1}{1} = \frac{[\mu_i(1-\mu_i)-\bar\sigma^2/K]^2}{a(1-\mu_i-a)^2} \le \frac{[\mu_i(1-\mu_i)-\bar\sigma^2/K]^2}{a(1/2-a)^2},$$
using $(\mu_i-a)^2(1-\mu_i-a)^2 = [\mu_i(1-\mu_i)-a(1-a)]^2 = [\mu_i(1-\mu_i)-\bar\sigma^2/K]^2$ and $1-\mu_i-a \ge 1/2-a$. Since $\nu_0 = \mathrm{Bern}(\mu_0)$, $\nu_1 = \mathrm{Bern}(\mu_1)$, and $0 < \mu_0 < \mu_1 < a < 1/2$, we have
$$\sup_{\nu'_0\in\mathcal{E}(\nu_0)}\frac{1}{K-1}\cdot\frac{d(\delta,1-\delta)}{\mathrm{KL}(\nu_0,\nu'_0)} \ge \frac{d(\delta,1-\delta)}{K-1}\cdot\frac{a(1/2-a)^2}{[\mu_0(1-\mu_0)-\bar\sigma^2/K]^2},\qquad \sup_{\nu'_1\in\mathcal{E}(\nu_1)}\frac{1}{K-1}\cdot\frac{d(\delta,1-\delta)}{\mathrm{KL}(\nu_1,\nu'_1)} \ge \frac{d(\delta,1-\delta)}{K-1}\cdot\frac{a(1/2-a)^2}{[\mu_1(1-\mu_1)-\bar\sigma^2/K]^2}.$$
We define the minimum variance gap $\Delta^v := \min_{S\in\mathcal{A}_K}\Delta^v_S$. When $K^3 \le L^2$ and $LK/T \le a^2$, we can let $\mu_1 - \mu_2 = \sqrt{L/(KT)}$. Then we have
$$\mathrm{Reg}(T) = \Omega\Big(\min\Big\{\frac{1}{K-1}\cdot\frac{d(\delta,1-\delta)}{(\Delta^v/K)^2},\ T\Big\} + \sqrt{KLT} + L_0\min\Big\{\frac{1}{K-1}\cdot\frac{d(\delta,1-\delta)}{(\Delta^v/K)^2},\ \frac{T}{K}\Big\}\Big) = \Omega\Big(\sqrt{KLT} + \min\Big\{\frac{L\,d(\delta,1-\delta)}{(\Delta^v)^2},\ T\Big\}\Big).$$
Moreover, we complete the proof with $d(\delta, 1-\delta) \ge \ln\frac{1}{2.4\delta}$ and $\delta_T = \delta$.

D. Tightness of the Upper Bound

Corollary D.1 (Tightness of problem-dependent bounds). Let $\{\delta_T\}_{T=1}^\infty \in o(1)$ be a sequence that satisfies $\ln(1/\delta_T) = o(T^b)$ for all $b > 0$.
If $\ln(1/\delta_T) \in o(\ln T)$, and in particular if $\delta_T = \delta_0 > 0$ for all $T$, the regret $\mathrm{Reg}(T)$ satisfies
$$\Omega\Big(\sum_{i}\frac{\ln T}{\Delta_{i,S_B,\min}} + \frac{\mu^*\ln T}{K}\sum_{i}\frac{1}{(\Delta^v_{i,R})^2} + \mu^*\sum_i\Phi_{i,S_B^c}\Big) \le \mathrm{Reg}(T) \le O\Big(\sum_{i}\frac{K\ln T}{\Delta_{i,S_B,\min}} + \mu^* K\ln T\sum_i\frac{1}{(\Delta^v_{i,R})^2} + \mu^* K\sum_i\Phi_{i,S_B^c}\Big).$$
If $\ln(1/\delta_T) = \lambda\ln T$, i.e., $\delta_T = T^{-\lambda}$ with a fixed $\lambda > 0$, the regret $\mathrm{Reg}(T)$ satisfies
$$\Omega\Big(\sum_i\frac{\ln T}{\Delta_{i,S_B,\min}} + \frac{\lambda\mu^*\ln T}{(\Delta^v_{S^*})^2} + \mu^*\sum_i\Big[\Psi^*_{i,S_B} + \frac{\ln T}{(\Delta^v_{i,R})^2} + \Phi_{i,S_B^c}\Big]\Big) \le \mathrm{Reg}(T) \le O\Big(\sum_i\frac{K\ln T}{\Delta_{i,S_B,\min}} + \frac{\lambda\mu^* K^2\ln T}{(\Delta^v_{S^*})^2} + K\mu^*\sum_i\Big[\Psi_{i,S_B} + \frac{\ln T}{(\Delta^v_{i,R})^2} + \Phi_{i,S_B^c}\Big]\Big),$$
where $\ln(1/\delta_T)$ in the $\Psi$ and $\Psi^*$ functions should be replaced by $\lambda\ln T$.
If $\ln(1/\delta_T) \in \omega(\ln T)$, the regret $\mathrm{Reg}(T)$ satisfies
$$\Omega\big(\mu^*\ln(1/\delta_T)\big) \le \mathrm{Reg}(T) \le O\big(\mu^* K^2\ln(1/\delta_T)\big).$$
The upper bounds above are achieved by PASCOMBUCB.
E. Additional Discussions and Future Research

E.1. Discussions on the Tightness Results

In terms of the problem-dependent bounds in Corollary 4.4, we consider general instances in which the rewards from the items are independent; the gap in $\mathrm{Reg}_1(T)$ can be closed when the rewards from the items are correlated, as in the lower bound for unconstrained combinatorial bandits in Kveton et al. (2015). This assumption also allows us to remove a factor of $K$ from the gap in $\mathrm{Reg}_2(T)$. In terms of the problem-independent bounds in Corollary 5.4, the regret due to suboptimality is almost tight, as in the unconstrained case (Kveton et al., 2015). The regret due to safeness checking is tight up to a factor of $K^2$. During each phase, PASCOMBUCB selects and samples solutions that are disjoint subsets of $A_p$, so each item is sampled at most once per phase. However, it is empirically feasible to sample some items more than once during one phase, which would help reduce the regret but requires a more delicate analysis. For future directions, it is of interest to close the gap (the factor of $K$) in the regret due to safeness checking, either with improved analyses or with additional assumptions on the instance.

E.2. Discussions on the Problem Formulation

Anytime safety is important in safety-critical applications in which the risk cannot exceed a certain threshold at any point in time. For example, for a self-driving car scheduled to move from start point $x_0$ to end point $x_n$ via waypoints $(x_1, x_2, \dots, x_{n-1})$ (the choice of these waypoints is a combinatorial problem), it is necessary that the car stay in its designated lane at all points in time, and not just "on average"; otherwise, a catastrophic accident might occur at some point in time with non-negligible probability. In this example, we might want to choose a route that is safe at all times w.h.p. (in the sense that the car stays in its designated lane) at the cost of a longer travel time.

This work studies the anytime-safe constraint at the super-arm level and provides a first step toward understanding risk in combinatorial semi-bandits. (Footnote: in this discussion, the terms "super arm" and "base arm" refer to solution and item, respectively.) The sum $\sum_{i\in A}\sigma^2_i$ over a set of items (base arms) $A$ is adopted as the risk measure, which is a certain function of $\sigma^2_i$, $i\in A$. It is also of interest to study anytime safety at the base-arm level, where the risk function is $\max_{i\in A}\sigma^2_i$. From a technical point of view, if the safeness of any single base arm has not yet been ascertained (as we need to learn the safeness/risk of the base arms), then pulling any base arm can be risky, in the sense that, with non-vanishing probability, the anytime-safe constraint (or even the less stringent stagewise safety constraint (Khezeli & Bitar, 2020)) is violated when we explore (by pulling any base arm) at the beginning. Thus, this seems infeasible from a technical standpoint. Nevertheless, we believe that suitable additional assumptions can be made to formulate a practical and feasible model, which we leave to future research.

E.3. Comparisons

Comparison with Wu et al. (2016); Kazerouni et al. (2017): While the conservative (linear) constraint $\sum_{k=1}^{t}\langle X_k, \theta^*\rangle \ge (1-\alpha)\,t\,b_0$ requires that the constraint be satisfied w.p. $1-\delta$ over the whole horizon, which is similar to our probably anytime-safe constraint, that constraint is stated in terms of the cumulative expected reward (up to time step $t$).
The cumulative nature of the conservative constraint maintains a budget reservoir that makes it less stringent than the stagewise safety constraint (Khezeli & Bitar, 2020), in the sense that an algorithm may satisfy the conservative constraint while violating the stagewise safety constraint. Both the stagewise safety constraint and the probably anytime-safe constraint consider the reward/risk incurred at each single time step.

Comparison with Khezeli & Bitar (2020); Moradipari et al. (2020): To the best of our knowledge, the stagewise safety (or stagewise conservative) constraint (Khezeli & Bitar, 2020; Moradipari et al., 2020) is the risk-aware constraint in the literature most closely related to our anytime-safe constraint. (1) The stagewise conservative constraint is a constraint on the mean reward (hence, only one statistic is involved in the problem), originating from the conservative constraint (Wu et al., 2016). In our setup, we impose the anytime-safe constraint on the risk while minimizing the regret, which requires us to consider two statistics and the interaction between them. (2) The stagewise safety constraint has only been used in the linear bandit setup in the literature, where the arm set is assumed to be a convex and compact set in $\mathbb{R}^n$ (Moradipari et al., 2020), so the arms constitute an uncountable, continuous set. If an arm $A$ is known to be safe, then it is safe to pull any arm near $A$. In the combinatorial bandit setup, however, such a continuity property of the arm set does not hold, since the arm set is "discrete". Specifically, given that a super arm $A$ is safe (but not absolutely safe), even the safeness of a nearby arm $A'$, obtained by replacing a single base arm in $A$ with another base arm, cannot be guaranteed by the safeness of $A$. (3) In terms of the safety level, consider the stagewise safety constraint with a constant confidence parameter $\delta$ (independent of $T$); intuitively, the safety constraint can be violated approximately $\delta T$ times, which is linear in $T$. In addition, consider an algorithm that performs safeness checking first, followed by exploration-and-exploitation on the safe super arms: it takes $\Theta\big(\frac{1}{(\Delta^v_A)^2}\ln\frac{1}{\delta}\big)$ pulls to identify the unsafeness of an unsafe super arm $A$. Note that the time required is independent of $T$, so the regret due to safeness checking is $o(T^a)$ for all $a > 0$. From this perspective, the stagewise safety constraint is not stringent enough and can be easily satisfied by such a naive algorithm. A more direct (yet not completely rigorous) intuition is: if the algorithm ignores the stagewise safety constraint, it only takes $o(T^a)$ (for all $a > 0$) time steps to rule out the unsafe super arms, which indicates that it satisfies the stagewise safety constraint (since the unsafe super arms are pulled $o(T^a) < \delta T$ times). From another point of view, given a confidence parameter $\delta$, if the stagewise safety constraint is satisfied w.p. $1-\delta_t$ at each time step $t$ with $\sum_{t=1}^{T}\delta_t = \delta$, then the probably anytime-safe constraint is met w.p. $1-\delta$. (4) Beyond the constant confidence parameter $\delta$, we have investigated the whole spectrum of $\delta$ as a function of the time horizon $T$. The tightness result (Corollary D.1) indicates that our algorithm is capable of dealing with an even stricter constraint (in the sense that $\delta$ decreases with respect to $T$), and we provide a sharp threshold on the achievability of $o(T^a)$ (for all $a > 0$) regret (see Theorem 4.5).