# Robust Contextual Bandits via Bootstrapping

Qiao Tang, Hong Xie, Yunni Xia, Jia Lee, Qingsheng Zhu
Chongqing Key Laboratory of Software Theory and Technology, Chongqing University
felicity0719@163.com, xiehong2018@cqu.edu.cn, xiayunni@hotmail.com, {lijia, qszhu}@cqu.edu.cn

Abstract

Upper confidence bound (UCB) based contextual bandit algorithms require one to know the tail property of the reward distribution. Unfortunately, such tail property is usually unknown or difficult to specify in real-world applications. Using a tail property heavier than the ground truth leads to a slow learning speed of the contextual bandit algorithm, while using a lighter one may cause the algorithm to diverge. To address this fundamental problem, we develop an estimator (evaluated from historical rewards) for the contextual bandit UCB based on the multiplier bootstrapping technique. We first establish sufficient conditions under which our estimator converges asymptotically to the ground truth of the contextual bandit UCB. We further derive a second-order correction for our estimator so as to obtain its confidence level with a finite number of rounds. To demonstrate the versatility of the estimator, we apply it to design a BootLinUCB algorithm for the contextual bandit. We prove that BootLinUCB has a sub-linear regret upper bound and also conduct extensive experiments to validate its superior performance.

Introduction

Contextual bandit is a popular online learning framework and has been applied to solve many real-world problems: it has been applied to recommender systems to recommend products to interactive users (Li et al. 2010; Wang, Wu, and Wang 2016; Zhang et al. 2019), to optimize information retrieval algorithms (Hofmann et al. 2011; Gampa and Fujita 2019; Glowacka et al. 2019), as well as to networking applications such as selecting edge servers in mobile edge computing systems (Ouyang et al. 2019). Also, numerous variants of the contextual bandit have been developed (Filippi et al. 2010; Wang, Wu, and Wang 2016; Krishnamurthy, Wu, and Syrgkanis 2018; Zhang et al. 2019). To illustrate, consider the following example of a simplified contextual bandit:

Example 1. Consider a finite number of arms indexed by $a \in \mathcal{A} \triangleq \{1, \dots, A\}$, where $A \in \mathbb{N}_+$. Arm $a$ is associated with a feature vector $x_a \in \mathbb{R}^d$, where $d \in \mathbb{N}_+$. The reward model for arm $a$ is
$$R_a = x_a^\top \theta^* + W_a,$$
where $\theta^* \in \mathbb{R}^d$ and $W_a$ is a random variable with normal distribution $\mathcal{N}(0, \sigma_a^2)$. In each round $t \in \mathbb{N}_+$, the decision maker selects an arm $a_t \in \mathcal{A}_t \subseteq \mathcal{A}$ and receives the reward, which is a sample from $R_{a_t}$. The objective is to design an arm selection algorithm to attain as large a cumulative reward as possible in $T \in \mathbb{N}_+$ rounds.
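For concreteness, the following is a minimal numpy sketch of the reward model in Example 1; the number of arms, feature dimension, noise levels, and the uniformly random placeholder arm choice are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): A arms with d-dimensional features.
A, d = 5, 3
X = rng.uniform(0.0, 1.0, size=(A, d))        # feature vector x_a for each arm a
theta_star = rng.uniform(0.0, 1.0, size=d)    # unknown preference vector theta*
sigma = rng.uniform(0.1, 1.0, size=A)         # per-arm noise std sigma_a (unknown in practice)

def pull(a: int) -> float:
    """Sample a reward R_a = x_a^T theta* + W_a with W_a ~ N(0, sigma_a^2)."""
    return float(X[a] @ theta_star + rng.normal(0.0, sigma[a]))

# One round of the interaction in Example 1: pick an arm, observe only its reward.
expected = X @ theta_star                      # E[R_a] = x_a^T theta*, hidden from the learner
a_t = rng.integers(A)                          # placeholder (uniform) arm choice
print(f"pulled arm {a_t}: reward {pull(a_t):.3f}, expected {expected[a_t]:.3f}")
```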
For the bandit feedback illustrated in Example 1, contextual bandit algorithms need to balance the exploitation vs. exploration trade-off. One popular class of contextual bandit algorithms uses the upper confidence bound (UCB) approach to balance this trade-off (Abbasi-Yadkori, Pál, and Szepesvári 2011; Auer 2002; Chu et al. 2011; Lattimore and Szepesvári 2020). The following example illustrates one of the UCB approaches for Example 1.

Example 2. Consider the setting of Example 1. In round $t$, the decision maker selects the arm $a_t \in \arg\max_{a \in \mathcal{A}_t} U_t(a)$, where $U_t(a)$ is the UCB associated with arm $a$ and it is defined as
$$U_t(a) \triangleq x_a^\top \hat\theta_t + \phi_t(x_a, \boldsymbol{\sigma}, \mathcal{H}_{t-1}),$$
where $\boldsymbol{\sigma} = (\sigma_1, \dots, \sigma_A)$, $\mathcal{H}_t$ denotes the historical arms and rewards up to round $t$, $\hat\theta_t$ denotes an estimator of $\theta^*$ evaluated from $\mathcal{H}_{t-1}$, and the penalty term satisfies $\phi_t(x_a, \boldsymbol{\sigma}, \mathcal{H}_{t-1}) > 0$.

Example 2 summarizes the structure of LinUCB algorithms for the contextual bandit (Abbasi-Yadkori, Pál, and Szepesvári 2011; Chu et al. 2011). The penalty term encourages exploration, and a larger value induces more exploration. The penalty term $\phi_t(x_a, \boldsymbol{\sigma}, \mathcal{H}_{t-1})$ is non-decreasing in $\sigma_a$, $\forall a \in \mathcal{A}$, capturing that when the reward is subject to a larger variation, the estimate $\hat\theta_t$ becomes less accurate and the algorithm therefore needs to explore more. Unfortunately, one does not possess the knowledge of $\sigma_a$, $\forall a \in \mathcal{A}$, in practice. Using a standard deviation larger than $\sigma_a$ for the algorithm in Example 2 leads to over-exploration, i.e., slow learning speed, while the reverse may cause the algorithm to diverge. In summary, Example 2 highlights a reward distribution tail property mismatch problem (e.g., $\sigma_a$ characterizes the tail property of the reward distribution), which is inherent in many UCB based contextual bandit algorithms beyond LinUCB, e.g., LinRel (Auer 2002), SupLinUCB (Chu et al. 2011), and action elimination algorithms (Lattimore and Szepesvári 2020).

This paper considers how to address the fundamental problem of reward distribution tail property mismatch in a general setting, i.e., the reward distribution can evolve over time and can be non-parametric and non-Gaussian, etc. Specifically, we deploy multiplier bootstrap methods (Arlot et al. 2010; Yang, Shang, and Cheng 2017) and develop an estimator for the UCB of the contextual bandit, which can be directly evaluated from the historical arms and rewards without requiring one to know or specify the tail property of the reward distribution. There are two challenges in providing theoretical guarantees: (1) the ground truth UCBs being estimated across arms are correlated; (2) one only possesses the bandit feedback in each round, and it is coupled with the estimated UCB. We address them and our contributions are:

- We apply the multiplier bootstrap technique to develop a novel estimator for the contextual bandit UCB. The estimator can be directly evaluated from the historical arms and rewards without requiring one to specify the tail property of the reward distribution.
- We establish sufficient conditions under which our estimator converges asymptotically to the ground truth of the contextual bandit UCB. These sufficient conditions guide us to select arms in the early decision rounds. We further derive a second-order correction for our estimator to obtain its confidence level with only a finite number of rounds.
- We select the arm associated with the largest data-driven UCB (i.e., the corrected estimator) in the remaining time slots (i.e., except the early time slots mentioned above), resulting in our BootLinUCB algorithm for the contextual bandit. We prove that BootLinUCB has a sub-linear regret upper bound. The estimator is general and can be applied to other UCB based contextual bandits, e.g., LinRel (Auer 2002), SupLinUCB (Chu et al. 2011), etc.
- We conduct extensive experiments to validate the superior performance of our BootLinUCB algorithm over the latest bootstrapping based LinUCB algorithm (Hao et al. 2019) and the classical LinUCB algorithm (Chu et al. 2011).
Related Work

The research on contextual bandits (Chu et al. 2011; Li et al. 2010) can be organized into three lines: an algorithmic line, a modeling line, and an application line. The algorithmic line focuses on developing algorithms with faster learning speed, more robustness to model misspecification, etc. (Abbasi-Yadkori, Pál, and Szepesvári 2011; Chu et al. 2011; Agrawal and Goyal 2013; Zhu et al. 2017; Hao et al. 2019). The modeling line focuses on extending the contextual bandit model to handle more general or complicated settings (Filippi et al. 2010; Wang, Wu, and Wang 2016; Krishnamurthy, Wu, and Syrgkanis 2018; Zhang et al. 2019). The application line focuses on tuning contextual bandit algorithms to solve online decision problems in real-world applications (Li et al. 2010; Hofmann et al. 2011; Glowacka et al. 2019; Ouyang et al. 2019; Zhang et al. 2019). This paper falls into the algorithmic line; in particular, it uses bootstrapping methods to improve contextual bandit algorithms.

A number of recent works applied bootstrapping methods to design data-driven algorithms for the contextual bandit. These works can be organized into a Thompson sampling line and a UCB line. In the Thompson sampling line, bootstrapping methods are used to design Thompson sampling algorithms for the contextual bandit (Eckles and Kaptein 2014; Osband and Van Roy 2015; Tang et al. 2015; Elmachtoub et al. 2017; Kveton et al. 2019; Vaswani et al. 2018). One advantage of the Thompson sampling method is that these algorithms are non-parametric, i.e., they do not require a parametric form of the model. However, these algorithms usually do not have theoretical guarantees, i.e., regret upper bounds, except (Kveton et al. 2019), whose regret upper bound is obtained under a Bernoulli assumption. In contrast, our work applies to a broader class of distributions, i.e., symmetric sub-Gaussian distributions. Another technical difference is that our work develops data-driven UCB algorithms for the contextual bandit with theoretical guarantees. In the UCB line, bootstrapping methods are used to design data-driven UCB algorithms for the contextual bandit (Sudarsanam and Ravindran 2016; Hao et al. 2019). Again, one advantage over the UCB method is that these algorithms are non-parametric and adapt to the ground truth UCB. These algorithms (Sudarsanam and Ravindran 2016; Hao et al. 2019) do not have regret upper bounds for the contextual bandit problem. Note that the algorithm in (Hao et al. 2019) provides a regret upper bound for the classical multi-armed bandit problem. However, it is nontrivial to extend it to contextual bandit problems because in the classical multi-armed bandit problem the UCB for each arm is independent, while in the contextual bandit they are correlated. We apply the multiplier bootstrap technique to develop a novel estimator for the contextual bandit UCB and establish conditions to guarantee the convergence of the estimator.

Contextual Bandit Model

Consider a contextual bandit model with a finite number of $A \in \mathbb{N}_+$ arms. Denote the arm set as $\mathcal{A} \triangleq \{1, \dots, A\}$. Each arm $a \in \mathcal{A}$ is associated with a $d \in \mathbb{N}_+$ dimensional feature vector $x_a \in \mathbb{R}^d$. The feature vector $x_a$ is known to the decision maker. Consider a discrete time system indexed by $t \in \mathbb{N}_+$. Each time slot corresponds to one decision epoch or decision round. In time slot $t$, a subset $\mathcal{A}_t \subseteq \mathcal{A}$ of arms is presented to the decision maker. Then, the decision maker selects one arm from $\mathcal{A}_t$, denoted by $a_t \in \mathcal{A}_t$. Finally, the decision maker receives a reward $r_t \in \mathbb{R}$. Note that the rewards of the other arms are not revealed to the decision maker.
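The interaction protocol described above can be spelled out in a short sketch: in each round a subset of arms is presented, a history-dependent policy picks one of them, and only the chosen arm's reward is revealed and appended to the history $\mathcal{H}_t$. The policy signature, the random availability of arms, and the uniform placeholder policy below are assumptions for illustration only.

```python
import numpy as np
from typing import Callable, List, Tuple

rng = np.random.default_rng(1)

A, d = 6, 4
X = rng.uniform(0.0, 1.0, size=(A, d))      # known feature vectors x_a
theta_star = rng.uniform(0.0, 1.0, size=d)  # unknown to the decision maker

# History entry: (chosen arm a_t, its feature x_{a_t}, observed reward r_t).
History = List[Tuple[int, np.ndarray, float]]

def run_round(history: History,
              policy: Callable[[History, np.ndarray], int]) -> None:
    """One decision round: present a subset A_t, let the policy choose, reveal one reward."""
    A_t = rng.choice(A, size=rng.integers(2, A + 1), replace=False)  # available arms A_t
    a_t = policy(history, A_t)
    r_t = float(X[a_t] @ theta_star + rng.normal(0.0, 1.0))          # bandit feedback only
    history.append((a_t, X[a_t], r_t))      # rewards of the other arms stay hidden

uniform_policy = lambda history, A_t: int(rng.choice(A_t))  # placeholder policy

history: History = []
for t in range(1, 11):
    run_round(history, uniform_policy)
print(len(history), "rounds recorded in H_t")
```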
The objective is to design an arm selection algorithm to attain as large a cumulative reward as possible. We consider linear reward functions. Define the reward function for arm $a$ in time slot $t$ as
$$R_{a,t} \triangleq x_a^\top \theta^* + W_{a,t}, \qquad (1)$$
where $\theta^*$ denotes a preference vector and the random variable $W_{a,t}$ denotes a stochastic noise with zero mean, $\mathbb{E}[W_{a,t}] = 0$. Furthermore, the $W_{a,t}$ across arms and time slots are independent (Abbasi-Yadkori, Pál, and Szepesvári 2011; Chu et al. 2011). The preference vector $\theta^*$ is unknown to the decision maker. The stochastic noise $W_{a,t}$ captures the randomness or variation in the reward. The reward $r_t$ in time slot $t$ is a sample or realization of $R_{a_t,t}$. Then the expected reward in time slot $t$ is $\mathbb{E}[R_{a_t,t}] = x_{a_t}^\top \theta^*$. We consider a risk-neutral decision maker, who aims to maximize the expected cumulative reward $\mathbb{E}[\sum_{t=1}^{T} R_{a_t,t}]$. The linearity of expectation, $\sum_{t=1}^{T} \mathbb{E}[R_{a_t,t}]$, implies that the optimal arm in time slot $t$, denoted by $a_t^*$, can be derived as
$$a_t^* \in \arg\max_{a \in \mathcal{A}_t} \mathbb{E}[R_{a,t}] = \arg\max_{a \in \mathcal{A}_t} x_a^\top \theta^*.$$
However, the optimal arm $a_t^*$ is unknown to the decision maker because the preference vector $\theta^*$ is unknown. The objective is to design an arm selection algorithm, denoted by $\mathcal{A}$, to maximize the expected cumulative reward $\mathbb{E}[\sum_{t=1}^{T} R_{a_t,t}]$. We consider a class of history-dependent arm selection algorithms $\mathcal{A}$, which prescribe an arm for each interaction history. Denote the reward history up to time slot $t$ as $\mathcal{H}_t \triangleq \{(a_1, x_{a_1}, r_1), \dots, (a_t, x_{a_t}, r_t)\}$, which contains historical arms and rewards. Formally, the algorithm can be represented as a mapping $\mathcal{A} : \mathcal{H}_{t-1} \mapsto \mathcal{A}_t$ with $a_t = \mathcal{A}(\mathcal{H}_{t-1})$. We use the regret to quantify the performance of algorithm $\mathcal{A}$:
$$R_T(\mathcal{A}) \triangleq \sum_{t=1}^{T} x_{a_t^*}^\top \theta^* - \mathbb{E}\Big[\sum_{t=1}^{T} x_{a_t}^\top \theta^* \,\Big|\, a_t = \mathcal{A}(\mathcal{H}_{t-1})\Big].$$
A smaller regret implies that algorithm $\mathcal{A}$ achieves a larger expected cumulative reward.

BootLinUCB Algorithmic Framework

In this section, we first present some basic elements of regularized least squares for contextual bandits. Then we present the formulation of our quantile bootstrapping oracle for contextual bandits. Finally, we apply this quantile bootstrapping oracle to design our BootLinUCB algorithmic framework.

Regularized Least Squares

Regularized least squares is the mainstream method to estimate the preference vector $\theta^*$ of contextual bandits (Lattimore and Szepesvári 2020):
$$\hat\theta_t = \arg\min_{\theta \in \mathbb{R}^d} \sum_{s=1}^{t-1} (x_{a_s}^\top \theta - r_s)^2 + \lambda \|\theta\|_2^2,$$
where $\hat\theta_t$ denotes an estimate of $\theta^*$ in time slot $t$ and $\lambda > 0$ denotes a regularization parameter. The closed-form expression for $\hat\theta_t$ is:
$$\hat\theta_t = V_{t-1}^{-1} \sum_{s=1}^{t-1} x_{a_s} r_s, \qquad V_{t-1} = \lambda I + \sum_{s=1}^{t-1} x_{a_s} x_{a_s}^\top.$$
This closed-form expression of $\hat\theta_t$ implies that
$$x_a^\top (\theta^* - \hat\theta_t) = \mathcal{E}(x_a, \mathcal{F}_{t-1}) + \lambda x_a^\top V_{t-1}^{-1} \theta^*,$$
where $\mathcal{F}_{t-1} \triangleq \{a_1, \dots, a_{t-1}\}$ and $\mathcal{E}(x_a, \mathcal{F}_{t-1})$ is defined as a partial residual
$$\mathcal{E}(x_a, \mathcal{F}_{t-1}) \triangleq x_a^\top V_{t-1}^{-1} \sum_{s=1}^{t-1} x_{a_s} (x_{a_s}^\top \theta^* - r_s).$$
The tail property of the reward distribution is essential for deriving the UCB of $x_a^\top \theta^*$. To illustrate, the following lemma generalizes the UCB in (Chu et al. 2011) for a bounded reward in $[0, 1]$ to a sub-Gaussian reward.

Lemma 1. Suppose $W_{a,t}$ is sub-Gaussian, i.e., $\mathbb{E}[\exp(c\, W_{a,t})] \le \exp(c^2 \sigma_{a,t}^2 / 2)$, $\forall c \in \mathbb{R}$, and $\mathcal{F}_t$ is deterministic. Then
$$\mathbb{P}\big[x_a^\top \theta^* \ge x_a^\top \hat\theta_t + \lambda x_a^\top V_{t-1}^{-1} \theta^* + \varphi_t(x_a, \mathcal{H}_{t-1})\big] \le \alpha_t,$$
where $\varphi_t(x_a, \mathcal{H}_{t-1}) = \sigma_{\max} \sqrt{2 \ln(1/\alpha_t)}\, \|x_a\|_{V_{t-1}^{-1}}$ denotes an upper bound of the $(1-\alpha_t)$-quantile of the partial residual $\mathcal{E}(x_a, \mathcal{F}_{t-1})$ and $\sigma_{\max} = \max_{a,t} \sigma_{a,t}$.

Remark: Lemma 1 implies that a UCB for the ground truth reward $x_a^\top \theta^*$ can be
$$U_t(a) \triangleq x_a^\top \hat\theta_t + \lambda x_a^\top V_{t-1}^{-1} \theta^* + \varphi_t(x_a, \mathcal{H}_{t-1}).$$
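A minimal sketch of the quantities just defined: the ridge estimate $\hat\theta_t$, the norm $\|x_a\|_{V_{t-1}^{-1}}$, and the sub-Gaussian index $U_t(a)$ of Lemma 1. The term $\lambda x_a^\top V_{t-1}^{-1} \theta^*$ uses the true $\theta^*$ only to mirror the lemma (it is unknown in practice), and all numeric parameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d, lam, sigma_max, alpha_t = 4, 1.0, 1.0, 0.05
theta_star = rng.uniform(size=d)

# A few observed (feature, reward) pairs, i.e., the content of H_{t-1} (synthetic).
X_hist = rng.uniform(size=(30, d))
r_hist = X_hist @ theta_star + rng.normal(0.0, sigma_max, size=30)

# Regularized least squares: V_{t-1} = lam*I + sum_s x x^T,  theta_hat = V^{-1} sum_s x r.
V = lam * np.eye(d) + X_hist.T @ X_hist
V_inv = np.linalg.inv(V)
theta_hat = V_inv @ (X_hist.T @ r_hist)

def ucb(x_a: np.ndarray) -> float:
    """U_t(a) = x^T theta_hat + lam*x^T V^{-1} theta* + sigma_max*sqrt(2 ln(1/alpha))*||x||_{V^{-1}}."""
    norm_Vinv = np.sqrt(x_a @ V_inv @ x_a)        # ||x_a||_{V_{t-1}^{-1}}
    bias = lam * x_a @ V_inv @ theta_star         # known here only for illustration
    bonus = sigma_max * np.sqrt(2.0 * np.log(1.0 / alpha_t)) * norm_Vinv
    return float(x_a @ theta_hat + bias + bonus)

x_a = rng.uniform(size=d)
print(f"U_t(a) = {ucb(x_a):.3f}  vs  x_a^T theta* = {x_a @ theta_star:.3f}")
```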
A variety of contextual bandit algorithms use $U_t(a)$ to select arms: the LinUCB algorithm (Chu et al. 2011) selects arms via $a_t \in \arg\max_{a \in \mathcal{A}_t} U_t(a)$; LinRel (Auer 2002) and SupLinUCB (Chu et al. 2011) use $U_t(a)$ to assist arm selection and sub-optimal arm elimination; forced-exploration based algorithms (Abbasi-Yadkori, Antos, and Szepesvári 2009) can use $U_t(a)$ to adaptively determine when to stop exploration, just to name a few. However, in practice, the exact $\varphi_t(x_a, \mathcal{H}_{t-1})$ is unknown to the decision maker, because the parameter $\sigma_{a,t}$ is unknown and $W_{a,t}$ may not even be sub-Gaussian, making it more difficult to specify the UCB. We next formulate an oracle to bootstrap $U_t(a)$ beyond sub-Gaussian rewards.

Quantile Bootstrapping Oracle

To be consistent with the condition in Lemma 1, in this subsection we consider a deterministic $\mathcal{F}_t$. Lemma 1 shows that the upper confidence bound of the ground truth reward $x_a^\top \theta^*$ is determined by the $(1-\alpha_t)$-quantile of the partial residual $\mathcal{E}(x_a, \mathcal{F}_{t-1})$. Formally, the ground truth $(1-\alpha)$-quantile of $\mathcal{E}(x_a, \mathcal{F}_{t-1})$ is defined as
$$q_t(x_a, 1-\alpha) \triangleq \inf\{z \in \mathbb{R} \mid \mathbb{P}[\mathcal{E}(x_a, \mathcal{F}_{t-1}) \le z] \ge 1-\alpha\}.$$
With the ground truth quantile $q_t(x_a, 1-\alpha_t)$, the ground truth UCB for $x_a^\top \theta^*$ can be expressed as
$$\Upsilon_t(a) \triangleq x_a^\top \hat\theta_t + \lambda x_a^\top V_{t-1}^{-1} \theta^* + q_t(x_a, 1-\alpha_t).$$
Note that $\Upsilon_t(a) \le U_t(a)$ because $\varphi_t(x_a, \mathcal{H}_{t-1})$ is an upper bound of the $(1-\alpha_t)$-quantile of the partial residual $\mathcal{E}(x_a, \mathcal{F}_{t-1})$, i.e., $q_t(x_a, 1-\alpha_t) \le \varphi_t(x_a, \mathcal{H}_{t-1})$. The following definition states a class of bootstrapping oracles, which bootstrap (or estimate) the quantile $q_t(x_a, 1-\alpha_t)$ from the reward history $\mathcal{H}_{t-1}$ with theoretical guarantees.

Definition 1. Define BootQuantile($x_a$, $\mathcal{H}_{t-1}$, $\alpha$) as an oracle which bootstraps the quantile $q_t(x_a, \alpha)$ from the interaction history $\mathcal{H}_{t-1}$ and satisfies
$$\mathbb{P}[\mathcal{E}(x_a, \mathcal{F}_{t-1}) \le Q_t(x_a, \alpha)] \ge \alpha, \quad \forall a \in \mathcal{A}_t,$$
where $Q_t(x_a, \alpha) = \text{BootQuantile}(x_a, \mathcal{H}_{t-1}, \alpha)$, $\alpha \in [0, 1]$, and $\mathcal{F}_{t-1}$ is deterministic.

The oracle defined in Definition 1 takes the feature vector, the interaction history and a confidence level as input, and outputs an estimated quantile for each arm satisfying the inputted confidence level. The details of the bootstrapping oracle are deferred to the next section. In this section, let us focus on how to use it to design algorithms for the contextual bandit.

BootLinUCB Algorithm

We apply the BootQuantile($x_a$, $\mathcal{H}_{t-1}$, $\alpha$) oracle to design our BootLinUCB algorithmic framework outlined in Algorithm 1, where $\mathcal{F}_t$ can be coupled with the reward. Note that for other algorithms like forced-exploration based algorithms (Abbasi-Yadkori, Antos, and Szepesvári 2009), LinRel (Auer 2002) and SupLinUCB (Chu et al. 2011), where $\mathcal{F}_t$ is deterministic or independent of the reward, the BootQuantile oracle can also be applied. Due to the page limit, we omit them. To execute Algorithm 1, one needs to specify a bootstrapping oracle BootQuantile($x_a$, $\mathcal{H}_{t-1}$, $\alpha$) and the confidence level parameters $\alpha_t$, $t = 1, \dots, T$. In each time slot $t$, the algorithm first computes an estimate of the preference parameter, denoted by $\hat\theta_t$. It then computes an estimate of the quantile $Q_t(x_a, 1-\alpha_t)$ for each arm by calling the bootstrapping oracle. It uses the estimated quantile to construct a UCB for each arm, and then selects the arm with the largest UCB value. Finally, it receives the reward and updates the interaction history. Note that $\lambda x_a^\top V_{t-1}^{-1} \theta^*$ is unknown as $\theta^*$ is unknown. In the implementation, one can replace $\lambda x_a^\top V_{t-1}^{-1} \theta^*$ with its upper bound (Chu et al. 2011)
$$\lambda x_a^\top V_{t-1}^{-1} \theta^* \le L \sqrt{\lambda}\, \|x_a\|_{V_{t-1}^{-1}},$$
where $L$ is an upper bound of the norm of the preference parameter, i.e., $\|\theta^*\| \le L$.
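The selection step just described might look as follows, with `boot_quantile` treated as a black-box oracle in the sense of Definition 1 and the unknown term $\lambda x_a^\top V_{t-1}^{-1} \theta^*$ replaced by its upper bound $L\sqrt{\lambda}\,\|x_a\|_{V_{t-1}^{-1}}$; the stand-in oracle and all constants are assumptions for illustration.

```python
import numpy as np
from typing import Callable

def select_arm(X_avail: np.ndarray,        # features of the arms in A_t, one row per arm
               theta_hat: np.ndarray,      # ridge estimate computed from H_{t-1}
               V_inv: np.ndarray,          # V_{t-1}^{-1}
               boot_quantile: Callable[[np.ndarray, float], float],
               alpha_t: float,
               lam: float = 1.0,
               L: float = 1.0) -> int:
    """Pick argmax_a [ x^T theta_hat + L*sqrt(lam)*||x||_{V^{-1}} + Q_t(x, 1 - alpha_t) ]."""
    scores = []
    for x_a in X_avail:
        norm_Vinv = np.sqrt(x_a @ V_inv @ x_a)              # ||x_a||_{V_{t-1}^{-1}}
        bias_bound = L * np.sqrt(lam) * norm_Vinv           # bound on lam * x^T V^{-1} theta*
        scores.append(x_a @ theta_hat + bias_bound + boot_quantile(x_a, 1.0 - alpha_t))
    return int(np.argmax(scores))

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    d = 4
    X_hist = rng.uniform(size=(20, d))
    V_inv = np.linalg.inv(np.eye(d) + X_hist.T @ X_hist)
    theta_hat = rng.uniform(size=d)                         # placeholder estimate (assumption)
    dummy_oracle = lambda x, level: float(np.sqrt(x @ V_inv @ x))  # placeholder oracle (assumption)
    X_avail = rng.uniform(size=(6, d))
    print("chosen arm:", select_arm(X_avail, theta_hat, V_inv, dummy_oracle, alpha_t=0.05))
```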
Algorithm 1: BootLinUCB algorithmic framework
1: Input: an oracle BootQuantile($x_a$, $\mathcal{H}_{t-1}$, $\alpha$) and confidence level parameters $\alpha_1, \dots, \alpha_T$.
2: $\mathcal{H}_0 \leftarrow \emptyset$, $b_0 \leftarrow \mathbf{0}$, $V_0 \leftarrow \lambda I$.
3: for $t = 1, \dots, T$ do
4:   $\hat\theta_t = V_{t-1}^{-1} b_{t-1}$.
5:   $Q_t(x_a, 1-\alpha_t) \leftarrow$ BootQuantile($x_a$, $\mathcal{H}_{t-1}$, $1-\alpha_t$) for each $a \in \mathcal{A}_t$.
6:   Choose arm $a_t \in \arg\max_{a \in \mathcal{A}_t} \big[ x_a^\top \hat\theta_t + \lambda x_a^\top V_{t-1}^{-1} \theta^* + Q_t(x_a, 1-\alpha_t) \big]$.
7:   Observe reward $r_t$.
8:   $V_t \leftarrow V_{t-1} + x_{a_t} x_{a_t}^\top$.
9:   $b_t \leftarrow b_{t-1} + x_{a_t} r_t$.
10:  $\mathcal{H}_t \leftarrow \mathcal{H}_{t-1} \cup \{(a_t, x_{a_t}, r_t)\}$.
11: end for

The following theorem states the regret upper bound.

Theorem 1. If for each given $a$ the $W_{a,t}$, $\forall t$, are identically distributed, the regret of Algorithm 1 can be bounded as
$$R_T(\mathcal{A}_{\text{BootLinUCB}}) \le \sum_{t=1}^{T} \mathbb{E}\big[\min\{Q_t(x_{a_t}, 1-\alpha_t) + Q_t(x_{a_t^*}, 1-\alpha_t),\, g_t\}\big] + \sum_{t=1}^{T} \binom{t+A-2}{A-1} \alpha_t g_t,$$
where $g_t \triangleq \max_{a \in \mathcal{A}_t} x_a^\top \theta^* - \min_{a \in \mathcal{A}_t} x_a^\top \theta^*$ denotes the maximum possible regret in time slot $t$.

Remark: Due to the page limit, we present all proofs in our supplementary file. Though in Algorithm 1 $\mathcal{F}_t$ depends on the reward, we can decouple it via a conditioning trick in the analysis. The term $\sum_{t=1}^{T} \binom{t+A-2}{A-1} \alpha_t g_t$ has an order of $O\big(\sum_{t=1}^{T} \binom{t+A-2}{A-1} \alpha_t\big)$. For example, if we select $\alpha_t = 1 / \big(\binom{t+A-2}{A-1}\, t\big)$, then we have $O\big(\sum_{t=1}^{T} \binom{t+A-2}{A-1} \alpha_t\big) = O(\ln T)$. Namely, this term can be made sub-linear. To analyze the order of $\sum_{t=1}^{T} \mathbb{E}[\min\{Q_t(x_{a_t}, 1-\alpha_t) + Q_t(x_{a_t^*}, 1-\alpha_t), g_t\}]$ we need more details of the bootstrapping oracle, and we defer it to the next section.

Bootstrapping Algorithm

We first apply the multiplier bootstrap technique to design an estimator for the quantile $q_t(x_a, \alpha)$. We establish sufficient conditions under which our estimator converges. These conditions and the estimator guide us to design an algorithm to implement the bootstrapping oracle.

Asymptotically Accurate Estimator

To be consistent with the condition in Definition 1, in this subsection we consider a deterministic $\mathcal{F}_t$. The quantile $q_t(x_a, \alpha)$ can be rewritten as
$$q_t(x_a, \alpha) = \inf\Big\{z \in \mathbb{R} \,\Big|\, \mathbb{P}\Big[\sum_{s=1}^{t-1} x_a^\top V_{t-1}^{-1} x_{a_s} (x_{a_s}^\top \theta^* - r_s) \le z\Big] \ge \alpha\Big\}.$$
In the above quantile, the randomness is caused by the rewards $r_s$, $s = 1, \dots, t-1$, which are independent samples from the reward distribution expressed in Equation (1). We apply the multiplier bootstrapping technique (Arlot et al. 2010) to resample the rewards $r_s$, $s = 1, \dots, t-1$, for the purpose of estimating the quantile $q_t(x_a, \alpha)$. Formally, we define an estimator for $q_t(x_a, \alpha)$ as:
$$\hat q_t(x_a, \epsilon, \alpha) \triangleq \inf\Big\{z \in \mathbb{R} \,\Big|\, \mathbb{P}_w\Big[\sum_{s=1}^{t-1} w_s\, x_a^\top V_{t-1}^{-1} x_{a_s}\, \epsilon_s \le z\Big] \ge \alpha\Big\},$$
where $w_1, \dots, w_{t-1}$ are independent and identically distributed (IID) random variables following the Rademacher distribution, i.e., $w_s = 1$ with probability 0.5 and $w_s = -1$ with probability 0.5. We define $w \triangleq (w_1, \dots, w_{t-1})$, the reward residual $\epsilon_s \triangleq x_{a_s}^\top \theta^* - r_s$, $\epsilon \triangleq (\epsilon_1, \dots, \epsilon_{t-1})$, and $\mathbb{P}_w$ as the probability with respect to the randomness caused by $w$. To analyze the estimator $\hat q_t(x_a, \epsilon, \alpha)$, we need to characterize the tail of the reward distribution. Let $\mathcal{Z}$ denote a space of random variables such that $W_{a,t} \in \mathcal{Z}$, $\forall a, t$. The convergence of the estimator $\hat q_t(x_a, \epsilon, \alpha)$ relies on the tail property of $\mathcal{Z}$ defined as follows.

Definition 2. Define a metric
$$H_{\mathcal{Z}}(z) \triangleq \sup_{Z \in \mathcal{Z}} \mathbb{E}\big[Z^2;\, Z^2 > z\big]$$
to quantify how heavy the tail of the space of random variables $\mathcal{Z}$ is, where $z \in \mathbb{R}_+$.

For each given $z$, a larger $H_{\mathcal{Z}}(z)$ implies that the tail of the space of random variables $\mathcal{Z}$ is heavier. The following assumption captures a class of distributions with nice tails.

Assumption 1. The space of random variables $\mathcal{Z}$ satisfies $\lim_{z \to \infty} H_{\mathcal{Z}}(z) = 0$.

Assumption 1 characterizes a broad class of distributions. For example, if $\mathcal{Z}$ is a space of random variables with a bounded domain, then Assumption 1 holds. If $\mathcal{Z}$ is a space of sub-Gaussian random variables, then Assumption 1 also holds.

Assumption 2. There exist $0 < c_1 < c_2$ such that the variance $\mathrm{Var}(W_{a,t}) \in [c_1, c_2]$, $\forall a, t$.
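The estimator $\hat q_t(x_a, \epsilon, \alpha)$ defined above can be approximated by Monte Carlo: draw Rademacher multipliers, form the weighted residual sums, and take their empirical $\alpha$-quantile. The sketch below does exactly this; the number of bootstrap replicates is an assumption (the experiments later in the paper also rely on Monte Carlo approximation).

```python
import numpy as np

def boot_quantile_mc(x_a: np.ndarray,
                     X_hist: np.ndarray,      # x_{a_1}, ..., x_{a_{t-1}}
                     residuals: np.ndarray,   # eps_s (true or empirical residuals)
                     V_inv: np.ndarray,       # V_{t-1}^{-1}
                     alpha: float,
                     n_boot: int = 1000,
                     rng=None) -> float:
    """Estimate the alpha-quantile of sum_s w_s * (x_a^T V^{-1} x_{a_s}) * eps_s, w_s Rademacher."""
    rng = rng or np.random.default_rng()
    coeffs = (X_hist @ V_inv @ x_a) * residuals              # per-sample terms (x_a^T V^{-1} x_{a_s}) eps_s
    w = rng.choice([-1.0, 1.0], size=(n_boot, len(coeffs)))  # i.i.d. Rademacher multipliers
    samples = w @ coeffs                                     # n_boot bootstrap replicates of the sum
    return float(np.quantile(samples, alpha))
```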
The next condition is essential for the convergence of the quantile estimator $\hat q_t(x_a, \epsilon, \alpha)$.

Condition 1. For any $a \in \mathcal{A}$, it holds that $\lim_{t \to \infty} \|x_a\|_{V_t^{-1}} = 0$.

Condition 1 identifies a sequence of well-behaved actions. Due to the correlation among the quantiles of arms, the quantile estimator $\hat q_t(x_a, \epsilon, \alpha)$ may not converge under poorly behaved action sequences. We will show how to design arm selection strategies to guarantee Condition 1 in the next section.

Theorem 2. Under Assumptions 1 and 2 and Condition 1, the quantile estimator $\hat q_t(x_a, \epsilon, \alpha)$ converges to the ground truth $q_t(x_a, \alpha)$, i.e.,
$$\lim_{t \to \infty} |\hat q_t(x_a, \epsilon, \alpha) - q_t(x_a, \alpha)| = 0, \quad \forall a, \alpha,$$
where $\mathcal{F}_{t-1}$ is deterministic.

Remark: Theorem 2 states sufficient conditions under which the estimator $\hat q_t(x_a, \epsilon, \alpha)$ converges to the ground truth. Namely, the estimator $\hat q_t(x_a, \epsilon, \alpha)$ is asymptotically accurate. However, it is difficult to implement, as it requires the preference parameter $\theta^*$. To address this difficulty, we replace $\theta^*$ with $\hat\theta_t$. Formally, we express the new estimator as
$$\hat q_t(x_a, \hat\epsilon, \alpha) \triangleq \inf\Big\{z \in \mathbb{R} \,\Big|\, \mathbb{P}_w\Big[\sum_{s=1}^{t-1} w_s\, x_a^\top V_{t-1}^{-1} x_{a_s}\, \hat\epsilon_s \le z\Big] \ge \alpha\Big\}, \qquad (2)$$
where $\hat\epsilon_s = x_{a_s}^\top \hat\theta_t - r_s$ denotes the empirical residual and $\hat\epsilon = (\hat\epsilon_1, \dots, \hat\epsilon_{t-1})$. The following theorem establishes the asymptotic convergence of this estimator.

Theorem 3. Under the same conditions as Theorem 2, the quantile estimator $\hat q_t(x_a, \hat\epsilon, \alpha)$ converges to the ground truth $q_t(x_a, \alpha)$, i.e.,
$$\lim_{t \to \infty} |\hat q_t(x_a, \hat\epsilon, \alpha) - q_t(x_a, \alpha)| = 0, \quad \forall a, \alpha,$$
where $\mathcal{F}_{t-1}$ is deterministic.

Remark: Theorem 3 states sufficient conditions under which the quantile estimator $\hat q_t(x_a, \hat\epsilon, \alpha)$ converges to the ground truth quantile $q_t(x_a, \alpha)$. Note that it requires the same conditions as Theorem 2, i.e., no extra conditions are needed.

Non-Asymptotic Validity of Estimator

The quantile estimator expressed in Eq. (2) has the nice asymptotic accuracy property. However, it cannot be directly used to design our bootstrapping oracle, as we do not know its confidence level. We now establish its confidence level via a second-order correction (Arlot et al. 2010; Hao et al. 2019). The second-order correction relies on the tail property of the reward distribution, in particular the concentration behavior of $\mathcal{E}(x_a, \mathcal{F}_{t-1})$. We define a function $\beta(\cdot)$ to characterize this concentration behavior.

Definition 3. Define $\beta(\cdot)$ as a function such that
$$\mathbb{P}\big[\mathcal{E}(x_a, \mathcal{F}_{t-1}) \ge \beta(\alpha)\, \|x_a\|_{V_{t-1}^{-1}}\big] \le \alpha,$$
where $\mathcal{F}_{t-1}$ is deterministic.

For example, as derived in Lemma 1, when the reward follows a sub-Gaussian distribution, the function $\beta(\alpha)$ has the expression $\beta(\alpha) = O(\ln(1/\alpha))$. For other distributions with tails heavier than the sub-Gaussian distribution (Bubeck, Cesa-Bianchi, and Lugosi 2013), we may have $\beta(\alpha) = O(\mathrm{poly}(1/\alpha))$. Furthermore, $\beta(\alpha)$ can be made arbitrarily large, e.g., $\beta(\alpha) = \exp(1/\alpha)$ or $\beta(\alpha) = \infty$, to characterize reward distributions with even heavier tails. Namely, $\beta(\alpha)$ can characterize the full design space of reward distributions.

Theorem 4. Suppose Condition 1 holds and $\mathcal{F}_t$ is deterministic. Suppose the $W_{a,t}$, $\forall a, t$, follow symmetric distributions. Then we have:
$$\mathbb{P}\big[\mathcal{E}(x_a, \mathcal{F}_{t-1}) \ge \hat q_t(x_a, \hat\epsilon, 1-\alpha(1-\delta)) + \Delta(t, \alpha)\big] \le 2\alpha,$$
where $\delta \in (0, 1)$, $\Delta(t, \alpha)$ is a second-order correction term built from the quantities $(x_a^\top V_t^{-1} x_{a_s})^2 \|x_{a_s}\|_{V_t^{-1}}^2$, $s = 1, \dots, t-1$, and a logarithmic factor, and $x_a$ is independent of $x_{a_1}, \dots, x_{a_{t-1}}$. Furthermore, $\lim_{t \to \infty} \Delta(t, \alpha) = 0$ holds for any fixed $\alpha$.

Remark: Theorem 4 states the sufficient conditions and a second-order correction term $\Delta(t, \alpha)$ under which the quantile estimator $\hat q_t(x_a, \hat\epsilon, \alpha)$ can provide a confidence level. Furthermore, for each fixed confidence level, the second-order correction term $\Delta(t, \alpha)$ vanishes.
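To connect Eq. (2) and Theorems 3 and 4 to the earlier sketch, the empirical residuals $\hat\epsilon_s = x_{a_s}^\top \hat\theta_t - r_s$ can be computed directly from the history and plugged into the same Rademacher resampling; the data, noise level, and quantile level below are synthetic assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n_hist, lam = 4, 50, 1.0
theta_star = rng.uniform(size=d)

# Synthetic history H_{t-1}: features x_{a_s} and observed rewards r_s (assumptions).
X_hist = rng.uniform(size=(n_hist, d))
r_hist = X_hist @ theta_star + rng.normal(0.0, 0.5, size=n_hist)

V_inv = np.linalg.inv(lam * np.eye(d) + X_hist.T @ X_hist)
theta_hat = V_inv @ (X_hist.T @ r_hist)          # ridge estimate from the history

eps_hat = X_hist @ theta_hat - r_hist            # empirical residuals x_{a_s}^T theta_hat - r_s
x_a = rng.uniform(size=d)

# Same Rademacher resampling as before, now driven by the empirical residuals.
coeffs = (X_hist @ V_inv @ x_a) * eps_hat        # (x_a^T V^{-1} x_{a_s}) * eps_hat_s
w = rng.choice([-1.0, 1.0], size=(1000, n_hist)) # i.i.d. Rademacher multipliers
q_hat = float(np.quantile(w @ coeffs, 0.95))     # bootstrapped (1 - alpha_t)-quantile, alpha_t = 0.05
print(f"bootstrapped 0.95-quantile: {q_hat:.4f}")
```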
Comparing with Theorem 3, one can observe that Theorem 4 requires an extra condition that the reward distribution is symmetric. Note that this extra condition is not due to the contextual bandit, but instead comes from the bootstrapping technique. To the best of our knowledge, handling non-symmetric distributions for non-asymptotic convergence is an open problem (Arlot et al. 2010; Chernozhukov et al. 2014; Yang, Shang, and Cheng 2017).

Quantile Bootstrapping Algorithm Design

Even though the corrected quantile estimator derived in Theorem 4, i.e., $\hat q_t(x, \hat\epsilon, 1-\alpha(1-\delta)) + \Delta(t, \alpha)$, has the desired property stated in Definition 1, one should not directly implement the bootstrapping oracle BootQuantile as $\hat q_t(x, \hat\epsilon, 1-\alpha(1-\delta)) + \Delta(t, \alpha)$. This is because this implementation may not guarantee Condition 1 to hold, which in turn may cause Theorem 4 to fail. We will first refine $\hat q_t(x, \hat\epsilon, 1-\alpha(1-\delta)) + \Delta(t, \alpha)$, and then use this refinement to design the bootstrapping oracle BootQuantile such that Condition 1 can be guaranteed. To facilitate the analysis, we define the following notation to quantify the norm of a set of arms:
$$\|\mathcal{A}_t\|_{V_{t-1}^{-1}} \triangleq \max_{a \in \mathcal{A}_t} \|x_a\|_{V_{t-1}^{-1}}.$$
Based on this notation, the following lemma states a sufficient condition to guarantee Condition 1.

Lemma 2. Suppose it holds that
$$\dim(\mathcal{A}_t) = \dim(\mathcal{A}), \quad \forall t. \qquad (3)$$
Then $\limsup_{t \to \infty} \|\mathcal{A}_t\|_{V_{t-1}^{-1}} = 0$ implies Condition 1.

Combined with Lemma 2, the following lemma yields another sufficient condition for Condition 1.

Lemma 3. Suppose Eq. (3) holds. We have:
$$\limsup_{t \to \infty} \|\mathcal{A}_t\|_{V_{t-1}^{-1}} > 0 \implies \liminf_{t \to \infty} \|\mathcal{A}_t\|_{V_{t-1}^{-1}} > 0.$$

Lemma 3, combined with Lemma 2, states that if $\liminf_{t \to \infty} \|\mathcal{A}_t\|_{V_{t-1}^{-1}} > 0$ does not hold, then Condition 1 holds. It is easier to show that $\liminf_{t \to \infty} \|\mathcal{A}_t\|_{V_{t-1}^{-1}} > 0$ leads to a contradiction than to directly show $\limsup_{t \to \infty} \|\mathcal{A}_t\|_{V_{t-1}^{-1}} = 0$. The following theorem states the sufficient conditions under which $\liminf_{t \to \infty} \|\mathcal{A}_t\|_{V_{t-1}^{-1}} > 0$ leads to a contradiction, and it further refines the quantile estimator.

Theorem 5. Suppose Eq. (3) holds and $\alpha_t$ goes to zero as $t$ goes to infinity. If BootQuantile($x$, $\mathcal{H}_{t-1}$, $1-\alpha_t$) returns $\tilde q_t\big(x, \hat\epsilon, 1 - \tfrac{\alpha_t}{2}(1-\delta)\big)$, then Algorithm 1 guarantees Condition 1, where the refined quantile estimator $\tilde q_t(x, \hat\epsilon, \alpha)$ is defined as
$$\tilde q_t(x, \hat\epsilon, \alpha) \triangleq \begin{cases} +\infty, & \text{if } \dim(\{x_{a_1}, \dots, x_{a_{t-1}}, x\}) > \dim(\{x_{a_1}, \dots, x_{a_{t-1}}\}), \\ \min\{x^\top \hat\theta_t + \hat q_t(x, \hat\epsilon, \alpha),\, 2SL\} - x^\top \hat\theta_t + \Delta(t, 1-\alpha), & \text{otherwise}, \end{cases}$$
where $\|x_a\| \le S$, $\forall a \in \mathcal{A}$.

Remark: Theorem 5 states sufficient conditions on the action set $\mathcal{A}_t$ and the confidence parameter $\alpha_t$, together with a refined quantile estimator, such that Condition 1 holds. The condition on $\alpha_t$ means that $\alpha_t$ vanishes, which commonly holds. The condition on the action set $\mathcal{A}_t$ is that the dimension of the linear space spanned by $\mathcal{A}_t$ is the same as that spanned by $\mathcal{A}$. The main purpose of this condition is to make the proof and the statement of Theorem 5 clean. The refined estimator $\tilde q_t(x, \hat\epsilon, \alpha)$ means that we should always give the highest priority to those arms whose feature vector can increase the dimension of the linear space spanned by the feature vectors of historical actions. The following lemma states that we can still provide a confidence level for the refined quantile estimator.

Lemma 4. The refined quantile estimator $\tilde q_t(x, \hat\epsilon, \alpha)$ has the following confidence level:
$$\mathbb{P}\Big[\mathcal{E}(x_a, \mathcal{F}_{t-1}) \le \tilde q_t\big(x_a, \hat\epsilon, 1 - \tfrac{\alpha}{2}(1-\delta)\big)\Big] \ge 1 - \alpha,$$
where $\mathcal{F}_t$ is deterministic.

Lemma 4 provides a confidence level for the refined quantile estimator. Namely, the refined quantile estimator has the nice properties defined in Definition 1. Thus, we use this refined quantile estimator to design our bootstrapping oracle. Algorithm 2 outlines the procedure to compute the refined oracle.

Algorithm 2: BootQuantile($x_a$, $\mathcal{H}_{t-1}$, $1-2\alpha_t$)
1: $a \leftarrow x_a^\top \hat\theta_t$.
2: Compute $\hat\epsilon$ from the reward history $\mathcal{H}_{t-1}$.
3: $c \leftarrow \Delta(t, \alpha_t)$.
4: $b \leftarrow \hat q_t(x_a, \hat\epsilon, 1-\alpha_t(1-\delta))$.
5: if $\dim(\{x_{a_1}, \dots, x_{a_{t-1}}\}) < \dim(\{x_{a_1}, \dots, x_{a_{t-1}}, x_a\})$ then
6:   RETURN $+\infty$.
7: else
8:   RETURN $\min\{a + b,\, 2SL\} - a + c$.
9: end if
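A possible Python rendering of the refined oracle along the lines of Algorithm 2: arms whose feature enlarges the span of the historical features receive an infinite index; otherwise the bootstrapped index is clipped at $2SL$ and the correction is added. The correction value `delta_corr` is taken as an input because the exact expression of $\Delta(t, \alpha_t)$ is not reproduced here; the rank test and the constants are assumptions.

```python
import numpy as np

def refined_boot_quantile(x_a: np.ndarray,
                          X_hist: np.ndarray,   # historical features x_{a_1}, ..., x_{a_{t-1}}
                          theta_hat: np.ndarray,
                          q_hat: float,         # bootstrapped quantile q_hat_t(x_a, eps_hat, 1 - alpha_t(1-delta))
                          delta_corr: float,    # second-order correction Delta(t, alpha_t)
                          S: float, L: float) -> float:
    """Refined quantile estimator used by the BootQuantile oracle (Theorem 5 / Algorithm 2)."""
    # Arms whose feature increases dim(span{x_{a_1}, ..., x_{a_{t-1}}}) get top priority.
    if np.linalg.matrix_rank(np.vstack([X_hist, x_a])) > np.linalg.matrix_rank(X_hist):
        return np.inf
    a = float(x_a @ theta_hat)
    return min(a + q_hat, 2.0 * S * L) - a + delta_corr
```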
The key step of Algorithm 2 is step 4, which computes the estimator $\hat q_t(x_a, \hat\epsilon, 1-\alpha_t(1-\delta))$. It is computationally expensive to compute the exact value of $\hat q_t(x_a, \hat\epsilon, 1-\alpha_t(1-\delta))$. In the implementation, one can use Monte Carlo simulation to approximate it, which is quite common in the multiplier bootstrapping literature (Arlot et al. 2010; Hao et al. 2019). The following theorem derives the regret bound.

Theorem 6. Suppose Eq. (3) holds. Under the conditions of Theorem 1 and Algorithm 2, we have
$$R_T(\mathcal{A}_{\text{BootLinUCB}}) \le \sum_{t=\dim(\mathcal{A})+1}^{T} \mathbb{E}\big[\tilde q_t(x_{a_t}, \hat\epsilon, 1-\alpha_t) + \tilde q_t(x_{a_t^*}, \hat\epsilon, 1-\alpha_t)\big].$$
Furthermore, suppose the reward follows a sub-Gaussian distribution and $\alpha_t = 1 / \big(\binom{t+A-2}{A-1}\, t\big)$; then the above regret has an order of
$$R_T(\mathcal{A}_{\text{BootLinUCB}}) = O\big(A \sqrt{T} (\ln T)^2\big).$$

Remark: Theorem 6 states a general regret upper bound for our BootLinUCB algorithm under the refined quantile estimator. This regret upper bound can be further simplified to be sub-linear if the reward follows a sub-Gaussian distribution.

Experiments on Synthetic Data

Experiment setting. We compare our BootLinUCB algorithm with the latest bootstrapping based LinUCB algorithm (Hao et al. 2019) and the classical LinUCB algorithm (Chu et al. 2011). Since the latest bootstrapping based LinUCB algorithm (Hao et al. 2019) does not have a theoretical guarantee, we denote it by BootLinHeu, where Heu stands for heuristic. Consider $T = 2000$ decision rounds. The $W_{a,t}$, $\forall a, t$, follow a normal distribution with mean 0 and variance $\sigma$. We generate the feature vectors of the $A$ arms as follows: (1) generate $\min\{d, A\}$ orthogonal feature vectors with unit square norm (for details refer to our code); (2) each of the remaining $A - \min\{d, A\}$ feature vectors is drawn from $[0, 1]^d$ uniformly at random. The preference parameter $\theta^*$ is drawn from $[0, 1]^d$ uniformly at random. In each round, we set $\mathcal{A}_t = \mathcal{A}$. Similar to previous works (Arlot et al. 2010; Hao et al. 2019), we use Monte Carlo simulation with 1000 simulation rounds to estimate the quantile estimator $\hat q_t(x, \hat\epsilon, \alpha)$. We set $\alpha_t = 1/\sqrt{t+2}$ and $\delta = 1/(t+2)$ for our BootLinUCB, and set $\alpha_t = 1/\sqrt{t+2}$ for the LinUCB algorithm. We set the parameters of the BootLinHeu algorithm according to (Hao et al. 2019). Unless stated otherwise, we consider the following default parameters: $A = 20$ arms, feature dimension $d = 10$, regularization parameter $\lambda = 1$, and reward variance $\sigma = 1$. For each algorithm, we repeatedly run it 100 times and report its average regret over the 100 runs.
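A sketch of the synthetic arm-feature generation described above: $\min\{d, A\}$ orthogonal unit-norm vectors plus uniformly random remaining features and a uniformly random $\theta^*$. Obtaining the orthonormal block via a QR decomposition is an assumption; the paper points to its code for the exact construction.

```python
import numpy as np

def make_arms(A: int, d: int, rng=None):
    """Generate A feature vectors in R^d: min{d, A} orthonormal ones, the rest uniform on [0, 1]^d."""
    rng = rng or np.random.default_rng()
    k = min(d, A)
    # Orthonormal block via QR of a random matrix (one way to get unit-norm orthogonal vectors).
    Q, _ = np.linalg.qr(rng.normal(size=(d, k)))
    X = np.vstack([Q.T[:k], rng.uniform(0.0, 1.0, size=(A - k, d))])
    theta_star = rng.uniform(0.0, 1.0, size=d)
    return X, theta_star

X, theta_star = make_arms(A=20, d=10, rng=np.random.default_rng(5))
print(X.shape, np.allclose(X[:10] @ X[:10].T, np.eye(10)))   # first 10 rows are orthonormal
```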
Convergence comparison. We first use a specific setting to show that BootLinHeu (Hao et al. 2019) has a risk of diverging, while the BootLinUCB algorithm does not have this risk. We set $A = 10$ and $d = 5$. Five of the feature vectors are the five standard basis vectors, and the remaining five, as well as $\theta^*$, are: (0.08, 0.32, 0.22, 0.14, 0.73), (0.1, 0.22, 0.15, 0.09, 0.68), (0.58, 0.87, 0.32, 0.3, 0.14), (0.18, 0.88, 0.83, 0.24, 0.65), (0.86, 0.2, 0.51, 0.83, 0.97), and $\theta^* = (0.23, 0.36, 0.61, 0.26, 0.73)$. Figure 1 shows ten regret curves for the BootLinUCB and BootLinHeu algorithms, respectively. One can observe that all regret curves of BootLinUCB are flat, while three out of ten regret curves of BootLinHeu increase linearly in $t$. This implies that BootLinHeu has a risk of diverging (i.e., always selecting a sub-optimal arm, leading to regret that increases linearly in $t$), while our BootLinUCB algorithm does not have this risk. Thus, in the remaining experiments we do not further compare with BootLinHeu.

[Figure 1: Regret curves of BootLinUCB and BootLinHeu. Panel (a) shows ten regret curves of BootLinUCB, all converging; panel (b) shows ten regret curves of BootLinHeu, illustrating the possibility of divergence. Axes: T vs. regret.]

Mismatch of reward tail distribution. We now study the robustness of our BootLinUCB algorithm with respect to a mismatch of the reward distribution tail. We input the variance of the reward distribution $\sigma_{\text{in}} = 1$ to both LinUCB and BootLinUCB. Figure 2 shows the average regret of LinUCB and BootLinUCB as we vary the ground truth variance $\sigma$ from 0.1 to 0.5. One can observe that when $\sigma = 0.1$, the average regret of BootLinUCB is around half of that of LinUCB ($T = 2000$). Increasing the ground truth variance to $\sigma = 0.5$, the average regret of BootLinUCB is around $100/160 = 5/8$ of that of LinUCB ($T = 2000$). These results further confirm that BootLinUCB can significantly reduce the over-exploration problem of LinUCB caused by a mismatch of the reward tail distribution.

[Figure 2: Regret under mismatch of the reward tail distribution; panels (a) $\sigma = 0.1$ and (b) $\sigma = 0.5$ plot the average regret of LinUCB and BootLinUCB against $T$.]

Conclusion

This paper presents the BootLinUCB algorithm for the contextual bandit. BootLinUCB is a data-driven UCB based algorithm, which uses the multiplier bootstrapping technique to estimate the UCB of the contextual bandit from the historical rewards. BootLinUCB is more robust to misspecification of the reward tail distribution than previous reward-tail based UCB algorithms for the contextual bandit. In particular, we design an estimator for the UCB of the contextual bandit with theoretical guarantees on its convergence. Based on the estimator, we design our BootLinUCB algorithm. We also prove that BootLinUCB has a sub-linear regret upper bound and conduct extensive experiments to show its superior performance over a variety of baselines. Our approach opens the door to applying similar bootstrapping techniques to other online learning or reinforcement learning algorithms.

Acknowledgments

This work was supported in part by Chongqing Natural Science Foundation (cstc2020jcyj-msxmX0652), Chongqing High-Technology Innovation and Application Development Funds (cstc2019jscx-msxmX0422, cstc2019jscx-fxyd0385) and the Fundamental Research Funds for the Central Universities (2020CDJ-LHZZ-057). Hong Xie is the corresponding author.

References

Abbasi-Yadkori, Y.; Antos, A.; and Szepesvári, C. 2009. Forced-exploration based algorithms for playing in stochastic linear bandits. In COLT Workshop on On-line Learning with Limited Feedback, volume 91, 235.
Abbasi-Yadkori, Y.; Pál, D.; and Szepesvári, C. 2011. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, 2312–2320.
Agrawal, S.; and Goyal, N. 2013. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, 127–135.
Arlot, S.; Blanchard, G.; Roquain, E.; et al. 2010. Some nonasymptotic results on resampling in high dimension, I: confidence regions. In The Annals of Statistics, volume 38, 51–82. Institute of Mathematical Statistics.
Auer, P. 2002. Using confidence bounds for exploitation-exploration trade-offs. In Journal of Machine Learning Research, volume 3, 397–422.
Bubeck, S.; Cesa-Bianchi, N.; and Lugosi, G. 2013. Bandits with heavy tail. In IEEE Transactions on Information Theory, volume 59, 7711–7717. IEEE.
Chernozhukov, V.; Chetverikov, D.; Kato, K.; et al. 2014. Gaussian approximation of suprema of empirical processes. In The Annals of Statistics, volume 42, 1564–1597. Institute of Mathematical Statistics.
Chu, W.; Li, L.; Reyzin, L.; and Schapire, R. 2011. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 208–214.
Eckles, D.; and Kaptein, M. 2014. Thompson sampling with the online bootstrap. In arXiv preprint arXiv:1410.4009.
Elmachtoub, A. N.; McNellis, R.; Oh, S.; and Petrik, M. 2017. A practical method for solving contextual bandit problems using decision trees. In arXiv preprint arXiv:1706.04687.
Filippi, S.; Cappe, O.; Garivier, A.; and Szepesvári, C. 2010. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, 586–594.
Gampa, P.; and Fujita, S. 2019. BanditRank: Learning to Rank Using Contextual Bandits. In arXiv preprint arXiv:1910.10410.
Glowacka, D.; et al. 2019. Bandit algorithms in information retrieval. In Foundations and Trends in Information Retrieval, volume 13, 299–424. Now Publishers, Inc.
Hao, B.; Yadkori, Y. A.; Wen, Z.; and Cheng, G. 2019. Bootstrapping Upper Confidence Bound. In Advances in Neural Information Processing Systems, 12123–12133.
Hofmann, K.; Whiteson, S.; de Rijke, M.; et al. 2011. Contextual bandits for information retrieval. In NIPS 2011 Workshop on Bayesian Optimization, Experimental Design, and Bandits, Granada, volume 12.
Krishnamurthy, A.; Wu, Z. S.; and Syrgkanis, V. 2018. Semiparametric contextual bandits. In arXiv preprint arXiv:1803.04204.
Kveton, B.; Szepesvári, C.; Vaswani, S.; Wen, Z.; Lattimore, T.; and Ghavamzadeh, M. 2019. Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits. In Chaudhuri, K.; and Salakhutdinov, R., eds., Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, 3601–3610. Long Beach, California, USA: PMLR.
Lattimore, T.; and Szepesvári, C. 2020. Bandit algorithms. Cambridge University Press.
Li, L.; Chu, W.; Langford, J.; and Schapire, R. E. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, 661–670. ACM.
Osband, I.; and Van Roy, B. 2015. Bootstrapped Thompson sampling and deep exploration. In arXiv preprint arXiv:1507.00300.
Ouyang, T.; Li, R.; Chen, X.; Zhou, Z.; and Tang, X. 2019. Adaptive User-managed Service Placement for Mobile Edge Computing: An Online Learning Approach. In IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, 1468–1476. IEEE.
Sudarsanam, N.; and Ravindran, B. 2016. Linear Bandit algorithms using the Bootstrap. In arXiv preprint arXiv:1605.01185.
Tang, L.; Jiang, Y.; Li, L.; Zeng, C.; and Li, T. 2015. Personalized recommendation via parameter-free contextual bandits. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, 323–332.
Vaswani, S.; Kveton, B.; Wen, Z.; Rao, A.; Schmidt, M.; and Abbasi-Yadkori, Y. 2018. New insights into bootstrapping for bandits. In arXiv preprint arXiv:1805.09793.
Wang, H.; Wu, Q.; and Wang, H. 2016. Learning Hidden Features for Contextual Bandits. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM '16, 1633–1642. New York, NY, USA: ACM. ISBN 978-1-4503-4073-1. doi:10.1145/2983323.2983847. URL http://doi.acm.org/10.1145/2983323.2983847.
Yang, Y.; Shang, Z.; and Cheng, G. 2017. Non-asymptotic theory for nonparametric testing. In arXiv preprint arXiv:1702.01330.
Zhang, X.; Xie, H.; Li, H.; and Lui, J. 2019. Toward Building Conversational Recommender Systems: A Contextual Bandit Approach. In arXiv preprint arXiv:1906.01219.
Zhu, F.; Zhu, X.; Wang, S.; Yao, J.; and Huang, J. 2017. Robust Contextual Bandit via the Capped-ℓ2 Norm. In arXiv preprint arXiv:1708.05446.