# Doubly Robust Thompson Sampling with Linear Payoffs

Wonyoung Kim (Department of Statistics, Seoul National University) eraser347@snu.ac.kr · Gi-Soo Kim (Department of Industrial Engineering & Artificial Intelligence Graduate School, UNIST) gisookim@unist.ac.kr · Myunghee Cho Paik (Department of Statistics, Seoul National University; Shepherd23 Inc.) myungheechopaik@snu.ac.kr

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

A challenging aspect of the bandit problem is that a stochastic reward is observed only for the chosen arm and the rewards of the other arms remain missing. The dependence of the arm choice on the past context and reward pairs compounds the complexity of regret analysis. We propose a novel multi-armed contextual bandit algorithm called Doubly Robust (DR) Thompson Sampling, which applies the doubly robust estimator used in the missing data literature to Thompson Sampling with linear contextual payoffs (LinTS). Different from previous works relying on missing data techniques (Dimakopoulou et al. [2019], Kim and Paik [2019]), the proposed algorithm is designed to allow a novel additive regret decomposition leading to an improved regret bound of order $O(\phi^{-2}\sqrt{T})$, where $\phi^2$ is the minimum eigenvalue of the covariance matrix of the contexts. This is the first regret bound of LinTS expressed through $\phi^2$ without the dimension of the context, $d$. Applying the relationship between $\phi^2$ and $d$, the regret bound of the proposed algorithm is $O(d\sqrt{T})$ in many practical scenarios, improving the bound of LinTS by a factor of $\sqrt{d}$. A benefit of the proposed method is that it utilizes all the context data, chosen or not chosen, which allows us to circumvent the technical definition of unsaturated arms used in the theoretical analysis of LinTS. Empirical studies show the advantage of the proposed algorithm over LinTS.

1 Introduction

Contextual bandits have become popular in sequential decision tasks such as news article recommendation systems. In bandit problems, the learner sequentially pulls one arm among multiple arms and receives a random reward at each round. While not knowing the compensation mechanism of the rewards, the learner should make decisions that maximize the cumulative sum of rewards. In the course of gaining information about the compensation mechanism through feedback, the learner should carefully balance exploitation, pulling the best arm based on the information accumulated so far, and exploration, pulling an arm that will assist in future choices although it does not seem to be the best option at the moment. Therefore, in the bandit problem, estimation or learning is an important element besides decision making.

Table 1: Data used in complete record analysis (left) and in the DR method (right) under the multi-armed contextual bandit setting. The contexts, rewards, and DR imputed values are denoted by $X$, $Y$, and $Y^{DR}$, respectively. A question mark denotes the missing reward of an unchosen arm.

Complete record analysis:

| | $t = 1$ | $t = 2$ |
|---|---|---|
| Arm 1 | $X_1(1)$, ? | $X_1(2)$, ? |
| Arm 2 | $X_2(1)$, ? | $X_{a_2}(2)$, $Y_{a_2}(2)$ |
| Arm 3 | $X_{a_1}(1)$, $Y_{a_1}(1)$ | $X_3(2)$, ? |
| Arm 4 | $X_4(1)$, ? | $X_4(2)$, ? |

DR method:

| | $t = 1$ | $t = 2$ |
|---|---|---|
| Arm 1 | $X_1(1)$, $Y^{DR}_1(1)$ | $X_1(2)$, $Y^{DR}_1(2)$ |
| Arm 2 | $X_2(1)$, $Y^{DR}_2(1)$ | $X_{a_2}(2)$, $Y^{DR}_{a_2}(2)$ |
| Arm 3 | $X_{a_1}(1)$, $Y^{DR}_{a_1}(1)$ | $X_3(2)$, $Y^{DR}_3(2)$ |
| Arm 4 | $X_4(1)$, $Y^{DR}_4(1)$ | $X_4(2)$, $Y^{DR}_4(2)$ |

A challenging aspect of estimation in the bandit problem is that a stochastic reward is observed only for the chosen arm.
Consequently, only the context and reward pair of the chosen arm is used for estimation, which makes the data used at each round dependent on the past contexts and rewards. To handle this difficulty, we view bandit problems as missing data problems. The first step in handling missing data is to define the full, observed, and missing data. In bandit settings, the full data consist of the rewards and contexts of all arms; the observed data consist of the contexts of all arms and the reward of the chosen arm; the missing data consist of the rewards of the arms that are not chosen. Typical estimation procedures require both the reward and the context to be observed, so the observed contexts of the unselected arms are discarded (see Table 1). Analysis based on the completely observed pairs only is called complete record analysis. Most stochastic bandit algorithms use estimates based on complete record analysis. Estimators from complete record analysis are known to be inefficient. Moreover, in the bandit setting, using observed data whose probability of observation depends on previous rewards requires special theoretical treatment.

There are two main approaches to missing data: imputation and inverse probability weighting (IPW). Imputation fills in the predicted value of the missing data from a specified model, while IPW uses the observed records only but weights them by the inverse of the observation probability. The doubly robust (DR) method [Robins et al., 1994, Bang and Robins, 2005] is a combination of the imputation and IPW tools. We provide a review of missing data and DR methods in the supplementary materials. The robustness against model misspecification that the DR method offers in missing data settings is of little consequence in the bandit setting, since the probability of observation, i.e., of allocation to an arm, is known. The merit of the DR method in the bandit setting is its ability to employ all the contexts, including those of unselected arms.

We propose a novel multi-armed contextual bandit algorithm called Doubly Robust Thompson Sampling (DRTS) that applies the DR technique used in the missing data literature to Thompson Sampling with linear contextual payoffs (LinTS). The main thrust of DRTS is to utilize the context information of all arms, not just the chosen arms. By using the unselected yet observed contexts, along with a novel algorithmic device, the proposed algorithm renders a unique regret decomposition which leads to a novel regret bound without resorting to the technical definition of unsaturated arms used by Agrawal and Goyal [2014]. Since categorizing the arms into saturated vs. unsaturated plays a critical role in costing an extra $\sqrt{d}$, by circumventing it we prove a $O(d\sqrt{T})$ bound on the cumulative regret in many practical cases, compared to the $O(d^{3/2}\sqrt{T})$ bound shown in Agrawal and Goyal [2014].

The main contributions of this paper are as follows.

- We propose a novel contextual bandit algorithm that improves the cumulative regret bound of LinTS by a factor of $\sqrt{d}$ (Theorem 1) in many practical scenarios (Section 4.1). This improvement is attained mainly by defining a novel set called super-unsaturated arms, which becomes utilizable due to the proposed estimator and the resampling technique adopted in the algorithm.
- We provide a novel estimation error bound for the proposed estimator (Theorem 3) which depends on the minimum eigenvalue of the covariance matrix of the contexts from all arms, without $d$.
- We develop a novel dimension-free concentration inequality for sub-Gaussian vector martingales (Lemma 4) and use it in deriving our regret bound in place of the self-normalized theorem by Abbasi-Yadkori et al. [2011].
- We develop a novel concentration inequality for bounded matrix martingales (Lemma 6) which improves the existing result (Proposition 5) by removing the dependency on $d$ in the bound. Lemma 6 also allows eliminating the forced sampling phases required in some bandit algorithms relying on Proposition 5 [Amani et al., 2019, Bastani and Bayati, 2020].

All missing proofs are in the supplementary materials.

2 Related works

Thompson Sampling [Thompson, 1933] has been extensively studied and has shown strong performance in many applications (e.g., Chapelle and Li [2011]). Agrawal and Goyal [2013] were the first to prove theoretical bounds for LinTS, and an alternative proof is given by Abeille et al. [2017]. Both papers show a $O(d^{3/2}\sqrt{T})$ regret bound, which is known as the best regret bound for LinTS. Recently, Hamidi and Bayati [2020] point out that $O(d^{3/2}\sqrt{T})$ could be the best possible bound when the estimator used by LinTS is employed. In our work, we improve this regret bound by a factor of $\sqrt{d}$ in many practical scenarios through a novel definition of super-unsaturated arms, which becomes utilizable due to the proposed estimator and the resampling device implemented in the algorithm.

Our work assumes the independence of the contexts from all arms across time rounds. Some notable works have used the assumption that the contexts are independently and identically distributed (IID). Leveraging the IID assumption with a margin condition, Goldenshluger and Zeevi [2013] derive a two-armed linear contextual bandit algorithm with a regret upper bound of order $O(d^3\log T)$. Bastani and Bayati [2020] extend this algorithm to any number of arms and improve the regret bound to $O(d^2\log^{\frac{3}{2}}d\,\log T)$. The margin condition states that the gap between the expected rewards of the optimal arm and the next best arm is nonzero with some constant probability. This condition is crucial in achieving a $O(\log T)$ regret bound instead of $O(\sqrt{T})$. In this paper, we do not assume the margin condition and focus on the dependence on the dimension of the contexts, $d$.

From a missing data point of view, most stochastic contextual bandit algorithms use an estimator from complete record analysis, with the exceptions of Dimakopoulou et al. [2019] and Kim and Paik [2019]. Dimakopoulou et al. [2019] employ an IPW estimator based on the selected contexts alone and prove a $O(d\sqrt{\epsilon^{-1}T^{1+\epsilon}N})$ regret bound for their algorithm, which depends on the number of arms, $N$. Kim and Paik [2019] consider the high-dimensional setting with sparsity, utilize a DR technique, and improve the regret bound in terms of the sparse dimension instead of the actual dimension of the context, $d$. Kim and Paik [2019] differs from ours in several aspects: the mode of exploration ($\epsilon$-greedy vs. Thompson Sampling), the mode of regularization (Lasso vs. ridge regression), and the form of the estimator. A sharp distinction between the two estimators lies in the fact that Kim and Paik [2019] aggregate contexts and rewards over the arms, although they employ all the contexts. If we apply this aggregating estimator and the DR-Lasso bandit algorithm to the low-dimensional setting, we obtain a regret bound of order $O(\sqrt{N}d\sqrt{T})$ when the contexts from the arms are independent. This bound is bigger than our bound by factors of $\sqrt{d}$ and $\sqrt{N}$.
This is because the aggregated form of the estimator does not permit the novel regret decomposition derived in Section 4.2. The proposed estimator, coupled with a novel algorithmic device, renders the additive regret decomposition which in turn improves the order of the regret bound.

3 Proposed estimator and algorithm

3.1 Settings and assumptions

We denote the $d$-dimensional context for the $i$th arm at round $t$ by $X_i(t)\in\mathbb{R}^d$, and the corresponding random reward by $Y_i(t)$, for $i=1,\ldots,N$. We assume $E[Y_i(t)\,|\,X_i(t)] = X_i(t)^T\beta$ for some unknown parameter $\beta\in\mathbb{R}^d$. At round $t$, the arm that the learner chooses is denoted by $a_t\in\{1,\ldots,N\}$, and the optimal arm by $a_t^* := \arg\max_{i=1,\ldots,N} X_i(t)^T\beta$. Let $\mathrm{regret}(t)$ be the difference between the expected rewards of the optimal arm and of the chosen arm at round $t$, i.e., $\mathrm{regret}(t) := X_{a_t^*}(t)^T\beta - X_{a_t}(t)^T\beta$. The goal is to minimize the sum of regrets over $T$ rounds, $R(T) := \sum_{t=1}^T \mathrm{regret}(t)$. The total number of rounds $T$ is finite but possibly unknown. We also make the following assumptions.

Assumption 1 (Boundedness for scale-free regrets). For all $i=1,\ldots,N$ and $t=1,\ldots,T$, we have $\|X_i(t)\|_2\le 1$ and $\|\beta\|_2\le 1$.

Assumption 2 (Sub-Gaussian error). Let $\mathcal{H}_t := \bigcup_{\tau=1}^{t-1}\bigl(\{X_i(\tau)\}_{i=1}^N\cup\{a_\tau\}\cup\{Y_{a_\tau}(\tau)\}\bigr)\cup\{X_i(t)\}_{i=1}^N$ be the set of observed data at round $t$. For each $t$ and $i$, the error $\eta_i(t) := Y_i(t) - X_i(t)^T\beta$ is conditionally zero-mean $\sigma$-sub-Gaussian for a fixed constant $\sigma\ge 0$, i.e., $E[\eta_i(t)\,|\,\mathcal{H}_t]=0$ and $E[\exp(\lambda\eta_i(t))\,|\,\mathcal{H}_t]\le\exp(\lambda^2\sigma^2/2)$ for all $\lambda\in\mathbb{R}$. Furthermore, the distribution of $\eta_i(t)$ does not depend on the arm chosen at round $t$, i.e., on $a_t$.

Assumption 3 (Independently distributed contexts). The stacked context vectors $\{X_i(1)\}_{i=1}^N,\ldots,\{X_i(T)\}_{i=1}^N\in\mathbb{R}^{dN}$ are independently distributed.

Assumption 4 (Positive minimum eigenvalue of the average of covariance matrices). For each $t$, there exists a constant $\phi^2>0$ such that $\lambda_{\min}\bigl(E\bigl[\frac{1}{N}\sum_{i=1}^N X_i(t)X_i(t)^T\bigr]\bigr)\ge\phi^2$.

Assumptions 1 and 2 are standard in the stochastic bandit literature [Agrawal and Goyal, 2013]. We point out that, at a given round $t$, Assumption 3 allows the contexts of different arms, $X_1(t),\ldots,X_N(t)$, to be correlated with each other. Assumption 3 is weaker than the IID assumption, and the IID condition is considered by Goldenshluger and Zeevi [2013] and Bastani and Bayati [2020]. As Bastani and Bayati [2020] point out, the IID assumption is reasonable in some practical settings, including clinical trials, where health outcomes of patients are independent of those of other patients. Both Goldenshluger and Zeevi [2013] and Bastani and Bayati [2020] address the problem where the contexts are equal across all arms, i.e., $X(t)=X_1(t)=\cdots=X_N(t)$, while our work admits different contexts over the arms. Assumption 4 guarantees that the average of the covariance matrices of the contexts over the arms is well-behaved, so that the inverse of the sample covariance matrix is bounded in spectral norm. This assumption helps control the estimation error of $\beta$ in linear regression models. Similar assumptions are adopted in existing works in the bandit setting [Goldenshluger and Zeevi, 2013, Amani et al., 2019, Li et al., 2017, Bastani and Bayati, 2020].

3.2 Doubly robust estimator

To describe the contextual bandit DR estimator, let $\pi_i(t) := P(a_t = i\,|\,\mathcal{H}_t) > 0$ be the probability of selecting arm $i$ at round $t$. We define the DR pseudo-reward as

$$Y_i^{DR}(t) = \left(1 - \frac{\mathbb{I}(i=a_t)}{\pi_i(t)}\right) X_i(t)^T\check\beta_t + \frac{\mathbb{I}(i=a_t)}{\pi_i(t)}\, Y_{a_t}(t), \qquad (1)$$

for some imputation estimator $\check\beta_t$ depending on $\mathcal{H}_t$. Background on missing data methods and the derivation of the DR pseudo-reward are provided in the supplementary material.
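To make the construction in (1) concrete, the following minimal sketch computes the DR pseudo-rewards for one round. It is an illustration only, not the paper's implementation: the array layout and the function name `dr_pseudo_rewards` are ours, and `beta_check` stands for the imputation estimator $\check\beta_t$.

```python
import numpy as np

def dr_pseudo_rewards(X, chosen_arm, reward, pi_chosen, beta_check):
    """Sketch of the DR pseudo-reward in (1) for a single round.

    X          : (N, d) array, one context per arm
    chosen_arm : index a_t of the arm that was pulled
    reward     : observed reward Y_{a_t}(t)
    pi_chosen  : selection probability pi_{a_t}(t) > 0 of the chosen arm
    beta_check : (d,) imputation estimate, an approximation of beta
    """
    y_dr = X @ beta_check  # imputed mean reward X_i(t)^T beta_check for every arm
    # For the chosen arm only, (1) reduces to
    # X^T beta_check + (Y - X^T beta_check) / pi, i.e. imputation plus an
    # inverse-probability-weighted correction by the observed residual.
    y_dr[chosen_arm] += (reward - X[chosen_arm] @ beta_check) / pi_chosen
    return y_dr  # (N,) pseudo-rewards, one per arm
```

Because $E[\mathbb{I}(i=a_t)/\pi_i(t)\,|\,\mathcal{H}_t]=1$ when the true selection probability is used, each pseudo-reward is conditionally unbiased for $X_i(t)^T\beta$ regardless of the quality of $\check\beta_t$; this is the property that the estimator in (2) below exploits.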
Now, we propose our new estimator $\hat\beta_t$ with a regularization parameter $\lambda_t$ as below:

$$\hat\beta_t = \left(\sum_{\tau=1}^{t}\sum_{i=1}^{N} X_i(\tau)X_i(\tau)^T + \lambda_t I\right)^{-1}\sum_{\tau=1}^{t}\sum_{i=1}^{N} X_i(\tau)\,Y_i^{DR}(\tau). \qquad (2)$$

Harnessing the pseudo-rewards defined in (1), we can make use of all contexts rather than just the selected contexts. The DR estimator of Kim and Paik [2019] also utilizes all contexts but has a different form from ours. While Kim and Paik [2019] use a Lasso estimator with pseudo-rewards aggregated over all arms, we use a ridge regression estimator with the pseudo-rewards in (1), which are defined separately for each $i=1,\ldots,N$. This seemingly small but important difference in form paves the way to our unique regret decomposition and the improved regret bound.

3.3 Algorithm

In this subsection, we describe our proposed algorithm, DRTS, which adapts the DR technique to LinTS. DRTS is presented in Algorithm 1. Distinctive features of DRTS compared to LinTS include the novel estimator and the resampling step. At each round $t\ge 1$, the algorithm samples $\tilde\beta_i(t)$ from the distribution $N(\hat\beta_{t-1}, v^2 V_{t-1}^{-1})$ for each $i$ independently. Let $\tilde Y_i(t) := X_i(t)^T\tilde\beta_i(t)$ and $m_t := \arg\max_i \tilde Y_i(t)$. We set $m_t$ as a candidate action and compute $\pi_{m_t}(t) := P(\tilde Y_{m_t}(t) = \max_i \tilde Y_i(t)\,|\,\mathcal{H}_t)$.¹ If $\pi_{m_t}(t) > \gamma$, the arm $m_t$ is selected, i.e., $a_t = m_t$. Otherwise, the algorithm resamples $\tilde\beta_i(t)$ until it finds another arm satisfying $\pi_i(t) > \gamma$, up to a predetermined fixed number of resamples $M_t$. Section A.3 of the supplementary materials discusses issues related to $M_t$, including a suitable choice of $M_t$.

¹This computation is known to be challenging, but by employing the independence among $\tilde\beta_1(t),\ldots,\tilde\beta_N(t)$, we derive an explicit form approximating $\pi_{m_t}(t)$ in Section H.1 of the supplementary materials.

Algorithm 1 Doubly Robust Thompson Sampling for Linear Contextual Bandits (DRTS)

Input: exploration parameter $v>0$, regularization parameter $\lambda>0$, selection probability threshold $\gamma\in[1/(N+1), 1/N)$, imputation estimator $\check\beta_u = f(\{X_{a_\tau}(\tau), Y_{a_\tau}(\tau)\}_{\tau=1}^{u-1})$, maximum number of resampling steps $M_t$.
Set $F_0=0$, $W_0=0$, $\hat\beta_0=0$, and $V_0=\lambda I$.
for $t = 1$ to $T$ do
  Observe contexts $\{X_i(t)\}_{i=1}^N$.
  Sample $\tilde\beta_1(t),\ldots,\tilde\beta_N(t)$ from $N(\hat\beta_{t-1}, v^2 V_{t-1}^{-1})$ independently.
  Compute $\tilde Y_i(t) = X_i(t)^T\tilde\beta_i(t)$ and observe a candidate action $m_t := \arg\max_i \tilde Y_i(t)$.
  Compute $\pi_{m_t}(t) := P(\max_i \tilde Y_i(t) = \tilde Y_{m_t}(t)\,|\,\mathcal{H}_t)$.
  for $l = 1$ to $M_t$ do
    if $\pi_{m_t}(t)\le\gamma$ then sample another $\tilde\beta_1(t),\ldots,\tilde\beta_N(t)$, observe another $m_t$, and update $\pi_{m_t}(t)$;
    else break.
  end for
  Set $a_t = m_t$ and play arm $a_t$. Observe the reward $Y_{a_t}(t)$ and compute $Y_i^{DR}(t)$.
  Update $F_t = F_{t-1} + \sum_{i=1}^N X_i(t)Y_i^{DR}(t)$; $W_t = W_{t-1} + \sum_{i=1}^N X_i(t)X_i(t)^T$; $V_t = W_t + \lambda\sqrt{t}\,I$; $\hat\beta_t = V_t^{-1}F_t$.
  Update $\check\beta_{t+1}$ for the next round.
end for

The resampling step is incorporated to avoid small values of the selection probability, so that the pseudo-reward in (1) is numerically stable. A naive remedy for stabilizing the pseudo-reward is to use $\max\{\pi_i(t),\gamma\}$, but this fails to lead to our regret bound, since it induces bias and also cannot guarantee that the selected arm is in the super-unsaturated set defined in (5) with high probability (for details, see Section 4.2). The resampling step implemented in the proposed algorithm is designed to solve both problems.
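The following Python sketch assembles the main loop of Algorithm 1. It is a simplified rendition under our own interface assumptions: `contexts(t)` and `pull(t, a)` are hypothetical callables returning the round-$t$ contexts and the chosen arm's reward, the DR pseudo-reward of (1) is computed inline, and the selection probability $\pi_{m_t}(t)$ is approximated by plain Monte Carlo rather than by the closed-form approximation derived in the paper's supplementary Section H.1.

```python
import numpy as np

def selection_prob(mu, cov, X, arm, n_mc=1000, rng=None):
    """Monte Carlo stand-in for pi_arm(t): the probability that `arm` attains
    max_i X_i^T beta_i when each beta_i ~ N(mu, cov) independently."""
    rng = np.random.default_rng() if rng is None else rng
    n_arms = X.shape[0]
    betas = rng.multivariate_normal(mu, cov, size=(n_mc, n_arms))  # (n_mc, N, d)
    scores = np.einsum("mnd,nd->mn", betas, X)                     # sampled rewards
    return float(np.mean(scores.argmax(axis=1) == arm))

def drts(contexts, pull, T, lam=1.0, M=10, rng=None):
    """Sketch of DRTS (Algorithm 1) with a ridge imputation estimator."""
    rng = np.random.default_rng() if rng is None else rng
    N, d = contexts(1).shape
    gamma = 1.0 / (N + 1)                                   # threshold in [1/(N+1), 1/N)
    v = (2.0 * np.log(N / (1.0 - gamma * N))) ** -0.5       # exploration scale suggested by Lemma 2
    F, W = np.zeros(d), np.zeros((d, d))
    beta_hat, beta_check = np.zeros(d), np.zeros(d)
    A_sel, b_sel = lam * np.eye(d), np.zeros(d)             # ridge on chosen pairs (imputation)
    for t in range(1, T + 1):
        X = contexts(t)
        V_prev = W + lam * np.sqrt(max(t - 1, 1)) * np.eye(d)
        cov = v ** 2 * np.linalg.inv(V_prev)
        for _ in range(M):                                  # resampling loop
            betas = rng.multivariate_normal(beta_hat, cov, size=N)   # one sample per arm
            m = int(np.argmax(np.sum(X * betas, axis=1)))            # candidate arm m_t
            pi_m = selection_prob(beta_hat, cov, X, m, rng=rng)
            if pi_m > gamma:
                break                                       # candidate accepted, a_t = m_t
        a = m
        y = pull(t, a)
        # DR pseudo-rewards of (1); the small floor on pi_m is a numerical safeguard
        # for this sketch only (the paper's Section A.3 treats exhausted resampling).
        y_dr = X @ beta_check
        y_dr[a] += (y - X[a] @ beta_check) / max(pi_m, 1e-3)
        F += X.T @ y_dr
        W += X.T @ X
        beta_hat = np.linalg.solve(W + lam * np.sqrt(t) * np.eye(d), F)  # beta_hat_t = V_t^{-1} F_t
        A_sel += np.outer(X[a], X[a])
        b_sel += y * X[a]
        beta_check = np.linalg.solve(A_sel, b_sel)          # imputation estimator for round t + 1
    return beta_hat
```

Note that, as in the paper, the sketch never truncates $\pi_i(t)$; the resampling loop is what keeps the pseudo-rewards numerically stable.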
4 Theoretical results

Our theoretical results are organized as follows. In Section 4.1, we provide the main result, the cumulative regret bound of $O(\phi^{-2}\sqrt{T})$ for DRTS. The main thrust in deriving the regret bound is the definition of super-unsaturated arms. In Section 4.2, we introduce the definition of super-unsaturated arms and show how it admits a novel decomposition of the regret into two additive terms as in (6). In Section 4.3, we bound each term of the decomposed regret (6). The first term is the estimation error, and Theorem 3 gives its bound. In the course of proving Theorem 3, we need Lemma 4, which plays a role similar to the self-normalized theorem of Abbasi-Yadkori et al. [2011]. We conclude the section by presenting Lemma 6 and bounding the second term of (6).

4.1 An improved regret bound

Theorem 1 provides the regret bound of DRTS in terms of the minimum eigenvalue, without $d$.

Theorem 1. Suppose that Assumptions 1-4 hold. If $\check\beta_t$ in Algorithm 1 satisfies $\|\check\beta_t-\beta\|_2\le b$ for a constant $b>0$, for all $t=1,\ldots,T$, then with probability $1-2\delta$, the cumulative regret of the DRTS algorithm by time $T$ is bounded by

$$R(T) \le 2 + \frac{4C_{b,\sigma}}{\phi^2}\sqrt{T\log\frac{12T^2}{\delta}}, \qquad (3)$$

where $C_{b,\sigma}$ is a constant which depends only on $b$ and $\sigma$.

The bound (3) has a rate of $O(\phi^{-2}\sqrt{T})$. The relationship between the dimension $d$ and the minimum eigenvalue $\phi^2$ can be seen from

$$\lambda_{\min}\left(E\left[\frac{1}{N}\sum_{i=1}^N X_i(t)X_i(t)^T\right]\right) \le \frac{1}{d}\,\mathrm{Tr}\left(E\left[\frac{1}{N}\sum_{i=1}^N X_i(t)X_i(t)^T\right]\right) = \frac{1}{Nd}\sum_{i=1}^N E\left[\|X_i(t)\|_2^2\right] \le \frac{1}{d}.$$

This implies $\phi^{-2}\ge d$,² but there are many practical scenarios in which $\phi^{-2}=O(d)$ holds. Bastani et al. [2021] identify such examples, including the uniform distribution and truncated multivariate normal distributions. When the context has the uniform distribution on the unit ball, $\phi^{-2}=d+2$. When the context has a truncated multivariate normal distribution with mean $0$ and covariance $\Sigma$, we can set $\phi^{-2}=(d+2)\exp\bigl(\frac{1}{2\lambda_{\min}(\Sigma)}\bigr)$. For more examples, we refer to Bastani et al. [2021]. Furthermore, regardless of the distribution, $\phi^{-2}=O(d)$ holds when the correlation structure has row sums of off-diagonals that do not depend on the dimension, for example AR(1), tri-diagonal, or block-diagonal matrices. In these scenarios, the regret bound in (3) becomes $O(d\sqrt{T})$. Compared to the previous bounds for LinTS [Agrawal and Goyal, 2014, Abeille et al., 2017], we obtain a better regret bound by a factor of $\sqrt{d}$ in the identified practical cases.

²Some previous works assume $\phi^{-2}=O(1)$ even when $\|X_i(t)\|_2\le 1$ (e.g., Li et al. [2017]). As pointed out by Ding et al. [2021], this assumption is unrealistic and the reported regret bound should be multiplied by $O(d)$.

As for the imputation estimator $\check\beta_t$, we assume that $\|\check\beta_t-\beta\|_2\le b$, where $b$ is an absolute constant. We suggest two cases which guarantee this assumption. First, if a biased estimator is used, we can rescale the estimator so that its $\ell_2$-norm is bounded by some constant $C>0$. Then $\|\check\beta_t-\beta\|_2 \le \|\check\beta_t\|_2 + \|\beta\|_2 \le C+1$, and we can take $b=C+1$. Second, consistent estimators such as the ridge estimator or the least squares estimator satisfy the condition, since $\|\check\beta_t-\beta\|_2 = O(d\sqrt{\log t/t})$. The factor $d$ is cancelled out when $t\ge t_d$, where $t_d$ is the minimum integer satisfying $\log t/t\le d^{-2}$. In these two cases, we can find a constant $b$ which satisfies the assumption on the imputation estimator $\check\beta_t$.
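The identity $\phi^{-2}=d+2$ for contexts uniform on the unit ball is easy to check numerically. The snippet below is a quick Monte Carlo sanity check for a single arm's context distribution; the sampler `uniform_ball` is our own helper, not part of the paper.

```python
import numpy as np

def uniform_ball(n, d, rng):
    """Draw n points uniformly from the d-dimensional unit ball."""
    g = rng.standard_normal((n, d))
    g /= np.linalg.norm(g, axis=1, keepdims=True)   # uniform direction on the sphere
    r = rng.random(n) ** (1.0 / d)                  # radius with density proportional to r^(d-1)
    return g * r[:, None]

rng = np.random.default_rng(0)
for d in (5, 20, 30):
    X = uniform_ball(200_000, d, rng)
    Sigma = X.T @ X / X.shape[0]                    # empirical E[X X^T]
    phi2 = np.linalg.eigvalsh(Sigma).min()
    print(d, round(1.0 / phi2, 2), d + 2)           # 1 / phi^2 should be close to d + 2
```

Since $E[XX^T]=I/(d+2)$ for this distribution, the printed estimate of $1/\phi^2$ hovers around $d+2$, matching the $\phi^{-2}=O(d)$ regime in which (3) reduces to $O(d\sqrt{T})$.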
4.2 Super-unsaturated arms and a novel regret decomposition

The key element in deriving (3) is to decompose the regret into two additive terms as in (6). To make such a decomposition utilizable, we need to define a novel set of arms called super-unsaturated arms, which replaces the role of the unsaturated arms in Agrawal and Goyal [2014]. The super-unsaturated arms are formulated so that the chosen arm is included in this set with high probability. For each $i$ and $t$, let $\Delta_i(t) := X_{a_t^*}(t)^T\beta - X_i(t)^T\beta$. Define $A_t := \sum_{\tau=1}^t X_{a_\tau}(\tau)X_{a_\tau}(\tau)^T + \lambda I$ and $V_t := \sum_{\tau=1}^t\sum_{i=1}^N X_i(\tau)X_i(\tau)^T + \lambda_t I$. For the sake of contrast, recall the definition of unsaturated arms by Agrawal and Goyal [2014],

$$U_t := \left\{ i : \Delta_i(t) \le g_t \|X_i(t)\|_{A_{t-1}^{-1}} \right\}, \qquad (4)$$

where $g_t := C\sqrt{d\log(t/\delta)}\,\min\{\sqrt{d}, \sqrt{\log N}\}$ for some constant $C>0$. This $g_t$ is constructed to ensure that there exists a positive lower bound for the probability that the selected arm is unsaturated. In place of (4), we define the set of super-unsaturated arms for each round $t$ by

$$\mathcal{N}_t := \left\{ i : \Delta_i(t) \le 2\|\hat\beta_{t-1}-\beta\|_2 + \sqrt{\|X_{a_t^*}(t)\|_{V_{t-1}^{-1}}^2 + \|X_i(t)\|_{V_{t-1}^{-1}}^2} \right\}. \qquad (5)$$

While $g_t\|X_i(t)\|_{A_{t-1}^{-1}}$ in (4) is normalized with the selected contexts only, the second term on the right-hand side of (5) is normalized with all contexts, including $X_{a_t^*}(t)$, the context of the optimal arm. This bound on $\Delta_i(t)$ plays a crucial role in bounding the regret with the novel decomposition in (6). The following lemma gives a lower bound on the probability that the candidate arm is super-unsaturated.

Lemma 2. For each $t$, let $m_t := \arg\max_i \tilde Y_i(t)$ and let $\mathcal{N}_t$ be the set of super-unsaturated arms defined in (5). For any given $\gamma\in[1/(N+1), 1/N)$, set $v = (2\log(N/(1-\gamma N)))^{-1/2}$. Then $P(m_t\in\mathcal{N}_t\,|\,\mathcal{H}_t)\ge 1-\gamma$.

Lemma 2 directly contributes to the removal of $\sqrt{d}$ from the hyperparameter $v$. In Agrawal and Goyal [2014], to prove a lower bound on $P(a_t\in U_t\,|\,\mathcal{H}_t)$, it is required to set $v=\sqrt{9d\log(t/\delta)}$, which carries an extra factor of $\sqrt{d}$. In contrast, Lemma 2 shows that $v$ does not need to depend on $d$, owing to the definition of super-unsaturated arms in (5). In this way, we obtain a lower bound on $P(m_t\in\mathcal{N}_t\,|\,\mathcal{H}_t)$ without costing an extra $\sqrt{d}$.

Using this lower bound, we can show that the resampling scheme allows the algorithm to choose super-unsaturated arms with high probability. For all $i\notin\mathcal{N}_t$,

$$\pi_i(t) := P(m_t = i\,|\,\mathcal{H}_t) \le P\Bigl(\bigcup_{j\notin\mathcal{N}_t}\{m_t=j\}\,\Big|\,\mathcal{H}_t\Bigr) = P(m_t\notin\mathcal{N}_t\,|\,\mathcal{H}_t) \le \gamma,$$

where the last inequality holds due to Lemma 2. Thus, in turn, if $\pi_i(t)>\gamma$, then $i\in\mathcal{N}_t$. This means that $\{i : \pi_i(t)>\gamma\}$ is a subset of $\mathcal{N}_t$ and $\{a_t\in\{i:\pi_i(t)>\gamma\}\}\subseteq\{a_t\in\mathcal{N}_t\}$. Hence, the probability of the event $\{a_t\in\mathcal{N}_t\}$ is greater than the probability of sampling any arm which satisfies $\pi_i(t)>\gamma$. Therefore, with resampling, the event $\{a_t\in\mathcal{N}_t\}$ occurs with high probability (see Section A of the supplementary materials for details). When the algorithm chooses an arm from the super-unsaturated set, i.e., when $a_t\in\mathcal{N}_t$ happens, (5) implies

$$\Delta_{a_t}(t) \le 2\|\hat\beta_{t-1}-\beta\|_2 + \sqrt{\|X_{a_t^*}(t)\|_{V_{t-1}^{-1}}^2 + \|X_{a_t}(t)\|_{V_{t-1}^{-1}}^2}. \qquad (6)$$

By definition, $\Delta_{a_t}(t) = \mathrm{regret}(t)$, so the regret at round $t$ can be expressed through the two additive terms, which presents a stark contrast with the multiplicative decomposition of the regret in Agrawal and Goyal [2014]. In Section 4.3 we show how each term can be bounded at a separate rate.
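The set $\mathcal{N}_t$ involves the unknown $\beta$, so it is an analysis device rather than a quantity the algorithm computes. Still, writing it out makes clear how the $V_{t-1}^{-1}$-weighted norms of all contexts, including the optimal arm's, enter the definition. The sketch below evaluates (5) for one round given all quantities; the function name and interface are ours.

```python
import numpy as np

def super_unsaturated_set(X, beta, beta_hat, V_prev, opt_arm):
    """Indices of the super-unsaturated arms N_t in (5), for analysis purposes.

    X        : (N, d) contexts at round t
    beta     : (d,) true parameter (unknown in practice)
    beta_hat : (d,) DR estimator from the previous round
    V_prev   : (d, d) regularized Gram matrix V_{t-1}
    opt_arm  : index of the optimal arm a_t^*
    """
    V_inv = np.linalg.inv(V_prev)
    mean_rewards = X @ beta
    delta = mean_rewards[opt_arm] - mean_rewards           # gaps Delta_i(t)
    w = np.einsum("id,dk,ik->i", X, V_inv, X)              # ||X_i(t)||^2 in the V^{-1} norm
    rhs = 2.0 * np.linalg.norm(beta_hat - beta) + np.sqrt(w[opt_arm] + w)
    return np.flatnonzero(delta <= rhs)
```

With the resampling step, the chosen arm lies in this set with high probability, which is exactly what licenses the additive bound (6).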
4.3 Bounds for the cumulative regret

We first bound the leading term of (6) and introduce a novel estimation error bound, free of $d$, for the contextual bandit DR estimator.

Theorem 3 (A dimension-free estimation error bound for the contextual bandit DR estimator). Suppose Assumptions 1-4 hold. For each $t=1,\ldots,T$, let $\check\beta_t$ be any $\mathcal{H}_t$-measurable estimator satisfying $\|\check\beta_t-\beta\|_2\le b$ for some constant $b>0$. For each $i$ and $t$, assume that $\pi_i(t)>0$ and that there exists $\gamma\in[1/(N+1), 1/N)$ such that $\pi_{a_t}(t)>\gamma$. Given any $\delta\in(0,1)$, set $\lambda_t = \sqrt{t\log\frac{12t^2}{\delta}}$. Then with probability at least $1-\delta$, the estimator $\hat\beta_t$ in (2) satisfies

$$\|\hat\beta_t-\beta\|_2 \le \frac{C_{b,\sigma}}{\phi^2}\sqrt{\frac{\log(12t^2/\delta)}{t}} \qquad (7)$$

for all $t=1,\ldots,T$, where the constant $C_{b,\sigma}$ depends only on $b$ and $\sigma$.

In the bandit literature, estimation error bounds typically include a term involving $d$ which emerges from using the following two results: (i) the self-normalized bound for vector-valued martingales [Abbasi-Yadkori et al., 2011, Theorem 1], and (ii) the concentration inequality for the covariance matrix [Tropp, 2015, Corollary 5.2]. Instead of using (i) and (ii), we develop the two dimension-free bounds in Lemmas 4 and 6 to replace (i) and (ii), respectively. With these two lemmas, we eliminate the dependence on $d$ and express the estimation error bound with $\phi^2$ alone.

Lemma 4 (A dimension-free bound for vector-valued martingales). Let $\{\mathcal{F}_\tau\}_{\tau=1}^t$ be a filtration and $\{\eta(\tau)\}_{\tau=1}^t$ be a real-valued stochastic process such that $\eta(\tau)$ is $\mathcal{F}_\tau$-measurable. Let $\{X(\tau)\}_{\tau=1}^t$ be an $\mathbb{R}^d$-valued stochastic process where $X(\tau)$ is $\mathcal{F}_{\tau-1}$-measurable and $\|X(\tau)\|_2\le 1$. Assume that $\{\eta(\tau)\}_{\tau=1}^t$ are $\sigma$-sub-Gaussian as in Assumption 2. Then with probability at least $1-\delta/t^2$, there exists an absolute constant $C>0$ such that

$$\left\|\sum_{\tau=1}^t \eta(\tau)X(\tau)\right\|_2 \le C\sigma\sqrt{t\log\frac{t^2}{\delta}}. \qquad (8)$$

Compared to Theorem 1 of Abbasi-Yadkori et al. [2011], our bound (8) does not involve $d$, yielding a dimension-free bound for vector-valued martingales. However, the bound (8) has a $\sqrt{t}$ term which comes from using the $\ell_2$-norm instead of the self-normalized norm $\|\cdot\|_{V_t^{-1}}$. To complete the proof of Theorem 3, we need the following condition:

$$\lambda_{\min}(V_t) \ge ct, \qquad (9)$$

for some constant $c>0$. Li et al. [2017] point out that satisfying (9) is challenging. To overcome this difficulty, Amani et al. [2019] and Bastani and Bayati [2020] use an assumption on the covariance matrix of the contexts together with a matrix concentration inequality to prove (9), described as follows.

Proposition 5 [Tropp, 2015, Theorem 5.1.1]. Let $P(1),\ldots,P(t)\in\mathbb{R}^{d\times d}$ be independent symmetric random matrices such that $\lambda_{\min}(P(\tau))\ge 0$, $\lambda_{\max}(P(\tau))\le L$ and $\lambda_{\min}(E[P(\tau)])\ge\phi^2$, for all $\tau=1,2,\ldots,t$. Then, for $\epsilon\in[0,1)$,

$$P\left(\lambda_{\min}\left(\sum_{\tau=1}^t P(\tau)\right)\le(1-\epsilon)\phi^2 t\right)\le d\exp\left(-\frac{\epsilon^2\phi^2 t}{2L}\right). \qquad (10)$$

To prove (9) using (10) with probability at least $1-\delta$, for $\delta\in(0,1)$, it requires $t\ge\frac{8L}{\phi^2}\log\frac{d}{\delta}$. Thus, one can use (10) only after $O(\phi^{-2}\log d)$ rounds. Due to this requirement, Bastani and Bayati [2020] implement a forced sampling technique for $O(N^2d^4(\log d)^2)$ rounds, and Amani et al. [2019] force the algorithm to select arms randomly for $O(\phi^{-2}\log d)$ rounds. These mandatory exploration phases empirically prevent the algorithm from choosing the optimal arm. An alternative form of the matrix Chernoff inequality for adapted sequences is Theorem 3 in Tropp [2011], but that bound also has a multiplicative factor of $d$. Instead of applying Proposition 5 to prove (9), we utilize a novel dimension-free concentration inequality stated in the following lemma.

Lemma 6 (A dimension-free concentration bound for symmetric bounded matrices). Let $\|A\|_F$ denote the Frobenius norm of a matrix $A$. Let $\{P(\tau)\}_{\tau=1}^t\subset\mathbb{R}^{d\times d}$ be symmetric matrices adapted to a filtration $\{\mathcal{F}_\tau\}_{\tau=1}^t$. For each $\tau=1,\ldots,t$, suppose that $\|P(\tau)\|_F\le c$ for some $c>0$ and $\lambda_{\min}(E[P(\tau)\,|\,\mathcal{F}_{\tau-1}])\ge\phi^2>0$, almost surely. For any given $\delta\in(0,1)$, set $\lambda_t = c\sqrt{t\log\frac{t^2}{\delta}}$. Then with probability at least $1-\delta/t^2$,

$$\lambda_{\min}\left(\sum_{\tau=1}^t P(\tau) + \lambda_t I\right)\ge\phi^2 t. \qquad (11)$$

Lemma 6 shows that setting $\lambda_t$ at the $\sqrt{t}$ rate guarantees (9) for all $t\ge 1$. We incorporate the $\lambda_t$ stated in Lemma 6 into our estimator (2), and show in Section 5 that the DR estimator regularized with this $\lambda_t$ outperforms estimators from other contextual bandit algorithms in early rounds. We obtain the bounds free of $d$ in Lemmas 4 and 6 mainly by applying Lemma 2.3 of Lee et al. [2016], which states that any Hilbert-space martingale can be reduced to $\mathbb{R}^2$. Thus, we can project the vector-valued (or matrix) martingales to $\mathbb{R}^2$-martingales, reducing the dimension from $d$ (or $d^2$) to $2$. Then we apply the Azuma-Hoeffding inequality just twice, instead of $d$ times. In this way, Lemma 6 provides a novel dimension-free bound for the covariance matrix. Lemmas 4 and 6 can be applied to other works to improve existing bounds. For example, using these lemmas, the estimation error bound of Bastani and Bayati [2020] can be improved by a factor of $\log d$. Proposition EC.1 of Bastani and Bayati [2020] provides an estimation error bound for the ordinary least squares estimator by using Proposition 5 and bounding each of the $d$ coordinates. By applying Lemmas 4 and 6, one does not have to deal with each coordinate and can eliminate the dependence on $d$.
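A small simulation illustrates the point of Lemma 6: with a $\sqrt{t}$-rate regularizer, the minimum eigenvalue of the regularized Gram matrix grows linearly in $t$ from the very first rounds, with no forced-sampling phase. The context distribution below is our own choice for illustration and is not the one used in Section 5.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, T = 10, 5, 2000
V = np.zeros((d, d))
ratios = []
for t in range(1, T + 1):
    X = rng.standard_normal((N, d)) / np.sqrt(d)   # N contexts per round, E[X X^T] = I / d
    V += X.T @ X                                   # running sum of N rank-one matrices per round
    lam_t = np.sqrt(t)                             # sqrt(t)-rate regularization, as in Lemma 6
    ratios.append(np.linalg.eigvalsh(V + lam_t * np.eye(d)).min() / t)
print(ratios[9], ratios[99], ratios[-1])           # stabilizes near N * phi^2 = N / d = 0.5
```

The printed ratios stay bounded away from zero already at small $t$, which is the behavior condition (9) requires.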
Using Lemma 6, we can bound the second term of the regret in (6) as follows. For $j=1,\ldots,N$,

$$\|X_j(t)\|_{V_{t-1}^{-1}} \le \|X_j(t)\|_2\sqrt{\|V_{t-1}^{-1}\|_2} \le \lambda_{\min}(V_{t-1})^{-1/2} \le \frac{1}{\sqrt{\phi^2 N(t-1)}}. \qquad (12)$$

Finally, we are ready to bound $\mathrm{regret}(t)$ in (6).

Lemma 7. Suppose the assumptions in Theorem 1 hold. Then with probability at least $1-2\delta$,

$$\mathrm{regret}(t) \le \frac{2C_{b,\sigma}}{\phi^2}\sqrt{\frac{\log(12(t-1)^2/\delta)}{t-1}} + \sqrt{\frac{2}{\phi^2 N(t-1)}}, \qquad (13)$$

for all $t=2,\ldots,T$.

Proof. Since $a_t$ is shown to be super-unsaturated with high probability, we can use (6) to obtain

$$\mathrm{regret}(t) \le 2\|\hat\beta_{t-1}-\beta\|_2 + \sqrt{\|X_{a_t^*}(t)\|_{V_{t-1}^{-1}}^2 + \|X_{a_t}(t)\|_{V_{t-1}^{-1}}^2},$$

for all $t=2,\ldots,T$. The first term is bounded by Theorem 3 and the second term by (12). Note that to prove Theorem 1, Lemma 6 is invoked, and the event (11) of Lemma 6 is a subset of that in (7). Therefore (13) holds with probability at least $1-2\delta$ instead of $1-3\delta$. Details are given in the supplementary materials.

Lemma 7 shows that the regret at round $t$ does not exceed a $O(\phi^{-2}t^{-1/2})$ bound when $a_t\in\mathcal{N}_t$, which our algorithm guarantees with high probability via resampling (see Section A.3 for details). This concludes the proof of Theorem 1.

5 Simulation studies

In this section, we compare the performances of three algorithms: (i) LinTS [Agrawal and Goyal, 2013], (ii) BLTS [Dimakopoulou et al., 2019], and (iii) the proposed DRTS. We use simulated data described as follows. The number of arms $N$ is set to 10 or 20, and the dimension of contexts $d$ is set to 20 or 30. For each element of the contexts, $j=1,\ldots,d$, we generate $[X_{1j}(t),\ldots,X_{Nj}(t)]$ from a normal distribution $N(\mu_N, V_N)$ with mean $\mu_{10}=[-10,-8,\ldots,-2,2,\ldots,8,10]^T$ or $\mu_{20}=[-20,-18,\ldots,-2,2,\ldots,18,20]^T$, and the covariance matrix $V_N\in\mathbb{R}^{N\times N}$ has $V_N(i,i)=1$ for every $i$ and $V_N(i,k)=\rho$ for every $i\ne k$. We set $\rho=0.5$ and truncate the sampled contexts to satisfy $\|X_i(t)\|_2\le 1$. To generate the stochastic rewards, we sample $\eta_i(t)$ independently from $N(0,1)$. Each element of $\beta$ follows the uniform distribution $U(-1/\sqrt{d}, 1/\sqrt{d})$.

All three algorithms have $v$ as an input parameter, which controls the variance of $\tilde\beta_i(t)$. BLTS and DRTS require a positive threshold $\gamma$ which truncates the selection probability. We consider $v\in\{0.001, 0.01, 0.1, 1\}$ for all three algorithms and $\gamma\in\{0.01, 0.05, 0.1\}$ for BLTS, and set $\gamma=1/(N+1)$ in DRTS. We then report the minimum regrets among all combinations. The regularization parameter is $\lambda_t=\sqrt{t}$ in DRTS and $\lambda_t=1$ in both LinTS and BLTS. To obtain the imputation estimator $\check\beta_t$ required in DRTS, we use ridge regression with $\{X_{a_\tau}(\tau), Y_{a_\tau}(\tau)\}_{\tau=1}^{t-1}$ at each round $t$. Other implementation details are in the supplementary materials.
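For reference, the data-generating process of Section 5 can be sketched as follows. The exact truncation rule is not spelled out in the text, so rescaling each context onto the unit ball is our assumption, and the helper `make_contexts` is ours.

```python
import numpy as np

def make_contexts(N, d, rho=0.5, rng=None):
    """Sketch of the Section 5 context generator: for each coordinate j, the vector
    [X_{1j}(t), ..., X_{Nj}(t)] is drawn from N(mu_N, V_N) with unit variances and
    common correlation rho; each context is then rescaled so that ||X_i(t)||_2 <= 1."""
    rng = np.random.default_rng() if rng is None else rng
    half = np.arange(2, N + 1, 2, dtype=float)               # 2, 4, ..., N (N assumed even)
    mu = np.concatenate([-half[::-1], half])                 # e.g. mu_10 = [-10, ..., -2, 2, ..., 10]
    V = np.full((N, N), rho) + (1.0 - rho) * np.eye(N)
    X = rng.multivariate_normal(mu, V, size=d).T             # (N, d): row i is arm i's context
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1.0)                        # truncation onto the unit ball (our choice)

rng = np.random.default_rng(0)
d, N = 20, 10
beta = rng.uniform(-1.0 / np.sqrt(d), 1.0 / np.sqrt(d), size=d)   # elements of beta ~ U(-1/sqrt(d), 1/sqrt(d))
X = make_contexts(N, d, rng=rng)
rewards = X @ beta + rng.standard_normal(N)                       # Y_i(t) = X_i(t)^T beta + eta_i(t)
```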
Figure 1: A comparison of cumulative regrets and estimation errors of LinTS, BLTS and DRTS. Each line shows the averaged cumulative regrets (estimation errors, resp.), and the shaded area in the two right figures represents the standard deviation over 10 repeated experiments.

Figure 1 shows the average of the cumulative regrets and the estimation error $\|\hat\beta_t-\beta\|_2$ of the three algorithms based on 10 replications. The figures in the two left columns show the average cumulative regret as a function of the number of rounds with the best set of hyperparameters for each algorithm. The total number of rounds is $T=20000$. The figures in the third column show the average estimation error $\|\hat\beta_t-\beta\|_2$. In the early stage, the estimation errors of LinTS and BLTS increase rapidly, while that of DRTS is stable. The stability of the DR estimator plausibly follows from using the full contexts and the regularization parameter $\lambda_t=\sqrt{t}$. This yields a large margin in estimation error among LinTS, BLTS and DRTS, especially when the dimension is large.

6 Conclusion

In this paper, we propose a novel algorithm for stochastic contextual linear bandits. Viewing the bandit problem as a missing data problem, we use the DR technique to employ all contexts, including those that are not chosen. With the definition of super-unsaturated arms, we show a regret bound which depends only on the minimum eigenvalue of the sample covariance matrices. This new bound has a $O(d\sqrt{T})$ rate in many practical scenarios, improving on the previous LinTS regret bounds by a factor of $\sqrt{d}$. Simulation studies show that the proposed algorithm performs better than other LinTS algorithms when the dimension is large.

Acknowledgements

This work is supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT, No.2020R1A2C1A01011950) (Wonyoung Kim and Myunghee Cho Paik), and by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2020-0-01336, Artificial Intelligence Graduate School Program (UNIST)) and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT, No.2021R1G1A100980111) (Gi-Soo Kim). Wonyoung Kim was also supported by the Hyundai Chung Mong-koo foundation.

References

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
Marc Abeille, Alessandro Lazaric, et al. Linear Thompson sampling revisited. Electronic Journal of Statistics, 11(2):5165–5197, 2017.
Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135, 2013.
Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs, 2014.
Sanae Amani, Mahnoosh Alizadeh, and Christos Thrampoulidis. Linear stochastic bandits under safety constraints. In Advances in Neural Information Processing Systems, pages 9252–9262, 2019.
Kazuoki Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, Second Series, 19(3):357–367, 1967.
Heejung Bang and James M. Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005.
Hamsa Bastani and Mohsen Bayati. Online decision making with high-dimensional covariates. Operations Research, 68(1):276–294, 2020.
Hamsa Bastani, Mohsen Bayati, and Khashayar Khosravi. Mostly exploration-free algorithms for contextual bandits. Management Science, 67(3):1329–1349, 2021.
Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2249–2257. Curran Associates, Inc., 2011.
Fan Chung and Linyuan Lu. Concentration inequalities and martingale inequalities: a survey. Internet Mathematics, 3(1):79–127, 2006.
Maria Dimakopoulou, Zhengyuan Zhou, Susan Athey, and Guido Imbens. Balanced linear contextual bandits.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3445–3453, 2019.
Qin Ding, Cho-Jui Hsieh, and James Sharpnack. An efficient algorithm for generalized linear bandit: Online stochastic gradient descent and Thompson sampling. In International Conference on Artificial Intelligence and Statistics, pages 1585–1593. PMLR, 2021.
Alexander Goldenshluger and Assaf Zeevi. A linear response bandit problem. Stochastic Systems, 3(1):230–261, 2013.
Nima Hamidi and Mohsen Bayati. On worst-case regret of linear Thompson sampling, 2020.
Gi-Soo Kim and Myunghee Cho Paik. Doubly-robust Lasso bandit. In Advances in Neural Information Processing Systems, pages 5869–5879, 2019.
James R. Lee, Yuval Peres, and Charles K. Smart. A Gaussian upper bound for martingale small-ball probabilities. Annals of Probability, 44(6):4184–4197, 2016.
Lihong Li, Yu Lu, and Dengyong Zhou. Provably optimal algorithms for generalized linear contextual bandits. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 2071–2080. JMLR.org, 2017.
James M. Robins, Andrea Rotnitzky, and Lue Ping Zhao. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866, 1994.
William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
Joel A. Tropp. User-friendly tail bounds for matrix martingales. Technical report, California Institute of Technology, Pasadena, 2011.
Joel A. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.
Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.
Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.