# Doubly-Robust Lasso Bandit

Gi-Soo Kim, Department of Statistics, Seoul National University, gisoo1989@snu.ac.kr
Myunghee Cho Paik, Department of Statistics, Seoul National University, myungheechopaik@snu.ac.kr

Contextual multi-armed bandit algorithms are widely used in sequential decision tasks such as news article recommendation systems, web page ad placement algorithms, and mobile health. Most of the existing algorithms have regret proportional to a polynomial function of the context dimension, d. In many applications, however, contexts are high-dimensional with only a sparse subset of size $s_0\,(\ll d)$ being correlated with the reward. We consider the stochastic linear contextual bandit problem and propose a novel algorithm, the Doubly-Robust Lasso Bandit algorithm, which exploits the sparse structure of the regression parameter as in Lasso while blending the doubly-robust technique used in the missing data literature. The high-probability upper bound of the regret incurred by the proposed algorithm does not depend on the number of arms and scales with $\log(d)$ instead of a polynomial function of d. The proposed algorithm shows good performance when contexts of different arms are correlated and requires fewer tuning parameters than existing methods.

## 1 Introduction

Many sequential decision problems can be framed as the multi-armed bandit (MAB) problem (Robbins, 1952; Lai and Robbins, 1985), where a learner sequentially pulls arms and receives random rewards that possibly differ by arms. While the reward compensation mechanism is unknown, the learner can adapt his or her decisions to the past reward feedback so as to maximize the sum of rewards. Since the rewards of the unchosen arms remain unobserved, the learner should carefully balance between "exploitation", pulling the best arm based on the information accumulated so far, and "exploration", pulling the arm that will assist in future choices although it does not seem to be the best option at the moment. Application areas include mobile healthcare systems (Tewari and Murphy, 2017), web page ad placement algorithms (Langford et al., 2008), news article placement algorithms (Li et al., 2010), revenue management (Ferreira et al., 2018), marketing (Schwartz et al., 2017), and recommendation systems (Kawale et al., 2015).

Contextual MAB algorithms make use of side information, called context, given in the form of finite-dimensional covariates. For example, in the news article recommendation example, information on the visiting user as well as the articles is given in the form of a context vector. In 2010, the Yahoo! team (Li et al., 2010) proposed a contextual MAB algorithm which assumes a linear relationship between the context and the reward of each arm. This algorithm achieved a 12.5% click lift compared to a context-free MAB algorithm. Many other bandit algorithms have been proposed for linear reward models (Auer, 2002; Dani et al., 2008; Chu et al., 2011; Abbasi-Yadkori et al., 2011; Agrawal and Goyal, 2013). The aforementioned methods require that the dimension of the context not be too large, because in these algorithms the cumulative gap between the rewards of the optimal arm and the chosen arm, namely the regret, is shown to be proportional to a polynomial function of the dimension of the context, d.
In modern applications, however, it is often the case that the web or mobile-based contextual variables are high-dimensional. Li et al. (2010) applied a dimension reduction method (Chu et al., 2009) to the context variables before applying their bandit algorithm to the Yahoo! news article recommendation log data. In this paper, we introduce a novel approach to the case where the contextual variables are high-dimensional but only a sparse subset of size $s_0$ is correlated with the reward. We specifically consider the stochastic linear contextual bandit (LCB) problem, where the N arms are represented by N different contexts $b_1(t), \dots, b_N(t) \in \mathbb{R}^d$ at a specific time t, and the rewards $r_1(t), \dots, r_N(t) \in \mathbb{R}$ are controlled by a single linear regression parameter $\beta \in \mathbb{R}^d$, i.e., $r_i(t)$ has mean $b_i(t)^T\beta$ for $i = 1, \dots, N$. The LCB problem is distinguished from the contextual bandit problem with linear rewards (CBL), where the context $b(t) \in \mathbb{R}^d$ does not differ by arms but the reward of the i-th arm is determined by an arm-specific parameter $\beta_i \in \mathbb{R}^d$, i.e., $r_i(t)$ has mean $b(t)^T\beta_i$ for $i = 1, \dots, N$. When the number of arms is large, the CBL approach is not practical due to the large number of parameters. Methods for CBL are also not suited to handle the case where the set of arms changes with time. When we recommend a news article or place an advertisement on a web page, the lists of news articles or advertisements change day by day. In this case, it would not be feasible to assign a different parameter to every new incoming item. Therefore, when the number of arms is large and the set of arms changes with time, LCB approaches including the proposed method can be applied while the CBL approaches cannot.

In supervised learning, Lasso (Tibshirani, 1996) is a good tool for estimating the linear regression parameter when the covariates are high-dimensional but only a sparse subset is related to the outcome. However, the fast convergence property of Lasso is guaranteed when the data are i.i.d. and when the observed covariates are not highly correlated, the latter being referred to as the compatibility condition (van de Geer and Bühlmann, 2009). In the contextual bandit setting, these conditions are often violated because the observations are adapted to the past, and the context variables for which the rewards are observed converge to a small region of the whole context distribution as the learner updates its arm selection policy. In the proposed method, we resolve the non-compatibility issue by coalescing methods from the missing data literature. We start from the observation that the bandit problem is a missing data problem, since the rewards for the arms that are not chosen are not observed and hence missing. The difference is that in missing data settings, missingness is controlled by the environment and given to the learner, while in bandit settings the missingness is controlled by the learner. Since the learner controls the missingness, the missing mechanism, or arm selection probability, is known in bandit settings. Given this probability, we can construct unbiased estimates of the rewards for the arms not chosen using the doubly-robust technique (Bang and Robins, 2005), which allows capitalizing on the context information corresponding to the arms that are not chosen. These data are observed and available, but were not utilized by most of the existing contextual bandit algorithms.
We propose the Doubly-Robust Lasso Bandit algorithm, which hinges on the sparse structure as in Lasso (Tibshirani, 1996) but utilizes contexts for unchosen arms by blending the doubly-robust technique (Bang and Robins, 2005) used in the missing data literature. The high-probability upper bound of the regret incurred by the proposed algorithm has order $O\big(s_0\sqrt{T\log(dT)}\big)$, where T is the total number of time steps. We note that the regret does not depend on the number of arms, N, and scales with $\log(d)$ instead of a polynomial function of d. Therefore, the regret of the proposed algorithm is sublinear in T even when $N \ge T$ and d scales with T.

Abbasi-Yadkori et al. (2012), Carpentier and Munos (2012), and Gilton and Willett (2017) considered the sparse LCB problem as well. While the high-probability regret bound of Abbasi-Yadkori et al. (2012) does not depend on N, it is proportional to $\sqrt{d}$ instead of $\log d$, so it is not sublinear in T when d scales with T. Carpentier and Munos (2012) used an explicit exploration phase to identify the support of the regression parameter using techniques from compressed sensing. Their regret bound is tight, scaling with $\log d$, but the algorithm is specific to the case where the set of arms is the unit ball for the $\|\cdot\|_2$ norm and fixed over time. Gilton and Willett (2017) leveraged ideas from linear Thompson Sampling and Relevance Vector Machines (Tipping, 2001). The theoretical results of Gilton and Willett (2017) are weak since they derived the regret bound under the assumption that a sufficiently small superset of the support of the regression parameter is known in advance. Bastani and Bayati (2015) addressed the CBL problem with high-dimensional contexts. They proposed the Lasso Bandit algorithm, which uses Lasso to estimate the parameter of each arm separately. To solve non-compatibility, Bastani and Bayati (2015) used forced-sampling of each arm at predefined time points and maintained two estimators for each arm, one based on the forced samples and the other based on all samples. They derived a regret bound of order $O\big(Ns_0^2[\log T + \log d]^2\big)$, but in terms of expectation rather than high probability. An application of Hoeffding's inequality would give an additional term of order $O(\sqrt{T})$ for the high-probability bound. For the same problem, Wang et al. (2018) proposed the Minimax Concave Penalized (MCP) Bandit algorithm, which uses forced-sampling along with the MCP estimator (Zhang, 2010) and improved the bound of Bastani and Bayati (2015) to $O\big(Ns_0^2[s_0 + \log d]\log T\big)$. The regrets of Bastani and Bayati (2015) and Wang et al. (2018) increase linearly with N due to separate estimation for each arm. When the number of arms N is bigger than T, these algorithms must terminate before the forced-sampling of each arm is completed, and thus may not be practical. In contrast, the proposed method shares information across arms, so its performance does not depend on N. In cases where all three methods are applicable, the proposed algorithm requires one less tuning parameter than Lasso Bandit and MCP Bandit, which is a significant advantage for an online learning algorithm since it is difficult to simultaneously tune the hyperparameters and achieve high reward.

We summarize the main contributions of the paper as follows.

- We propose a new linear contextual MAB algorithm for high-dimensional, sparse reward models. The proposed method is simple and requires fewer tuning parameters than previous works.
- We propose a new estimator for the regression parameter in the reward model, which uses Lasso with the context information of all arms through the doubly-robust technique.
- We prove that the high-probability regret upper bound of the proposed algorithm is $O\big(s_0\sqrt{T\log(dT)}\big)$, which does not depend on N and scales with $\log d$.
- We present experimental results that show the superiority of our method, especially when N is big and the contexts of different arms are correlated.

### 1.1 Related Work

The fast convergence of Lasso for the linear regression parameter was extensively studied in van de Geer (2007), Bickel et al. (2009), and van de Geer and Bühlmann (2009). The doubly-robust technique was originally introduced in Robins et al. (1994) and extensively analyzed in Bang and Robins (2005) under the supervised learning setup. Dudík et al. (2014), Jiang and Li (2016), and Farajtabar et al. (2018) are recent works that incorporate the doubly-robust methodology into reinforcement learning, but under offline learning settings. Dimakopoulou et al. (2018) was the first to blend the doubly-robust technique into the online bandit problem. Their work, which focused on low-dimensional settings, proposed to use the doubly-robust technique as a way of balancing the data over the whole context space, which makes the online regression less prone to bias when the reward model is misspecified.

## 2 Settings and Assumptions

In the MAB setting, the learner is repeatedly faced with N alternative arms, where at time t the i-th arm ($i = 1, \dots, N$) yields a random reward $r_i(t)$ with unknown mean $\theta_i(t)$. We assume that there is a finite-dimensional context vector $b_i(t) \in \mathbb{R}^d$ associated with each arm i at time t and that the mean of $r_i(t)$ depends on $b_i(t)$, i.e., $\theta_i(t) = \theta_t(b_i(t))$, where $\theta_t(\cdot)$ is an arbitrary function. Among the N arms, the learner pulls one arm $a(t)$ and observes the reward $r_{a(t)}(t)$. The optimal arm at time t is $a^*(t) := \mathrm{argmax}_{1 \le i \le N}\{\theta_t(b_i(t))\}$. Let $\mathrm{regret}(t)$ be the difference between the expected reward of the optimal arm and the expected reward of the arm chosen by the learner at time t, i.e.,
$$\mathrm{regret}(t) = \mathbb{E}\big[\, r_{a^*(t)}(t) - r_{a(t)}(t) \,\big|\, \{b_i(t)\}_{i=1}^N,\, a(t) \big] = \theta_t\big(b_{a^*(t)}(t)\big) - \theta_t\big(b_{a(t)}(t)\big).$$
Then, the goal of the learner is to minimize the sum of regrets over T steps, $R(T) := \sum_{t=1}^T \mathrm{regret}(t)$.

We specifically consider a sparse LCB problem, where $\theta_t(b_i(t))$ is linear in $b_i(t)$,
$$\theta_t(b_i(t)) = b_i(t)^T\beta, \quad i = 1, \dots, N, \qquad (1)$$
where $\beta \in \mathbb{R}^d$ is unknown and $\|\beta\|_0 = s_0\,(\ll d)$, with $\|\cdot\|_p$ denoting the $L_p$-norm. Hence, only $s_0$ elements of $\beta$ are assumed to be nonzero. We also make the following assumptions, A1 to A4.

A1. Bounded norms. Without loss of generality, $\|b_i(t)\|_2 \le 1$ and $\|\beta\|_2 \le 1$.

A2. IID Context Assumption. The distribution of the context variables is i.i.d. over time t: $\{b_1(t), \dots, b_N(t)\} \overset{i.i.d.}{\sim} P_b$, where $P_b$ is some distribution over $\mathbb{R}^{N \times d}$. We note in A2 that, given time t, the contexts of different arms are allowed to be correlated.

A3. Compatibility Assumption (van de Geer and Bühlmann, 2009). Let I be a set of indices and $\phi$ a positive constant. Define
$$\mathcal{C}(I, \phi) = \Big\{ M \in \mathbb{R}^{d \times d} : \forall v \in \mathbb{R}^d \text{ such that } \|v_{I^c}\|_1 \le 3\|v_I\|_1, \ \|v_I\|_1^2 \le |I|\,(v^T M v)/\phi^2 \Big\}.$$
Then there exists $\phi_1 > 0$ such that $\Sigma := \mathbb{E}\big[\bar{b}(t)\bar{b}(t)^T\big] \in \mathcal{C}(\mathrm{supp}(\beta), \phi_1)$, where $\bar{b}(t) = \frac{1}{N}\sum_{i=1}^N b_i(t)$ and $\mathrm{supp}(\beta)$ denotes the support of $\beta$. A3 ensures that the Lasso estimate of $\beta$ converges to the true parameter at a fast rate. This assumption is weaker than the restricted eigenvalue assumption (Bickel et al., 2009).
A4. Sub-Gaussian error. The error $\eta_i(t) := r_i(t) - b_i(t)^T\beta$ is R-sub-Gaussian for some $0 < R < O\big(\sqrt[4]{\log T/T}\big)$, i.e., for every $\lambda \in \mathbb{R}$, $\mathbb{E}[\exp(\lambda\eta_i(t))] \le \exp(\lambda^2 R^2/2)$. Assumption A4 is satisfied whenever $r_i(t) \in \big[b_i(t)^T\beta - R,\ b_i(t)^T\beta + R\big]$.

## 3 Doubly-Robust Lasso Bandit

Lemma 11.2 of van de Geer and Bühlmann (2009) shows that when the covariates satisfy the compatibility condition and the noise is i.i.d. Gaussian, the Lasso estimator of the linear regression parameter converges to the true parameter at a fast rate. We restate the lemma.

Lemma 3.1. (Lemma 11.2 of van de Geer and Bühlmann, 2009) Let $x_\tau \in \mathbb{R}^d$ and $y_\tau \in \mathbb{R}$ be random variables with $y_\tau = x_\tau^T\beta + \epsilon_\tau$, $\tau = 1, 2, \dots, t$, where $\beta \in \mathbb{R}^d$, $\|\beta\|_0 = s_0$, and the $\epsilon_\tau$'s are i.i.d. Gaussian with mean zero and variance $R^2$. Let $\hat\beta(t)$ be the Lasso estimator of $\beta$ based on the first t observations, using a regularization parameter of order $\lambda_t = O\big(R\sqrt{\log(d/\delta)/t}\big)$ in the $L_1$ penalty, i.e.,
$$\hat\beta(t) = \mathrm{argmin}_\beta \Big\{ \tfrac{1}{t}\sum_{\tau=1}^t (y_\tau - x_\tau^T\beta)^2 + \lambda_t\|\beta\|_1 \Big\}.$$
If $\hat\Sigma_t := \frac{1}{t}\sum_{\tau=1}^t x_\tau x_\tau^T \in \mathcal{C}(\mathrm{supp}(\beta), \phi)$ for some $\phi > 0$, then for $\delta \in (0, 1)$, with probability at least $1 - \delta$,
$$\|\hat\beta(t) - \beta\|_1 \le \frac{4\lambda_t s_0}{\phi^2}. \qquad (2)$$

Two hurdles arise when incorporating Lemma 3.1 into the contextual MAB setting: (a) (non-i.i.d.) the errors $\epsilon_\tau$ are not i.i.d., and (b) (non-compatibility) the Gram matrix $\hat\Sigma_t$ does not satisfy the compatibility condition even under assumption A3, because the context variables for which the rewards are observed do not evenly represent the whole distribution of the context variables. In the LCB setting, $x_\tau$ corresponds to $b_{a(\tau)}(\tau)$. Since the learner tends to choose the contexts that yield maximum reward as time elapses, the chosen context variables $b_{a(\tau)}(\tau)$ converge to a small region of the whole context space. We discuss remedies to (a) and (b) in Section 3.1 and Section 3.2, respectively.

### 3.1 Lasso Oracle Inequality with non-i.i.d. noise

As a remedy to the non-i.i.d. problem, Bastani and Bayati (2015) proposed a variation of Lemma 3.1 which shows that (2) holds when $\{\epsilon_\tau\}_{\tau=1}^t$ is a martingale difference sequence. We restate their proposition.

Lemma 3.2. (Proposition 1 of Bastani and Bayati, 2015) Let $\mathcal{F}_{\tau-1}$ denote the filtration up to time $\tau - 1$ in the bandit setting, $\mathcal{F}_{\tau-1} = \{x_1, y_1, \dots, x_{\tau-1}, y_{\tau-1}, x_\tau\}$, $\tau = 1, 2, \dots, t$. Suppose the conditions of Lemma 3.1 hold, except that instead of the $\epsilon_\tau$'s being i.i.d. Gaussian, $\epsilon_\tau \,|\, \mathcal{F}_{\tau-1}$ is R-sub-Gaussian for $\tau = 1, \dots, t$. Then for $\delta \in (0, 1)$, with probability at least $1 - \delta$,
$$\|\hat\beta(t) - \beta\|_1 \le \frac{4\lambda_t s_0}{\phi^2}. \qquad (3)$$
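As a quick numerical illustration of the $\ell_1$-rate promised by Lemmas 3.1 and 3.2, the sketch below fits a Lasso on i.i.d. data with a regularization level decaying like $\sqrt{(\log d + \log t)/t}$ and prints the $\ell_1$ estimation error for growing t. It assumes NumPy and scikit-learn; the mapping between $\lambda_t$ and scikit-learn's `alpha` (noted in the comments) is our own translation of the penalized objective above, not something specified in the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
d, s0, R = 100, 5, 0.05
beta = np.zeros(d)
beta[:s0] = rng.uniform(0.0, 1.0, s0)            # sparse true parameter

for t in [200, 800, 3200]:
    X = rng.normal(size=(t, d))                  # i.i.d. covariates (plain Lasso setup)
    y = X @ beta + rng.normal(scale=R, size=t)   # R-sub-Gaussian noise
    lam_t = 2 * R * np.sqrt((np.log(d) + np.log(t)) / t)
    # scikit-learn's Lasso minimizes (1/(2t))||y - Xb||^2 + alpha*||b||_1, while the
    # (1/t)-normalized objective in Lemma 3.1 uses lam_t*||b||_1, so alpha = lam_t / 2.
    fit = Lasso(alpha=lam_t / 2, fit_intercept=False, max_iter=10000).fit(X, y)
    print(t, np.abs(fit.coef_ - beta).sum())     # l1 error shrinks roughly like sqrt(log(dt)/t)
```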
### 3.2 Doubly-robust pseudo-reward

To overcome the non-compatibility problem in the CBL setting, Bastani and Bayati (2015) proposed to impose forced-sampling of each arm at predefined $O(\log T)$ time steps, which produces i.i.d. data. The following lemma shows that when the $x_\tau$'s are i.i.d. and $\mathbb{E}(x_\tau x_\tau^T) \in \mathcal{C}(\mathrm{supp}(\beta), \phi)$, then $\hat\Sigma_t \in \mathcal{C}(\mathrm{supp}(\beta), \phi/\sqrt{2})$ with high probability.

Lemma 3.3. (Lemma EC.6 of Bastani and Bayati, 2015) Let $x_1, x_2, \dots, x_t$ be i.i.d. random vectors in $\mathbb{R}^d$ with $\|x_\tau\| \le 1$ for all $\tau$. Let $\Sigma = \mathbb{E}[x_\tau x_\tau^T]$ and $\hat\Sigma_t = \frac{1}{t}\sum_{\tau=1}^t x_\tau x_\tau^T$. Suppose that $\Sigma \in \mathcal{C}(\mathrm{supp}(\beta), \phi)$. Then if $t \ge \frac{3}{c^2}\log d$,
$$\mathbb{P}\Big[\hat\Sigma_t \in \mathcal{C}\big(\mathrm{supp}(\beta), \phi/\sqrt{2}\big)\Big] \ge 1 - \exp(-c^2 t), \qquad (4)$$
where $c = \min\big(0.5, \frac{\phi^2}{256 s_0}\big)$.

We derive the following corollary simply by requiring the right-hand side of (4) to be larger than $1 - \delta'/t^2$.

Corollary 3.4. Suppose that the conditions of Lemma 3.3 are satisfied. Let $z_T = \max\big\{\frac{3}{c^2}\log d,\ \frac{1}{c^2}\log(T^2/\delta')\big\}$ for some $\delta' \in (0, 1)$. Then with probability at least $1 - \delta'$,
$$\hat\Sigma_t \in \mathcal{C}\big(\mathrm{supp}(\beta), \phi/\sqrt{2}\big) \quad \text{for all } t \ge z_T. \qquad (5)$$

The setting of Bastani and Bayati (2015) is different from ours in that they assume the context variable is the same for all arms but the regression parameter differs by arms. Bastani and Bayati (2015) maintained two sets of estimators for each arm, one based on the forced samples and one based on all samples. Whereas the latter is not based on i.i.d. data, using the forced-sample estimator as a pre-processing step that selects a subset of arms, and then using the all-sample estimator to select the best arm among this subset, guarantees that there are $O(T)$ i.i.d. samples for each arm, so the all-sample estimator converges at a fast rate as well.

We propose a different approach to resolve non-compatibility, based on the doubly-robust technique from the missing data literature. We define the filtration $\mathcal{F}_{t-1}$ as the union of the observations up to time $t-1$ and the contexts given at time t, i.e.,
$$\mathcal{F}_{t-1} = \big\{\{b_i(1)\}_{i=1}^N, a(1), r_{a(1)}(1), \dots, \{b_i(t-1)\}_{i=1}^N, a(t-1), r_{a(t-1)}(t-1), \{b_i(t)\}_{i=1}^N\big\}.$$
Given $\mathcal{F}_{t-1}$, we pull the arm $a(t)$ randomly according to the probability vector $\pi(t) = [\pi_1(t), \dots, \pi_N(t)]$, where $\pi_i(t) = \mathbb{P}[a(t) = i \mid \mathcal{F}_{t-1}]$ is the probability of pulling the i-th arm at time t given $\mathcal{F}_{t-1}$. We specify the values of $\pi(t)$ later in Section 3.3. Let $\hat\beta(t-1)$ be the $\beta$ estimate given $\mathcal{F}_{t-1}$. After the reward $r_{a(t)}(t)$ is observed, we construct a doubly-robust pseudo-reward:
$$\hat{r}(t) = \bar{b}(t)^T\hat\beta(t-1) + \frac{r_{a(t)}(t) - b_{a(t)}(t)^T\hat\beta(t-1)}{N\pi_{a(t)}(t)}.$$
Whether or not $\hat\beta(t-1)$ is a valid estimate, this quantity has conditional expectation $\bar{b}(t)^T\beta$ provided that $\pi_i(t) > 0$ for all i:
$$\mathbb{E}[\hat{r}(t) \mid \mathcal{F}_{t-1}] = \mathbb{E}\Big[\frac{1}{N}\sum_{i=1}^N \Big\{\Big(1 - \frac{I(a(t)=i)}{\pi_i(t)}\Big) b_i(t)^T\hat\beta(t-1) + \frac{I(a(t)=i)}{\pi_i(t)}\, r_i(t)\Big\} \,\Big|\, \mathcal{F}_{t-1}\Big] = \frac{1}{N}\sum_{i=1}^N \mathbb{E}\big[r_i(t) \mid \mathcal{F}_{t-1}\big] = \bar{b}(t)^T\beta.$$
Thus, by weighting the observed reward $r_{a(t)}(t)$ with the inverse of its observation probability $\pi_{a(t)}(t)$, we obtain an unbiased estimate of the reward corresponding to the average context $\bar{b}(t)$ instead of $b_{a(t)}(t)$. Applying Lasso regression to the pairs $(\bar{b}(t), \hat{r}(t))$ instead of $(b_{a(t)}(t), r_{a(t)}(t))$, we can make use of the i.i.d. assumption A2 and the compatibility assumption A3 with Corollary 3.4 to achieve a fast convergence rate for $\beta$ as in (3). We later show that when $x_\tau$ in Corollary 3.4 corresponds to $\bar{b}(\tau)$, (5) holds with $z_T = O(s_0^2\log(dT))$.

Since the allocation probability $\pi_{a(t)}(t)$ is known given $\mathcal{F}_{t-1}$, the simple inverse probability weighting (IPW) estimator $\tilde{r}(t) := \frac{r_{a(t)}(t)}{N\pi_{a(t)}(t)}$ also gives an unbiased estimate of the reward corresponding to the average context. We show in the next paragraphs, however, that the doubly-robust estimator has constant variance under a minor condition on the weight $\pi_{a(t)}(t)$, while the IPW estimator has variance that increases with t under the same condition. Constant variance is crucial for the performance of the algorithm, since it affects the convergence property of $\hat\beta(t)$ in (3) and eventually the regret upper bound in Theorem 4.1. We can make the IPW estimator have constant variance as well under a stronger condition on $\pi_{a(t)}(t)$, but this stronger condition is shown to result in regret linear in T.
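The contrast between the two estimators can be checked numerically. The following minimal NumPy sketch uses toy values and a deliberately skewed selection probability (both our own choices, not from the paper): it confirms that the doubly-robust pseudo-reward and the plain IPW estimate are both unbiased for $\bar{b}(t)^T\beta$, and that the doubly-robust version has a noticeably smaller spread when some $\pi_i(t)$ are small and $\hat\beta(t-1)$ is reasonably accurate.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 20
beta = np.zeros(d); beta[:3] = 0.5                      # sparse true parameter
B = rng.normal(size=(N, d)) / np.sqrt(d)                # one round of contexts b_1(t),...,b_N(t)
b_bar = B.mean(axis=0)                                  # average context
beta_hat = beta + rng.normal(scale=0.05, size=d)        # an imperfect but reasonable estimate
pi = np.array([0.85, 0.05, 0.05, 0.025, 0.025])         # skewed arm-selection probabilities

M = 200_000
a = rng.choice(N, size=M, p=pi)                         # simulated pulls a(t)
r = B[a] @ beta + rng.normal(scale=0.05, size=M)        # observed rewards r_{a(t)}(t)
dr = b_bar @ beta_hat + (r - B[a] @ beta_hat) / (N * pi[a])   # doubly-robust pseudo-reward
ipw = r / (N * pi[a])                                   # plain inverse-probability weighting

print(b_bar @ beta, dr.mean(), ipw.mean())              # all close: both estimators are unbiased
print(dr.std(), ipw.std())                              # DR typically shows the smaller spread
```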
Let $\tilde{R}^2$ be the variance of $\hat{r}(t)$ given $\mathcal{F}_{t-1}$. We have
$$\tilde{R}^2 := \mathrm{Var}[\hat{r}(t) \mid \mathcal{F}_{t-1}] = \mathbb{E}\big[(\hat{r}(t) - \bar{b}(t)^T\beta)^2 \mid \mathcal{F}_{t-1}\big] = \mathbb{E}\Big[\Big\{\bar{b}(t)^T\big(\hat\beta(t-1) - \beta\big) + \frac{\eta_{a(t)}(t)}{N\pi_{a(t)}(t)} + \frac{b_{a(t)}(t)^T\big(\beta - \hat\beta(t-1)\big)}{N\pi_{a(t)}(t)}\Big\}^2 \,\Big|\, \mathcal{F}_{t-1}\Big]. \qquad (6)$$
Suppose $\hat\beta(t-1)$ satisfies the Lasso convergence property (3), i.e., $\|\hat\beta(t-1) - \beta\|_1 \le O\big(\sqrt{(\log d + \log t)/t}\big)$ for $t > z_T$. Then we can bound the first and third terms of (6) by a constant under the following restriction:
$$\pi_{a(t)}(t) = \frac{1}{N} \ \text{ for all } t \le z_T, \qquad \pi_{a(t)}(t) \ge O(1)\cdot\sqrt{(\log d + \log t)/t} \ \text{ for all } t > z_T. \qquad (7)$$
Due to assumption A4 on the sub-Gaussian error $\eta$, the second term of (6) is also bounded by a constant under (7). Hence, we have $\tilde{R}^2 \le O(1)$.

Therefore, if (7) holds and if $\hat\beta(z_T), \hat\beta(z_T+1), \dots, \hat\beta(t-1)$ all satisfy the Lasso convergence property (3), then the pseudo-rewards $\hat{r}(1), \dots, \hat{r}(t)$ all have constant variance. Consequently, the estimate $\hat\beta(t)$ based on $\{\bar{b}(\tau), \hat{r}(\tau)\}_{\tau=1}^t$ satisfies (3). Meanwhile, the restriction (7) leads to a suboptimal choice of arms at each t with a probability that decreases with time. In Theorem 4.1, we prove that the regret due to this suboptimal choice is sublinear in T.

The variance of $\tilde{r}(t)$ given the filtration $\mathcal{F}_{t-1}$ is
$$\mathbb{E}\Big[\Big\{\frac{\eta_{a(t)}(t)}{N\pi_{a(t)}(t)} + \frac{b_{a(t)}(t)^T\beta}{N\pi_{a(t)}(t)} - \bar{b}(t)^T\beta\Big\}^2 \,\Big|\, \mathcal{F}_{t-1}\Big].$$
As opposed to the first and third terms in (6), the term $b_{a(t)}(t)^T\beta/(N\pi_{a(t)}(t))$ still increases with t under (7). To achieve constant variance, we would need $\pi_{a(t)}(t)$ to be larger than a predetermined constant value, $\frac{1}{N}p_{\min}$. Simply replacing $\pi_{a(t)}(t)$ in $\tilde{r}(t)$ with the truncated value $\max\{\frac{1}{N}p_{\min}, \pi_{a(t)}(t)\}$ produces a biased estimate of $\bar{b}(t)^T\beta$ when $\pi_{a(t)}(t)$ is actually smaller than $\frac{1}{N}p_{\min}$, and the resulting estimate $\hat\beta(t)$ does not satisfy the convergence property (3). If we instead directly constrain $\pi_{a(t)}(t)$ to be larger than $\frac{1}{N}p_{\min}$, this leads to a suboptimal choice of arms at each t with constant probability, so the regret increases linearly in T.

### 3.3 Doubly-Robust Lasso Bandit algorithm

To make $\pi_{a(t)}(t) = \frac{1}{N}$ for all $t \le z_T$, we simply pull arms according to the uniform distribution when $t \le z_T$. Then, to ensure $\pi_{a(t)}(t) \ge O(1)\cdot\sqrt{(\log d + \log t)/t}$ for all $t > z_T$, we randomize the arm selection at each step between uniform random selection and deterministic selection using the $\beta$ estimate of the last step, $\hat\beta(t-1)$. This can be implemented in two stages. First, sample $m_t$ from a Bernoulli distribution with mean $\lambda_{1t} = \lambda_1\sqrt{(\log t + \log d)/t}$, where $\lambda_1 > 0$ is a tuning parameter. Then, if $m_t = 1$, pull the arm $a(t) = i$ with probability $1/N$; otherwise, pull the arm $a(t) = \mathrm{argmax}_{1 \le i \le N}\{b_i(t)^T\hat\beta(t-1)\}$. Under this procedure, we have
$$\pi_{a(t)}(t) = \lambda_{1t}/N + (1 - \lambda_{1t})\, I\big(a(t) = \mathrm{argmax}_{1 \le i \le N}\{b_i(t)^T\hat\beta(t-1)\}\big),$$
which satisfies (7). In practice, we treat $z_T$ as a tuning parameter.

The proposed Doubly-Robust (DR) Lasso Bandit algorithm is presented in Algorithm 1. The algorithm selects $a(t)$ via the two stages, computes $\pi_{a(t)}(t)$, constructs the pseudo-reward $\hat{r}(t)$ and the average context $\bar{b}(t)$, and updates the $\beta$ estimate using Lasso regression on the pairs $(\bar{b}(t), \hat{r}(t))$. Recall that the regularization parameter should be set as $\lambda_{2t} = O\big(\sqrt{\log(d/\delta)/t}\big)$ to guarantee (3) for a fixed t. To guarantee (3) for every $t > z_T$, we substitute $\delta/t^2$ for $\delta$. Hence, at time t we update $\lambda_{2t} = \lambda_2\sqrt{(\log t + \log d)/t}$, where $\lambda_2 > 0$ is a tuning parameter. The algorithm requires three tuning parameters in total, while Bastani and Bayati (2015) and Wang et al. (2018) require four.

    Algorithm 1: DR Lasso Bandit
    Input parameters: λ1, λ2, z_T
    Set β̂(0) = 0_d, S = {}
    for t = 1, 2, ..., T do
        Observe {b_1(t), ..., b_N(t)} ~ P_b
        if t ≤ z_T then
            Pull arm a(t) = i with probability 1/N (i = 1, ..., N);  π_{a(t)}(t) ← 1/N
        else
            λ_{1t} ← λ1 √((log t + log d)/t);  sample m_t ~ Ber(λ_{1t})
            if m_t = 1 then
                Pull arm a(t) = i with probability 1/N (i = 1, ..., N)
            else
                Pull arm a(t) = argmax_{1≤i≤N} b_i(t)^T β̂(t-1)
            end if
            π_{a(t)}(t) ← λ_{1t}/N + (1 - λ_{1t}) I{a(t) = argmax_{1≤i≤N} b_i(t)^T β̂(t-1)}
        end if
        Observe r_{a(t)}(t)
        b̄(t) ← (1/N) Σ_{i=1}^N b_i(t);  r̂(t) ← b̄(t)^T β̂(t-1) + (r_{a(t)}(t) - b_{a(t)}(t)^T β̂(t-1)) / (N π_{a(t)}(t))
        S ← S ∪ {(b̄(t), r̂(t))}
        λ_{2t} ← λ2 √((log t + log d)/t)
        β̂(t) ← argmin_β { (1/t) Σ_{(b̄,r̂)∈S} (r̂ - b̄^T β)^2 + λ_{2t} ||β||_1 }
    end for
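For concreteness, the following Python sketch (NumPy + scikit-learn) mirrors Algorithm 1. The `contexts` array, the `rewards` callback, the clamping of $\lambda_{1t}$ to a valid probability, and the mapping from $\lambda_{2t}$ to scikit-learn's `alpha` are our own illustrative choices, not specifications from the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

def dr_lasso_bandit(contexts, rewards, lam1, lam2, zT, seed=0):
    """Sketch of the DR Lasso Bandit loop.

    contexts : (T, N, d) array; contexts[t-1] holds b_1(t), ..., b_N(t)
    rewards  : callable rewards(t, a) returning the observed reward r_a(t)
    """
    rng = np.random.default_rng(seed)
    T, N, d = contexts.shape
    beta_hat = np.zeros(d)
    X, y = [], []                                        # stored pairs (b_bar(t), r_hat(t))
    chosen = np.empty(T, dtype=int)
    for t in range(1, T + 1):
        B = contexts[t - 1]
        greedy = int(np.argmax(B @ beta_hat))
        lam1t = min(1.0, lam1 * np.sqrt((np.log(t) + np.log(d)) / t))   # clamped to a probability
        if t <= zT:
            a, pi_a = int(rng.integers(N)), 1.0 / N
        else:
            a = int(rng.integers(N)) if rng.random() < lam1t else greedy
            pi_a = lam1t / N + (1 - lam1t) * (a == greedy)
        r_a = rewards(t, a)
        chosen[t - 1] = a
        b_bar = B.mean(axis=0)
        r_hat = b_bar @ beta_hat + (r_a - B[a] @ beta_hat) / (N * pi_a)  # DR pseudo-reward
        X.append(b_bar)
        y.append(r_hat)
        lam2t = lam2 * np.sqrt((np.log(t) + np.log(d)) / t)
        # scikit-learn's Lasso minimizes (1/(2t))||y - Xb||^2 + alpha*||b||_1; matching the
        # (1/t)-normalized objective with penalty lam2t*||b||_1 gives alpha = lam2t / 2.
        fit = Lasso(alpha=lam2t / 2, fit_intercept=False, max_iter=5000)
        fit.fit(np.asarray(X), np.asarray(y))
        beta_hat = fit.coef_
    return beta_hat, chosen
```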
## 4 Regret analysis

Under (1), the regret at time t is $\mathrm{regret}(t) = b_{a^*(t)}(t)^T\beta - b_{a(t)}(t)^T\beta$, where $a^*(t) = \mathrm{argmax}_{1 \le i \le N}\{b_i(t)^T\beta\}$. We establish the high-probability regret upper bound for Algorithm 1 in the following theorem.

Theorem 4.1. Suppose (1) and assumptions A1, A2, A3, and A4 hold. Then, for $\delta \in (0, \frac{1}{2})$ and $\delta' \in (0, \delta)$, with probability at least $1 - 2\delta$,
$$R(T) \le \max\Big\{\frac{3}{C(\phi_1)^2}\log d,\ \frac{1}{C(\phi_1)^2}\log\big(T^2/\delta'\big)\Big\} + 2\lambda_1\sqrt{T\log(dT)} + \sqrt{T\log(1/\delta)/2} + \frac{128}{\phi_1^2}\, s_0\tilde{R}\,\sqrt{T\log\Big(\frac{edT^2}{\delta\delta'}\Big)},$$
where $C(\phi_1) = \min\big(0.5, \frac{\phi_1^2}{256 s_0}\big)$ and $\tilde{R} = O(1)$. We provide a sketch of the proof in the next section. A complete proof is presented in the Supplementary Material.

### 4.1 Sketch of proof for Theorem 4.1

In the DR Lasso Bandit algorithm, each of the T decision steps corresponds to one of the following three groups.
(a) $t \le z_T$: the arm is pulled according to the uniform distribution.
(b) $t > z_T$ and $m_t = 1$: the arm is pulled according to the uniform distribution.
(c) $t > z_T$ and $m_t = 0$: the arm $a(t) = \mathrm{argmax}_{1 \le i \le N}\{b_i(t)^T\hat\beta(t-1)\}$ is pulled.
We let $z_T = \max\big\{\frac{3}{C(\phi_1)^2}\log d,\ \frac{1}{C(\phi_1)^2}\log(T^2/\delta')\big\}$. The strategy of the proof is to bound the regret from each group separately and sum the results. Due to assumption A1 on the norms of $b_i(t)$ and $\beta$, $\mathrm{regret}(t)$ is not larger than 1 in any case. Therefore, the sum of regrets from group (a) is at most $z_T$, which corresponds to the first term in Theorem 4.1. We now denote the sums of regrets from group (b) and group (c) by $R(T, b)$ and $R(T, c)$, respectively. We first bound $R(T, b)$ in Lemma 4.2, which follows from the fact that $m_t$ is a Bernoulli variable with mean $\lambda_{1t}$ together with Hoeffding's inequality.

Lemma 4.2. With probability at least $1 - \delta$,
$$R(T, b) \le \sum_{t=z_T}^{T} \lambda_{1t} + \sqrt{T\log(1/\delta)/2} \le 2\lambda_1\sqrt{T\log(dT)} + \sqrt{T\log(1/\delta)/2}.$$

To bound $R(T, c)$, we further decompose group (c) into the following two subgroups, where we set $d_t = \frac{128}{\phi_1^2}\, s_0\tilde{R}\,\sqrt{\log\big(edt^2/(\delta\delta')\big)/t}$ with $\tilde{R} = O(1)$.
(c-a) $t > z_T$, $m_t = 0$: $a(t) = \mathrm{argmax}_{1 \le i \le N}\{b_i(t)^T\hat\beta(t-1)\}$ is pulled and $\|\hat\beta(t-1) - \beta\|_1 \le d_{t-1}$.
(c-b) $t > z_T$, $m_t = 0$: $a(t) = \mathrm{argmax}_{1 \le i \le N}\{b_i(t)^T\hat\beta(t-1)\}$ is pulled and $\|\hat\beta(t-1) - \beta\|_1 > d_{t-1}$.
We first show in Lemma 4.3 that the subgroup (c-b) is empty with high probability. Since $\hat\beta(t)$ is computed from the pairs $\{\bar{b}(\tau), \hat{r}(\tau)\}_{\tau=1}^t$, where the average contexts $\bar{b}(\tau)$ satisfy the compatibility condition (5) and the pseudo-rewards $\hat{r}(\tau)$ are unbiased with constant variance, Lemma 3.2 can be applied to $\hat\beta(t)$.

Lemma 4.3. Suppose (1) and assumptions A1, A2, A3, and A4 hold. Let $d_t = \frac{128}{\phi_1^2}\, s_0\tilde{R}\,\sqrt{\log\big(edt^2/(\delta\delta')\big)/t}$. Then for $\delta \in (0, 1)$ and every $t \ge z_T$,
$$\mathbb{P}\Big[\|\hat\beta(t) - \beta\|_1 > d_t \,\Big|\, \|\hat\beta(t-1) - \beta\|_1 < d_{t-1}, \dots, \|\hat\beta(z_T) - \beta\|_1 < d_{z_T}\Big] \le \frac{\delta}{t^2}.$$
Hence, with probability at least $1 - \delta$, $\|\hat\beta(t) - \beta\|_1 < d_t$ for every $t \ge z_T$.

Finally, when t belongs to group (c-a), $\mathrm{regret}(t)$ is trivially bounded by $d_t$. Hence we can bound $R(T, c)$ as in the following lemma.

Lemma 4.4. With probability at least $1 - \delta$,
$$R(T, c) \le \frac{128}{\phi_1^2}\, s_0\tilde{R}\,\sqrt{T\log\Big(\frac{edT^2}{\delta\delta'}\Big)}.$$

## 5 Simulation study

We conduct simulation studies to evaluate the proposed DR Lasso Bandit and the Lasso Bandit (Bastani and Bayati, 2015). We set N = 10, 20, 50, or 100, d = 100, and $s_0 = 5$. For each fixed $j = 1, \dots, d$, we generate $[b_{1j}(t), \dots, b_{Nj}(t)]^T$ from $\mathcal{N}(0_N, V)$, where $V(i,i) = 1$ for every i and $V(i,k) = \rho^2$ for every $i \ne k$. We experiment with two cases for $\rho^2$: $\rho^2 = 0.3$ (weak correlation) and $\rho^2 = 0.7$ (strong correlation). We generate $\eta_i(t) \overset{i.i.d.}{\sim} \mathcal{N}(0, 0.05^2)$ and the rewards from (1). We set $\|\beta\|_0 = s_0$ and generate the $s_0$ nonzero elements from a uniform distribution on [0, 1]. We conduct 10 replications for each case.
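A minimal sketch of this data-generating process (NumPy; the function name, the seed handling, and the absence of context normalization are our own illustrative choices, not specifications from the paper):

```python
import numpy as np

def generate_instance(T, N, d, s0, rho2, noise_sd=0.05, seed=0):
    """Correlated-arm simulation design of Section 5 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(d)
    beta[rng.choice(d, s0, replace=False)] = rng.uniform(0.0, 1.0, s0)   # s0 nonzeros ~ U[0,1]
    V = np.full((N, N), rho2)
    np.fill_diagonal(V, 1.0)                            # V_ii = 1, V_ik = rho2 for i != k
    L = np.linalg.cholesky(V)
    # For every feature j and time t, [b_1j(t), ..., b_Nj(t)] ~ N(0_N, V)
    contexts = np.einsum('ik,tkj->tij', L, rng.normal(size=(T, N, d)))
    rewards = contexts @ beta + rng.normal(scale=noise_sd, size=(T, N))  # r_i(t) = b_i(t)^T beta + eta_i(t)
    return contexts, rewards, beta

# Example wiring with the dr_lasso_bandit sketch above (hypothetical tuning values):
# contexts, rewards, beta = generate_instance(T=1000, N=20, d=100, s0=5, rho2=0.7)
# beta_hat, chosen = dr_lasso_bandit(contexts, lambda t, a: rewards[t-1, a],
#                                    lam1=0.5, lam2=0.05, zT=10)
```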
The Lasso Bandit algorithm can be applied in our setting by considering an Nd-dimensional context vector $b(t) = [b_1(t)^T, \dots, b_N(t)^T]^T$ and an Nd-dimensional regression parameter $\beta_i$ for each arm i, where $\beta_i = [I(i=1)\beta^T, \dots, I(i=N)\beta^T]^T$. For each algorithm, we consider several candidate values for the tuning parameters and report the best results. For DR Lasso Bandit, we advise truncating the value $\hat{r}(t)$ so that it does not explode.

[Figure 1: Median (solid), 1st and 3rd quartiles (dashed) of R(t) over 10 experiments.]

Figure 1 shows the cumulative regret R(t) as a function of time t. When the number of arms N is as small as 10, Lasso Bandit converges faster to the optimal decision rule than DR Lasso Bandit. However, we notice that the regret of Lasso Bandit in the early stage is larger than that of DR Lasso Bandit, which is due to the forced-sampling of arms. With a larger number of arms, DR Lasso Bandit outperforms Lasso Bandit. Compared to Lasso Bandit, the regret of DR Lasso Bandit decreases dramatically as the correlation among the contexts of different arms increases. We also verify that the performance of the proposed method is not sensitive to N, while the regret of Lasso Bandit increases with N.

## 6 Concluding remarks

Sequential decision problems involving web or mobile-based data call for contextual MAB methods that can handle high-dimensional covariates. In this paper, we propose a novel algorithm which enjoys the convergence properties of the Lasso estimator for i.i.d. data via a doubly-robust technique established in the missing data literature. The proposed algorithm attains a tight high-probability regret upper bound which depends on a polylogarithmic function of the dimension and does not depend on the number of arms, overcoming weaknesses of the existing algorithms.

## References

Abbasi-Yadkori, Y., Pál, D. and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320, 2011.

Abbasi-Yadkori, Y., Pál, D. and Szepesvári, C. Online-to-confidence-set conversions and application to sparse stochastic bandits. In Artificial Intelligence and Statistics, pp. 1–9, 2012.

Agrawal, S. and Goyal, N. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, pp. 127–135, 2013.

Auer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397–422, 2002.

Bang, H. and Robins, J. M. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005.

Bastani, H. and Bayati, M. Online decision-making with high-dimensional covariates. Available at SSRN 2661896, 2015.

Bickel, P., Ritov, Y. and Tsybakov, A. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37:1705–1732, 2009.

Carpentier, A. and Munos, R. Bandit theory meets compressed sensing for high dimensional stochastic linear bandit. In Artificial Intelligence and Statistics (AISTATS), pp. 190–198, 2012.

Chu, W., Park, S. T., Beaupre, T., Motgi, N., Phadke, A., Chakraborty, S. and Zachariah, J. A case study of behavior-driven conjoint analysis on Yahoo!: front page today module. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1097–1104, 2009.

Chu, W., Li, L., Reyzin, L. and Schapire, R. E. Contextual bandits with linear payoff functions. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pp. 208–214, 2011.

Dani, V., Hayes, T. P. and Kakade, S. M. Stochastic linear optimization under bandit feedback.
In Conference on Learning Theory, pp. 355–366, 2008.

Dimakopoulou, M., Zhou, Z., Athey, S. and Imbens, G. Balanced linear contextual bandits. arXiv preprint arXiv:1812.06227, 2018.

Dudík, M., Erhan, D., Langford, J. and Li, L. Doubly robust policy evaluation and optimization. Statistical Science, 29(4):485–511, 2014.

Farajtabar, M., Chow, Y. and Ghavamzadeh, M. More robust doubly robust off-policy evaluation. In Proceedings of the 35th International Conference on Machine Learning, pp. 1446–1455, 2018.

Ferreira, K., Simchi-Levi, D. and Wang, H. Online network revenue management using Thompson sampling. Operations Research, 66(6):1586–1602, 2018.

Gilton, D. and Willett, R. Sparse linear contextual bandits via relevance vector machines. In International Conference on Sampling Theory and Applications (SampTA), pp. 518–522, 2017.

Jiang, N. and Li, L. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pp. 652–661, 2016.

Kawale, J., Bui, H. H., Kveton, B., Tran-Thanh, L. and Chawla, S. Efficient Thompson sampling for online matrix-factorization recommendation. In Advances in Neural Information Processing Systems, pp. 1297–1305, 2015.

Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.

Langford, J., Strehl, A. and Wortman, J. Exploration scavenging. In Proceedings of the 25th International Conference on Machine Learning, pp. 528–535, 2008.

Li, L., Chu, W., Langford, J. and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661–670, 2010.

Robbins, H. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.

Robins, J. M., Rotnitzky, A. and Zhao, L. P. Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866, 1994.

Schwartz, E. M., Bradlow, E. T. and Fader, P. S. Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Science, 36(4):500–522, 2017.

Tewari, A. and Murphy, S. A. From ads to interventions: contextual bandits in mobile health. In Mobile Health, pp. 495–517. Springer, Cham, 2017.

Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

Tipping, M. E. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001.

van de Geer, S. The deterministic Lasso. In JSM Proceedings, American Statistical Association, 2007 (see also http://stat.ethz.ch/research/research reports/2007/140).

van de Geer, S. A. and Bühlmann, P. On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 3:1360–1392, 2009.

Wang, X., Wei, M. M. and Yao, T. Minimax concave penalized multi-armed bandit model with high-dimensional covariates. In Proceedings of the 35th International Conference on Machine Learning, pp. 5187–5195, 2018.

Yahoo! Webscope. Yahoo! Front Page Today Module User Click Log Dataset, version 1.0. http://webscope.sandbox.yahoo.com. Accessed: 09/01/2019.

Zhang, C. H. Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38(2):894–942, 2010.