# Langevin Monte Carlo for Contextual Bandits

Pan Xu¹, Hongkai Zheng¹, Eric Mazumdar¹, Kamyar Azizzadenesheli², Anima Anandkumar¹

We study the efficiency of Thompson sampling for contextual bandits. Existing Thompson sampling-based algorithms need to construct a Laplace approximation (i.e., a Gaussian distribution) of the posterior distribution, which is inefficient to sample in high-dimensional applications for general covariance matrices. Moreover, the Gaussian approximation may not be a good surrogate for the posterior distribution for general reward generating functions. We propose an efficient posterior sampling algorithm, viz., Langevin Monte Carlo Thompson Sampling (LMC-TS), that uses Markov Chain Monte Carlo (MCMC) methods to directly sample from the posterior distribution in contextual bandits. Our method is computationally efficient since it only needs to perform noisy gradient descent updates without constructing the Laplace approximation of the posterior distribution. We prove that the proposed algorithm achieves the same sublinear regret bound as the best Thompson sampling algorithms for a special case of contextual bandits, viz., linear contextual bandits. We conduct experiments on both synthetic data and real-world datasets on different contextual bandit models, which demonstrates that directly sampling from the posterior is both computationally efficient and competitive in performance.

¹Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA. ²Department of Computer Science, Purdue University, West Lafayette, IN, USA. Correspondence to: Pan Xu.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

## 1. Introduction

A bandit problem is a sequential decision-making problem wherein an agent, in each round, observes an action set, chooses an action (or arm) from the set, and then observes a reward from the environment. A bandit learning algorithm aims to learn a policy for the agent to maximize its cumulative rewards based on its historical observations of actions and rewards. In the vast majority of real-world applications, each arm is usually associated with side information in the form of a feature or context vector that describes the arm. The mean reward for an arm is expressed as some unknown function of the arm's feature vector and an unknown weight parameter that is shared across different arms. This setting, known as the contextual bandit problem, has been extensively studied in the literature (Langford & Zhang, 2007; Chu et al., 2011; Abbasi-Yadkori et al., 2011; Agrawal & Goyal, 2013; Filippi et al., 2010; Li et al., 2017; Lale et al., 2019; Kveton et al., 2020). The main challenge in contextual bandit problems is addressing the well-known exploitation versus exploration trade-off, which requires a careful balance between choosing the myopically better arm and choosing an under-sampled worse arm. Existing algorithms for maximizing the cumulative reward in bandit problems mainly follow one of the following two algorithmic frameworks.
The first framework follows the principle of optimism in the face of uncertainty (OFU), and algorithms designed using such ideas have been widely applied to both finite-armed bandits, also known as multi-armed bandits (MAB) (Auer et al., 2002; Ménard & Garivier, 2017), and contextual bandits (Chu et al., 2011; Abbasi-Yadkori et al., 2011; Li et al., 2017; Zhou et al., 2020; Xu et al., 2022). The second dominant category of bandit algorithms makes use of the idea of Thompson or posterior sampling (Thompson, 1933). Such algorithms have been widely used in practice due to their ease of implementation and impressive empirical performance, and have only recently started to be well understood theoretically in multi-armed bandits (Agrawal & Goyal, 2012; Kaufmann et al., 2012; Russo & Van Roy, 2014; Jin et al., 2021) and contextual bandits (Chapelle & Li, 2011; Agrawal & Goyal, 2013; Riquelme et al., 2018; Wang & Zhou, 2020; Zhang et al., 2021).

One crucial area in which the two types of algorithms differ is in their ease of implementation. In contextual bandit problems, algorithms based on the OFU principle usually need to solve a bi-linear optimization problem, making them computationally expensive to implement outside of simple problems despite coming with stronger theoretical guarantees. In contrast, Thompson sampling algorithms only need to solve a linear optimization problem over the arm set, since the uncertainty in the posterior distribution automatically accounts for the exploration in the parameter space. Furthermore, Thompson sampling has been observed to be empirically competitive with, or sometimes even better than, OFU-based algorithms (Chapelle & Li, 2011).

Most existing Thompson sampling algorithms first construct a Laplace approximation (which is essentially a Gaussian distribution) (Chapelle & Li, 2011) of the underlying posterior distribution on the data and then sample from the Gaussian distribution to explore in the parameter space. The Laplace approximation in Thompson sampling usually leads to a non-isotropic covariance matrix. It is well known that sampling from a Gaussian distribution with a general covariance matrix is usually computationally expensive in high-dimensional applications. Moreover, when the reward generating function is nonlinear with respect to the weight parameter, such as in generalized linear bandits (Kveton et al., 2020) and neural contextual bandits (Riquelme et al., 2018; Zhang et al., 2021), the true posterior distribution may not be well approximated by a Gaussian distribution, and thus the Laplace approximation could be a poor surrogate for the posterior distribution.

**Our approach:** In this paper, we propose an algorithm, viz., Langevin Monte Carlo Thompson Sampling (LMC-TS), which directly samples from the data posterior distribution instead of a Laplace approximation in contextual bandits. In particular, by incorporating Langevin Monte Carlo (Bakry et al., 2014), our algorithm only needs to perform noisy gradient descent updates, which can generate samples that provide a good approximation of the posterior with arbitrary accuracy if it is run for sufficiently many steps.
This is in contrast with the Laplace approximation for Thompson sampling (Chapelle & Li, 2011), which has a fixed approximation error for the posterior distribution, so that the covariance matrix needs to be carefully redesigned in different contextual bandit problems to achieve reasonable performance (Chapelle & Li, 2011; Kveton et al., 2020; Riquelme et al., 2018; Zhang et al., 2021). Moreover, due to the simplicity of noisy gradient descent updates, the proposed algorithm is directly applicable to many bandit problems where deep neural network function classes are used.

The contributions of this paper are summarized as follows:

- We propose a practical and efficient bandit algorithm, LMC-TS, which only needs to perform noisy gradient descent updates to approximately sample from the data posterior distribution. LMC-TS is easily implementable and scalable to large-scale and high-dimensional problems, including deep learning applications. It also works simultaneously for a large class of contextual bandit models including linear contextual bandits, generalized linear bandits, and neural contextual bandits.
- We theoretically prove that LMC-TS achieves a $\widetilde{O}(d\sqrt{dT})$ regret for linear contextual bandits, where $d$ is the dimension of the problem and $T$ is the time horizon. This result matches the best regret bound for Thompson sampling algorithms in linear contextual bandits (Agrawal & Goyal, 2013).
- We further conduct thorough experiments on both synthetic datasets and real-world datasets (UCI machine learning datasets and the high-dimensional image dataset CIFAR10) to show that one algorithm is enough for learning many different complex bandit models, by comparing it with different baseline algorithms in linear contextual bandits, generalized linear bandits, and neural contextual bandits respectively.

**Notation.** We use $[k]$ to denote the set $\{1, \ldots, k\}$ for $k \in \mathbb{N}_+$. $\|x\|_2 = \sqrt{x^\top x}$ is the Euclidean norm of a vector $x \in \mathbb{R}^d$. For a matrix $V \in \mathbb{R}^{m \times n}$, we denote by $\|V\|_2$ and $\|V\|_F$ its operator norm and Frobenius norm respectively. For a positive semidefinite matrix $V \in \mathbb{R}^{d \times d}$ and a vector $x \in \mathbb{R}^d$, we denote the Mahalanobis norm as $\|x\|_V = \sqrt{x^\top V x}$. For an event $E$ on a probability space, we denote by $E^c$ its complement event, so that $\mathbb{P}(E) + \mathbb{P}(E^c) = 1$. For a function $f(T)$, we use the common big-O notation $O(f(T))$ to hide constant factors with respect to $T$, and use $\widetilde{O}(f(T))$ to also omit the logarithmic dependence on $T$.

## 2. Preliminary

**Contextual Bandits.** Contextual bandits are a wide class of sequential decision problems, where the player makes a decision based on an observation of an action set consisting of feature vectors as contexts for different actions. In particular, at round $t$, the player observes an action set $\mathcal{X}_t \subseteq \mathbb{R}^d$ and chooses an arm (or action), which is represented by a feature vector $x_t \in \mathcal{X}_t$. Note that in this paper we assume neither that the action set is finite nor that it is fixed across rounds. A reward $r_t$ is then immediately revealed to the agent by the environment. In contextual bandit problems, it is often assumed that the mean reward of an action with feature $x \in \mathbb{R}^d$ is given by a reward generating function $f(x, \theta^*)$ and the observed reward is $r(x) = f(x, \theta^*) + \xi$, where $\theta^* \in \mathbb{R}^d$ is an unknown weight parameter that is shared across all arms, and $\xi$ is a random noise incurred in the observation.
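To make this interaction protocol concrete before turning to specific model families below, here is a minimal Python/NumPy sketch of a contextual bandit environment with a generic reward function $f(x, \theta^*)$; the class and variable names are ours and illustrative, not from the paper's released code.

```python
import numpy as np

class ContextualBanditEnv:
    """Minimal contextual bandit environment: r(x) = f(x, theta_star) + noise."""

    def __init__(self, f, theta_star, noise_std=0.5, n_arms=50, dim=20, seed=0):
        self.f = f                    # mean-reward function f(x, theta)
        self.theta_star = theta_star  # weight parameter, unknown to the learner
        self.noise_std = noise_std
        self.n_arms, self.dim = n_arms, dim
        self.rng = np.random.default_rng(seed)

    def observe_action_set(self):
        # A fresh action set X_t of unit-norm feature vectors.
        X = self.rng.normal(size=(self.n_arms, self.dim))
        return X / np.linalg.norm(X, axis=1, keepdims=True)

    def pull(self, x):
        # Observed reward: mean reward plus (here Gaussian) noise.
        return self.f(x, self.theta_star) + self.noise_std * self.rng.normal()

# Example usage with a linear reward model f(x, theta) = x^T theta.
theta_star = np.random.default_rng(1).normal(size=20)
theta_star /= np.linalg.norm(theta_star)
env = ContextualBanditEnv(lambda x, th: x @ th, theta_star)
X_t = env.observe_action_set()
r_t = env.pull(X_t[0])
```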
For instance, in linear contextual bandits (Chu et al., 2011; Abbasi-Yadkori et al., 2011; Agrawal & Goyal, 2013), we have $\theta^* \in \mathbb{R}^d$ and $f(x, \theta^*) = x^\top \theta^*$; in generalized linear bandits (Filippi et al., 2010; Li et al., 2017; Kveton et al., 2020; Ding et al., 2021), we have $f(x, \theta^*) = \mu(x^\top \theta^*)$ for some link function $\mu(\cdot)$; and in neural contextual bandits (Riquelme et al., 2018; Zhou et al., 2020; Zhang et al., 2021; Xu et al., 2022), $f(x, \theta^*)$ is a neural network, where $\theta^*$ is the concatenation of all weight parameters and $x$ is the input feature vector.

The objective of a bandit algorithm is to maximize the cumulative rewards over a time horizon $T$, which is equivalent to minimizing the following pseudo regret (Lattimore & Szepesvári, 2020):
$$R(T) = \sum_{t=1}^{T} \big( \mathbb{E}[r(x_t^*)] - \mathbb{E}[r(x_t)] \big), \qquad (2.1)$$
where $x_t \in \mathcal{X}_t$ is the arm chosen by the bandit algorithm at round $t$, and $x_t^* = \arg\max_{x \in \mathcal{X}_t} \mathbb{E}[r(x)]$ is the arm with the maximum expected reward at round $t$. Note that this definition of regret is based on the best oracle arm $x_t^*$, which is more general than the definition based on the best policy achievable within a predefined policy class in adversarial bandits (Bubeck & Cesa-Bianchi, 2012).

**Laplace Approximation Thompson Sampling.** Among the most popular bandit algorithms, Thompson sampling (Thompson, 1933; Chapelle & Li, 2011; Russo et al., 2018) is known to be simple and efficient in practice; it uses a Laplace approximation to approximate the posterior distribution of the data. After $t-1$ rounds of the bandit problem, assume we have collected data $\{x_1, r_1, x_2, r_2, \ldots, x_{t-1}, r_{t-1}\}$. Define the following quantities based on the historical data:
$$V_t = \lambda I + \sum_{s=1}^{t-1} x_s x_s^\top, \qquad b_t = \sum_{s=1}^{t-1} r_s x_s, \qquad (2.2)$$
where $\lambda > 0$ is a regularization parameter. Denote $\hat\theta_t = V_t^{-1} b_t$. At round $t$, the agent receives an action set $\mathcal{X}_t \subseteq \mathbb{R}^d$, which consists of the feature vectors of the candidate actions at round $t$. Linear Thompson sampling (LinTS) (Agrawal & Goyal, 2013) then samples a parameter $\widetilde\theta_t$ from the distribution $\mathcal{N}(\hat\theta_t, v_t^2 V_t^{-1})$ and chooses the arm $x_t = \arg\max_{x \in \mathcal{X}_t} x^\top \widetilde\theta_t$. After that, it observes the reward $r_t$ for round $t$. Based on the newly collected action feature $x_t$ and reward $r_t$, the quantities in (2.2) can be updated and the learning process proceeds to the next round. Approximating the posterior distribution with a Gaussian distribution in this way is also called Laplace Thompson sampling (Chapelle & Li, 2011).

Note that sampling from $\mathcal{N}(\hat\theta_t, v_t^2 V_t^{-1})$ is usually implemented in practice as $\widetilde\theta_t = \hat\theta_t + v_t V_t^{-1/2} \zeta$, where $\zeta$ is sampled from $\mathcal{N}(0, I)$ and $v_t > 0$ is a scaling parameter. The computational complexity of calculating $V_t^{-1/2}$ is at least $O(d^3)$ with Cholesky decomposition, which is prohibitively high, especially for high-dimensional machine learning problems. On the other hand, the Gaussian distribution used in Thompson sampling might not be a good approximation of the posterior distribution for general bandit models with more complicated structures than linear contextual bandits.
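To make the cost comparison concrete, the following is a minimal NumPy sketch (ours, not the paper's implementation) of the LinTS/Laplace sampling step described above; the $O(d^3)$ Cholesky factorization of $V_t$ in each round is precisely the step that LMC-TS avoids.

```python
import numpy as np

def lints_sample(V, b, v_scale, rng):
    """One Laplace/LinTS posterior sample: theta_tilde ~ N(V^{-1} b, v_scale^2 * V^{-1})."""
    theta_hat = np.linalg.solve(V, b)        # ridge estimate V^{-1} b
    # Sampling with a general covariance needs a factorization:
    # if V = L L^T (Cholesky), then L^{-T} zeta has covariance V^{-1}.
    L = np.linalg.cholesky(V)                # O(d^3) per round
    zeta = rng.standard_normal(V.shape[0])
    return theta_hat + v_scale * np.linalg.solve(L.T, zeta)

def lints_choose_arm(X, V, b, v_scale, rng):
    theta_tilde = lints_sample(V, b, v_scale, rng)
    return int(np.argmax(X @ theta_tilde))   # arm with the largest sampled reward
```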
## 3. Langevin Monte Carlo Thompson Sampling

**Algorithm 1** Langevin Monte Carlo Thompson Sampling (LMC-TS)

1. **Input:** step sizes $\{\eta_t > 0\}_{t \ge 1}$, inverse temperature parameters $\{\beta_t\}_{t \ge 1}$, loss function $L_t(\theta)$, and reward model function $f(x, \theta)$. Set $\theta_{1,0} = 0$, $K_0 = 0$.
2. **for** $t = 1, 2, \ldots$ **do**
3. Set $\theta_{t,0} = \theta_{t-1, K_{t-1}}$
4. **for** $k = 1, \ldots, K_t$ **do**
5. Sample a standard normal vector $\epsilon_{t,k} \sim \mathcal{N}(0, I)$
6. $\theta_{t,k} = \theta_{t,k-1} - \eta_t \nabla L_t(\theta_{t,k-1}) + \sqrt{2\eta_t \beta_t^{-1}}\, \epsilon_{t,k}$
7. **end for**
8. Play arm $x_t = \arg\max_{x \in \mathcal{X}_t} f(x, \theta_{t,K_t})$ and observe reward $r_t$
9. **end for**

In this paper, we propose the Langevin Monte Carlo Thompson Sampling (LMC-TS) algorithm, which is presented in Algorithm 1. Unlike existing work that uses a Laplace approximation (Chapelle & Li, 2011; Agrawal & Goyal, 2013; Kveton et al., 2020; Zhang et al., 2021), which is essentially a Gaussian distribution, to approximate the unknown posterior distribution, we use Langevin Monte Carlo (Roberts & Tweedie, 1996; Bakry et al., 2014) to learn the exact posterior distribution of the parameter $\theta$ up to a high precision. One closely related work to ours is Mazumdar et al. (2020), which proposed to combine LMC and SGLD with Thompson sampling algorithms in finite-armed bandit problems without any contextual features. Their analysis heavily depends on their assumption on prior distributions and is hard to extend to contextual bandits (with potentially infinitely many arms), even for the simplest linear contextual bandits, as we discuss in further detail in the next section.

Specifically, Algorithm 1 works as follows. At the $t$-th round of the algorithm, we run the following subroutine for $K_t$ steps. For each $k = 1, \ldots, K_t$, we have
$$\theta_{t,k} = \theta_{t,k-1} - \eta_t \nabla L_t(\theta_{t,k-1}) + \sqrt{2\eta_t \beta_t^{-1}}\, \epsilon_{t,k}, \qquad (3.1)$$
where $\epsilon_{t,k}$ is an isotropic Gaussian random vector in $\mathbb{R}^d$, $\eta_t > 0$ is a step size parameter, $\beta_t$ is the inverse temperature parameter, and $L_t(\theta)$ is a user-specified loss function between the observed rewards $\{r_i\}_{i=1,\ldots,t-1}$ and the estimated rewards $\{f(x_i, \theta)\}_{i=1,\ldots,t-1}$. The update (3.1) is called the Langevin Monte Carlo (LMC) method in the approximate sampling literature (Roberts & Tweedie, 1996; Bakry et al., 2014; Dalalyan, 2017b;a), and it can be viewed as the Euler-Maruyama discretization of the following stochastic differential equation from physics, called Langevin dynamics (Langevin, 1908):
$$\mathrm{d}\theta(s) = -\nabla L_t\big(\theta(s)\big)\,\mathrm{d}s + \sqrt{2\beta_t^{-1}}\,\mathrm{d}B(s), \qquad (3.2)$$
where $s > 0$ is a continuous time index, $\beta_t > 0$ is the inverse temperature parameter, and $B(s) \in \mathbb{R}^d$ is a Brownian motion. It has been shown that, under certain conditions on the drift term $\nabla L(\theta(s))$, Langevin dynamics converges to a unique stationary distribution $\pi(\mathrm{d}x) \propto e^{-\beta L(x)}\,\mathrm{d}x$. Therefore, one can use (3.1) to approximately sample from a distribution $\pi_t \propto \exp(-\beta_t L_t(\theta))$.

Note that Algorithm 1 is applicable to various contextual bandit settings if we choose the corresponding reward model $f(x, \theta)$ and log-density function $L_t(\theta)$. Another advantage of our algorithm is that only noisy gradient descent updates are performed in order to carry out proper exploration in different bandit problems. Thus LMC-TS is both flexible in design and easy to implement in practice. In the next few subsections, we show how we can instantiate our algorithm for linear contextual bandits, generalized linear bandits, and neural contextual bandits respectively.

### 3.1. Implication to Linear Contextual Bandits

In linear contextual bandits, it is assumed that the reward generating function is $f(x, \theta^*) = x^\top \theta^*$ for all $x \in \mathcal{X}$. Define the following loss function
$$L_t(\theta) = \sum_{i=1}^{t-1} \big(x_i^\top \theta - r_i\big)^2 + \lambda \|\theta\|_2^2, \qquad (3.3)$$
where $\lambda > 0$ is a regularization parameter. Then the gradient of $L_t(\theta)$ is $\nabla L_t(\theta) = 2(V_t \theta - b_t)$, where $V_t$ and $b_t$ are defined in the same way as in (2.2).
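Below is a minimal NumPy sketch (ours) of Algorithm 1 instantiated for this linear case, reusing the `ContextualBanditEnv` sketch above; the inner loop is exactly the noisy gradient step (3.1) with $\nabla L_t(\theta) = 2(V_t\theta - b_t)$. The hyperparameters (`K`, `beta_inv`, `lam`) are illustrative placeholders rather than the paper's tuned settings, and the step size uses the conservative bound $1/(4\,\mathrm{tr}(V_t)) \le 1/(4\lambda_{\max}(V_t))$.

```python
import numpy as np

def lmc_ts_linear(env, T, K=100, beta_inv=0.01, lam=1.0, seed=0):
    """LMC-TS (Algorithm 1) for linear contextual bandits with the ridge loss (3.3)."""
    rng = np.random.default_rng(seed)
    d = env.dim
    V, b = lam * np.eye(d), np.zeros(d)     # quantities from (2.2)
    theta = np.zeros(d)                     # warm start carried across rounds
    rewards = []
    for t in range(1, T + 1):
        X = env.observe_action_set()
        eta_t = 1.0 / (4.0 * np.trace(V))   # conservative step size <= 1/(4*lambda_max(V_t))
        for _ in range(K):                  # inner loop: noisy gradient descent (LMC)
            grad = 2.0 * (V @ theta - b)    # closed-form gradient of the ridge loss
            theta = theta - eta_t * grad + np.sqrt(2.0 * eta_t * beta_inv) * rng.standard_normal(d)
        i = int(np.argmax(X @ theta))       # play the greedy arm under the sampled theta
        x, r = X[i], env.pull(X[i])
        V += np.outer(x, x)                 # update (2.2) with the new observation
        b += r * x
        rewards.append(r)
    return np.array(rewards)
```

Each inner step costs only a matrix-vector product, $O(d^2)$, in contrast with the $O(d^3)$ factorization in the Laplace sketch above.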
Based on the linear bandit model and the loss function chosen in (3.3), we can show that the inner loop of Algorithm 1 generates samples approximately from the Gaussian posterior distribution.

**Proposition 3.1.** If the epoch length $K_t$ in Algorithm 1 is sufficiently large, the distribution of $\theta_{t,K_t}$ converges, up to an arbitrary accuracy, to the Gaussian distribution $\mathcal{N}(V_t^{-1} b_t, \beta_t^{-1} V_t^{-1})$.

Note that LMC does not converge exactly to the posterior distribution, but instead converges to it up to an arbitrarily pre-chosen precision parameter for large enough $K_t$. In the next section, we will show that this is sufficient for the proposed bandit algorithm to achieve a sublinear regret.

*Proof.* According to Roberts & Tweedie (1996) and Bakry et al. (2014), the Markov chain generated by the Langevin dynamics (3.2) converges to a stationary distribution $\pi_t$ defined as $\pi_t(\theta) = Z^{-1} \exp(-\beta_t L_t(\theta))$, where $Z = \int \exp(-\beta_t L_t(\theta))\,\mathrm{d}\theta$ is the normalization term. In the inner loop of Algorithm 1, we apply the LMC update defined in (3.1), which is a discretization of (3.2), and thus obtain another Markov chain $\{\theta_{t,k}\}_{k=0,1,\ldots}$. A recent line of non-asymptotic analyses shows that the Markov chain $\{\theta_{t,k}\}_{k=0,1,\ldots}$ generated by LMC converges to $\pi_t$ up to an arbitrary accuracy for (strongly) convex $L_t$ (Dalalyan, 2017b) and nonconvex $L_t$ (Vempala & Wibisono, 2019) respectively, as long as the epoch length $K_t$ of Algorithm 1 is large enough. In what follows, we show that $\pi_t$ is the Gaussian distribution we need in linear contextual bandits. By the definition in (3.3), we have
$$L_t(\theta) = \sum_{i=1}^{t-1} \big(x_i^\top \theta - r_i\big)^2 + \lambda\|\theta\|_2^2 = \sum_{i=1}^{t-1} \Big(\theta^\top x_i x_i^\top \theta - 2 r_i x_i^\top \theta + r_i^2\Big) + \lambda\|\theta\|_2^2 = \theta^\top V_t \theta - 2\theta^\top b_t + \sum_{i=1}^{t-1} r_i^2,$$
where the last equality is due to (2.2). We denote $\hat\theta_t = \arg\min_\theta L_t(\theta)$, which is the solution of ridge regression (Hoerl & Kennard, 1970); it is easy to verify that the solution of the ridge regression problem has the form $\hat\theta_t = V_t^{-1} b_t$. Then we have
$$\big(\theta - \hat\theta_t\big)^\top V_t \big(\theta - \hat\theta_t\big) = \theta^\top V_t \theta - 2\theta^\top V_t \hat\theta_t + \hat\theta_t^\top V_t \hat\theta_t = \theta^\top V_t \theta - 2\theta^\top b_t + \hat\theta_t^\top V_t \hat\theta_t,$$
which immediately implies that $\pi_t(\theta) \propto \exp(-\beta_t L_t(\theta)) \propto \exp\big(-\beta_t (\theta - \hat\theta_t)^\top V_t (\theta - \hat\theta_t)\big)$. Therefore, we can conclude that the distribution of $\theta_{t,K_t}$ converges to the Gaussian distribution $\mathcal{N}(\hat\theta_t, \beta_t^{-1} V_t^{-1})$.

Although the update in Algorithm 1 is presented as a full gradient descent step plus an isotropic noise, one can also replace the full gradient $\nabla L_t(\theta_{t,k-1})$ with a stochastic gradient or a variance-reduced stochastic gradient of the loss function calculated from a mini-batch of data, which leads to the Stochastic Gradient Langevin Dynamics (SGLD) algorithm (Welling & Teh, 2011) and Stochastic Variance Reduced Gradient Langevin Dynamics (SVRG-LD) (Dubey et al., 2016; Xu et al., 2018). Results similar to Proposition 3.1 can then be obtained by following the proofs in Dalalyan & Karagulyan (2019); Xu et al. (2018); Zou et al. (2021).

We have now shown that our Algorithm 1 is similar to the Thompson sampling algorithm derived from the Laplace approximation of the posterior distribution in linear contextual bandits. Nevertheless, Algorithm 1 is preferable to Thompson sampling in practice, since each iteration of LMC-TS only requires the computation of first-order information, which is much more computationally efficient than computing the Cholesky decomposition for sampling from a general multivariate normal distribution in Thompson sampling when the feature dimension is high, as discussed in Section 2.
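As a quick sanity check of Proposition 3.1 (a sketch under our own toy setup, not an experiment from the paper), one can run the inner loop of Algorithm 1 on a fixed ridge loss and verify that the empirical mean of the iterates matches $V_t^{-1} b_t$ and that their covariance is proportional to $V_t^{-1}$; the exact scale of the covariance depends on the temperature convention, so the check below only compares normalized matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam, beta_inv = 5, 200, 1.0, 0.1
X = rng.normal(size=(n, d))
r = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
V = lam * np.eye(d) + X.T @ X
b = X.T @ r
theta_hat = np.linalg.solve(V, b)

eta = 1.0 / (4.0 * np.linalg.eigvalsh(V)[-1])     # step size 1/(4*lambda_max(V))
theta, samples = np.zeros(d), []
for k in range(50_000):                           # long chain; keep iterates after burn-in
    grad = 2.0 * (V @ theta - b)
    theta = theta - eta * grad + np.sqrt(2.0 * eta * beta_inv) * rng.standard_normal(d)
    if k > 5_000:
        samples.append(theta.copy())
samples = np.array(samples)

emp_cov, target = np.cov(samples.T), np.linalg.inv(V)
print(np.linalg.norm(samples.mean(axis=0) - theta_hat))                   # mean close to the ridge solution
print(np.linalg.norm(emp_cov / emp_cov[0, 0] - target / target[0, 0]))    # covariance proportional to V^{-1}
```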
### 3.2. Implication to Generalized Linear Bandits

In generalized linear bandits (GLB), the true reward $r$ for arm $x \in \mathcal{X}_t$ at round $t$ is assumed to come from a generalized linear model (GLM) (McCullagh & Nelder, 2019). Specifically, conditional on the feature vector $x$, $r$ follows an exponential family distribution with mean $\mu(x^\top \theta^*)$, where $\theta^*$ is an unknown weight parameter that is shared across all arms, and $\mu(\cdot)$ is called the link function. It is worth noting that generalized linear bandits cover a class of common bandit models used in practice. For instance, when $\mu(z) = z$ is the identity function, the model reduces to linear contextual bandits; when $\mu(z) = 1/(1 + e^{-z})$ is the sigmoid function, it reduces to logistic bandits (Dong et al., 2019).

Based on samples $\{x_1, r_1, \ldots, x_{t-1}, r_{t-1}\}$, the negative log-likelihood function is defined as $L_t(\theta) = \sum_{i=1}^{t-1}\big(m(x_i^\top \theta) - r_i x_i^\top \theta\big)$, where $m(z)$ is twice differentiable and $m'(z) = \mu(z)$ is the link function defined in the previous paragraph. Existing Thompson sampling based algorithms for generalized linear bandits (Filippi et al., 2010; Kveton et al., 2020; Ding et al., 2021) usually first solve the following MLE estimator
$$\hat\theta_t = \arg\max_\theta \sum_{i=1}^{t-1} \big(r_i x_i^\top \theta - m(x_i^\top \theta)\big), \qquad (3.4)$$
and then construct a Laplace approximation of the underlying posterior distribution, which is given by the Gaussian distribution $\mathcal{N}\big(\hat\theta_t, a^2 (\nabla^2 L_t(\hat\theta_t))^{-1}\big)$, where $a > 0$ is a scaling parameter. Similar to Thompson sampling for linear contextual bandits discussed in Section 2, sampling from $\mathcal{N}\big(\hat\theta_t, a^2 (\nabla^2 L_t(\hat\theta_t))^{-1}\big)$ is computationally inefficient in high-dimensional applications. Moreover, due to the existence of the link function, the posterior itself is not necessarily a Gaussian distribution, and thus the Laplace approximation may incur a fixed approximation error.

In contrast, our LMC-TS algorithm can be easily applied to GLB by choosing the reward model as $f(x, \theta) = \mu(x^\top \theta)$ and the loss function $L_t(\theta)$ as the following regularized negative log-likelihood function:
$$L_t(\theta) = \sum_{i=1}^{t-1} \big(m(x_i^\top \theta) - r_i x_i^\top \theta\big) + \lambda \|\theta\|_2^2, \qquad (3.5)$$
where $\lambda > 0$ is a tuning parameter. Under some standard conditions on the link function in GLB, it can be shown that the posterior density $\pi_t \propto \exp(-\beta_t L_t(\theta))$ is strongly log-concave and log-smooth. Similar to the proof of Proposition 3.1, by invoking results from the study of Langevin Monte Carlo (Dalalyan, 2017b), we can show that the distribution of the iterates $\theta_{t,K_t}$ in Algorithm 1 converges to the true posterior distribution $\pi_t$ if the epoch length $K_t$ of the inner loop is sufficiently large. In addition, due to the simplicity of our algorithm, we only need to perform gradient descent based updates, which is computationally more efficient than Laplace approximation based Thompson sampling for generalized linear bandits (Kveton et al., 2020).
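As an illustration (our own sketch, with illustrative names), for a logistic bandit the loss (3.5) uses $m(z) = \log(1 + e^z)$, so that $m'(z) = \mu(z)$ is the sigmoid, and its gradient plugs directly into the generic LMC update (3.1):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss_grad(theta, X_hist, r_hist, lam):
    """Gradient of the regularized negative log-likelihood (3.5) for a logistic bandit:
    m(z) = log(1 + exp(z)), so grad = sum_i (mu(x_i^T theta) - r_i) x_i + 2*lam*theta."""
    z = X_hist @ theta
    return X_hist.T @ (sigmoid(z) - r_hist) + 2.0 * lam * theta

def lmc_step(theta, grad_fn, eta, beta_inv, rng):
    """One noisy gradient step of the LMC update (3.1); reusable for any loss."""
    noise = np.sqrt(2.0 * eta * beta_inv) * rng.standard_normal(theta.shape[0])
    return theta - eta * grad_fn(theta) + noise
```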
### 3.3. Implication to Neural Contextual Bandits

Our algorithm is also applicable to more general contextual bandit problems, where the reward function $f(x, \theta^*)$ is a neural network with $x$ as its input and $\theta^*$ as the collection of all weight matrices. This type of bandit model is usually referred to as neural contextual bandits in the literature (Riquelme et al., 2018; Zhou et al., 2020; Zhang et al., 2021; Xu et al., 2022). One possible choice of the function $L_t(\theta)$ used in Algorithm 1 is the squared loss:
$$L_t(\theta) = \sum_{i=1}^{t-1} \big(f(x_i, \theta) - r_i\big)^2 + \lambda \|\theta\|_2^2, \qquad (3.6)$$
where $\lambda > 0$. Due to the flexibility in the choice of loss functions in our method, one can always choose another loss function $L_t(\theta)$, based on the belief about the prior and posterior distributions in a specific application, to boost the empirical performance. This makes our method directly applicable to complicated deep learning applications.
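For instance, with a small multi-layer perceptron as the reward model, each inner step of Algorithm 1 is just a gradient step on the squared loss (3.6) plus Gaussian noise on every parameter. The PyTorch sketch below is ours; the architecture and hyperparameters are illustrative, not the paper's settings, and with mini-batched gradients this update becomes the SGLD variant mentioned in Section 3.1.

```python
import torch
import torch.nn as nn

def make_reward_model(dim, width=50):
    return nn.Sequential(nn.Linear(dim, width), nn.LeakyReLU(),
                         nn.Linear(width, width), nn.LeakyReLU(),
                         nn.Linear(width, 1))

def lmc_update(model, X_hist, r_hist, eta, beta_inv, lam, n_steps=100):
    """Inner loop of Algorithm 1 with the squared loss (3.6) and noisy gradient steps.
    X_hist: float tensor (n, dim) of played contexts; r_hist: float tensor (n,) of rewards."""
    for _ in range(n_steps):
        model.zero_grad()
        pred = model(X_hist).squeeze(-1)
        loss = ((pred - r_hist) ** 2).sum() + lam * sum((p ** 2).sum() for p in model.parameters())
        loss.backward()
        with torch.no_grad():
            for p in model.parameters():
                noise = torch.randn_like(p) * (2.0 * eta * beta_inv) ** 0.5
                p.add_(-eta * p.grad + noise)   # noisy gradient descent step, as in (3.1)
    return model

def choose_arm(model, X_t):
    with torch.no_grad():
        return int(model(X_t).squeeze(-1).argmax())   # greedy arm under the sampled network
```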
## 4. Theoretical Analysis of LMC-TS for Linear Contextual Bandits

In this section, we provide the regret analysis of our proposed LMC-TS algorithm when we apply it to a specific contextual bandit problem, viz., the linear contextual bandit. We first state the assumption on the model.

**Assumption 4.1.** There is an unknown parameter $\theta^* \in \mathbb{R}^d$ such that for any arm $x \in \mathcal{X} \subseteq \mathbb{R}^d$, the reward is $r(x) = x^\top \theta^* + \xi$, where $\xi$ is assumed to be an $R$-sub-Gaussian random variable for some constant $R > 0$.

The following theorem states the regret bound of LMC-TS.

**Theorem 4.2.** Under Assumption 4.1, choose a linear reward model $f(x, \theta) = x^\top \theta$ in Algorithm 1 and let $\delta \in (0, 1)$. For any $j = 1, 2, \ldots$, let the step size be $\eta_j = 1/(4\lambda_{\max}(V_j))$, the epoch length be $K_j = \kappa_j \log\big(3R\sqrt{2dT\log(T^3/\delta)}\big)$, and the inverse temperature be $\beta_j^{-1} = 4\big(R\sqrt{d\log(T^3/\delta)} + 2\big)$, where $\kappa_j = \lambda_{\max}(V_j)/\lambda_{\min}(V_j)$ is the condition number of $V_j$. Then with probability $1 - \delta$, it holds that
$$R(T) \le C R d \log(1/\delta) \sqrt{d T \log^3\big(1 + T/(\lambda d)\big)},$$
where $C > 0$ is an absolute constant that is independent of the problem.

Note that the regret upper bound in Theorem 4.2 is of order $\widetilde{O}(d\sqrt{dT})$, which matches the best result for Thompson sampling based algorithms in linear contextual bandits with infinitely many arms (Agrawal & Goyal, 2013). Moreover, if the arm set is finite with $|\mathcal{X}_t| \le N$ in each round for some integer $N$, then following a proof similar to Agrawal & Goyal (2013), this regret bound can be improved to $\widetilde{O}(d\sqrt{T})$, where a logarithmic dependence on the number of arms $N$ replaces the additional $O(\sqrt{d})$ factor and is omitted in the $\widetilde{O}(\cdot)$ notation. This shows that LMC-TS is theoretically comparable to Laplace approximation based Thompson sampling in linear contextual bandits. Nevertheless, since we add multiple noises in Algorithm 1 at different time steps, the coupling of these noises makes the regret analysis of existing Laplace approximation Thompson sampling not directly applicable to our case. Specifically, it has been shown by Phan et al. (2019) that the approximation error caused by the posterior sampling step of Thompson sampling can yield a linear regret in general, and thus we have to develop nontrivial proof techniques to achieve a sublinear regret for LMC-TS.

**Remark 4.3.** The only existing work that studies the sublinear regret of TS with approximate sampling is Mazumdar et al. (2020). Compared with their work, which combines Langevin algorithms with Thompson sampling for multi-armed bandits, our analysis applies to a more general class of bandit problems that covers MAB as a special case. Moreover, we do not assume that the reward distribution of a single data point is strongly log-concave, which is hard to justify in contextual bandits. Specifically, they assume that the reward $r$ conditional on context $x$ has a distribution $p(r\,|\,x, \theta^*)$ which is strongly log-concave with respect to both $x$ and $\theta^*$. However, this assumption is not satisfied even for Gaussian rewards. For instance, if $r$ is a Gaussian reward with mean $x^\top \theta^*$ and unit variance, then $\log p(r\,|\,x, \theta^*)$ is not strongly concave with respect to $x \in \mathbb{R}^d$ when the dimension $d$ is larger than 1. In contrast, we only assume that the reward noise is sub-Gaussian, which is among the most common assumptions in linear contextual bandits (Abbasi-Yadkori et al., 2011; Agrawal & Goyal, 2013).

**Remark 4.4.** We note that the epoch length of the inner loop of Algorithm 1 depends on the condition number $\kappa_j = \lambda_{\max}(V_j)/\lambda_{\min}(V_j)$, which could be $O(j)$ in the worst case. This is due to the fact that the optimization loss function $L_t(\theta)$ is the sum of $t$ squared loss functions and we do not assume each loss function is strongly convex. Moreover, under certain assumptions on the diversity of the arm set, as studied in Hamidi & Bayati (2020); Wu et al. (2020), the condition number becomes $O(1)$. On the other hand, if we apply Newton's method to minimize the loss function $L_t(\theta)$ in each round, we can remove the dependence of the epoch length $K_j$ of the inner loop on the condition number $\kappa_j$. Lastly, we observe in our empirical study that LMC-TS often requires only a small number of iterations to achieve good performance.
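To make the parameter choices in Theorem 4.2 and Remark 4.4 concrete, the following sketch (ours; the expression for the epoch length is taken from our reading of the theorem above) computes the theory-suggested step size, epoch length, and condition number from the current Gram matrix $V_j$:

```python
import numpy as np

def theory_schedule(V, R, d, T, delta):
    """Step size, epoch length, and condition number suggested by Theorem 4.2 for round j."""
    eigvals = np.linalg.eigvalsh(V)
    lam_min, lam_max = eigvals[0], eigvals[-1]
    kappa = lam_max / lam_min                      # condition number of V_j (Remark 4.4)
    eta = 1.0 / (4.0 * lam_max)                    # eta_j = 1 / (4 * lambda_max(V_j))
    K = int(np.ceil(kappa * np.log(3.0 * R * np.sqrt(2.0 * d * T * np.log(T ** 3 / delta)))))
    return eta, K, kappa

# Example with a toy Gram matrix V_j built from 500 random contexts.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
V = np.eye(20) + X.T @ X
print(theory_schedule(V, R=0.5, d=20, T=10_000, delta=0.01))
```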
## 5. Empirical Evaluation of LMC-TS

In this section, we conduct experiments on both synthetic datasets and real-world datasets to show that the proposed algorithm achieves the best performance in terms of regret minimization and is also scalable to large-scale and high-dimensional problems. All experiments are conducted on Amazon EC2 P3 instances with NVIDIA V100 GPUs and Broadwell E5-2686 v4 processors. Our implementation can be found at https://github.com/devzhk/LMCTS.

**Benchmarks and baseline algorithms:** As discussed in Section 3, our LMC-TS algorithm is applicable to many different contextual bandit problems. Hence we compare it with baseline algorithms in different bandit settings, including linear contextual bandits, logistic bandits, quadratic bandits, and neural bandits (also known as deep bandits). For linear bandit problems, we compare our algorithm with baseline linear bandit algorithms such as LinUCB (Chu et al., 2011), LinTS (Agrawal & Goyal, 2013), and the $\epsilon$-greedy algorithm. For logistic bandit problems, we compare our algorithm with existing state-of-the-art algorithms for generalized linear bandits, including UCB-GLM (Li et al., 2017), GLM-TSL (Kveton et al., 2020), SGD-TS (Ding et al., 2021), and $\epsilon$-greedy. For quadratic bandits and neural bandits, we additionally compare our algorithm with NeuralUCB (Zhou et al., 2020), NeuralTS (Zhang et al., 2021), Neural-LinUCB (Xu et al., 2022), and Neural $\epsilon$-greedy (Riquelme et al., 2018), which applies $\epsilon$-greedy exploration to a neural network reward model trained by SGD.

### 5.1. Simulation Study on Linear, Logistic, and Quadratic Contextual Bandits

We first compare our algorithm with baseline methods on simulated bandit problems presented in Section 2, including linear bandits, logistic bandits, and quadratic bandits, where the true reward model $f(x, \theta^*)$ is known but the weight parameter $\theta^* \in \mathbb{R}^d$ is unknown. Throughout this subsection, the context feature dimension is $d = 20$, the size of the arm set at round $t$ is $|\mathcal{X}_t| = 50$, and the time horizon of all learning algorithms is $T = 10000$.

**Linear contextual bandits:** We first generate a linear contextual dataset following the problem setup in Section 2. To simulate the bandit environment, we first generate $\theta^* \in \mathbb{R}^d$ with each coordinate randomly sampled from $\mathcal{N}(0, 1)$ and then scale $\theta^*$ to unit norm. We consider the following two settings: (1) a fixed arm set $\mathcal{X} \subseteq \mathbb{R}^d$ that remains the same during the whole learning process; (2) a new arm set $\mathcal{X}_t \subseteq \mathbb{R}^d$ received at each round $t$. For the feature vectors, we generate $x \in \mathbb{R}^d$ with each coordinate randomly sampled from $\mathcal{N}(0, 1)$ and then scale each vector to unit norm. The true reward for an arm $x$ is then generated by $r(x) = \theta^{*\top} x + \xi$, where the noise $\xi$ is sampled from $\mathcal{N}(0, \sigma^2)$ with $\sigma^2 = 0.5$.

**Logistic bandits:** In the logistic bandit experiment, we follow the setting in Kveton et al. (2020) and consider the fixed-arm setting, where $\mathcal{X} \subseteq \mathbb{R}^d$ with context dimension $d = 20$ and arm set size $|\mathcal{X}| = 50$. Each contextual vector is randomly generated from $\mathcal{N}(0, I)$ and scaled to unit norm. The reward for arm $x \in \mathcal{X}$ is generated by a Bernoulli distribution, viz., $r(x) \sim \mathrm{Ber}\big(\mu(\theta^{*\top} x)\big)$, where $\theta^* \in \mathbb{R}^d$ is sampled from $\mathcal{N}(0, I)$ and scaled to unit norm, and $\mu(v) = 1/(1 + \exp(-v))$ is the logistic function.

**Quadratic bandits:** Following the setting in Zhou et al. (2020), we generate a quadratic contextual bandit problem with a changing arm set $\mathcal{X}_t \subseteq \mathbb{R}^d$ at each round. The context dimension is $d = 20$ and the size of the arm set is $|\mathcal{X}_t| = 50$ at each round. Each contextual vector $x \in \mathcal{X}_t$ is randomly generated from $\mathcal{N}(0, I)$ and scaled to unit norm. At round $t$, the reward for the chosen arm $x_t \in \mathcal{X}_t$ is given by $r_t = 10 (\theta^{*\top} x_t)^2 + \xi$, where $\theta^* \in \mathbb{R}^d$ is sampled from $\mathcal{N}(0, I)$ and scaled to unit norm, and the noise is $\xi \sim \mathcal{N}(0, 1)$.

**Implementation details:** For linear bandits, a linear reward model $f(x, \theta) = x^\top \theta$ is used in all algorithms. For LinUCB, we set the UCB bonus parameter as $\nu_t = c\sqrt{d \log t}$ following Li et al. (2010), and find the best parameter $c$ by performing a grid search. For LinTS, we set the variance parameter as $\nu = c\sqrt{d \log T}$ following the theory in Agrawal & Goyal (2013), and pick the best hyperparameter constant $c$ from a grid search. For $\epsilon$-greedy, the exploration rate at time $t$ is $c/t$, where $c$ is selected by a grid search. For LMC-TS, we set the step size $\eta_t = \eta_0/t$ as suggested by our theory and do a grid search for the constant $\eta_0$ and the temperature parameter $\beta^{-1}$. We fix the epoch length for the inner loop of our algorithm to $K_t = 100$ for all $t$.

For logistic bandits, a generalized linear reward model $f(x, \theta) = \mu(x^\top \theta)$ is used, where $\mu(v) = 1/(1 + \exp(-v))$. The MLE estimator in (3.4) is solved via SGD for UCB-GLM, GLM-TSL, and $\epsilon$-greedy. For UCB-GLM and GLM-TSL, we follow the same parameter settings proposed in Kveton et al. (2020). For LMC-TS, we use the loss function defined in (3.5), and the step size and temperature parameters are tuned in the same way as in linear bandits.

For quadratic bandits, a 4-layer fully-connected neural network with width 20 is used in NeuralUCB, NeuralTS, Neural-LinUCB, Neural $\epsilon$-greedy, and LMC-TS. We try both ReLU and LeakyReLU and pick the best activation function for each algorithm. Neural networks are all updated by 100 gradient descent steps every round. Following the original implementations of NeuralTS and NeuralUCB, the matrix inverse is approximated by the inverse of its diagonal, and we perform a grid search for the regularization parameter $\lambda$ and the variance parameter $\nu$ used in their work.

To make a fair comparison, we first perform grid searches for the parameters of all algorithms. We then fix the best hyperparameters, shuffle the order of the dataset, and repeat the experiments 10 times with different random seeds.
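The three simulated reward models described above can be summarized in a few lines; this is our own sketch of the data-generating processes ($d = 20$, unit-norm features and parameters), not the released code:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20

def unit(v):
    return v / np.linalg.norm(v)

theta_star = unit(rng.normal(size=d))

def linear_reward(x, sigma2=0.5):
    return theta_star @ x + rng.normal(scale=np.sqrt(sigma2))

def logistic_reward(x):
    p = 1.0 / (1.0 + np.exp(-theta_star @ x))
    return float(rng.random() < p)            # Bernoulli reward with mean mu(theta^T x)

def quadratic_reward(x):
    return 10.0 * (theta_star @ x) ** 2 + rng.normal()

def arm_set(n_arms=50):
    # A changing arm set of 50 unit-norm contexts per round.
    X = rng.normal(size=(n_arms, d))
    return X / np.linalg.norm(X, axis=1, keepdims=True)
```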
**Results:** We report the mean and the standard error of the cumulative regret of different algorithms over 10 runs on all simulated bandit problems in Figure 1. The results on linear contextual bandits are shown in Figure 1(a) with a fixed arm set and in Figure 1(b) with time-changing arm sets. Our method, LMC-TS, achieves the best performance in both settings; the performance gain in the changing-arm setting is slightly smaller since it is a more challenging problem. The results for logistic bandits are shown in Figure 1(c), where LMC-TS again outperforms the baseline methods. The results for quadratic bandits are shown in Figure 1(d). In this setting, LinUCB and LinTS work poorly due to their dependence on the linear bandit structure. All the other algorithms use a neural network to model the reward, and our method achieves significantly lower regret. Neural-LinUCB performs much worse than the other baseline methods, possibly due to its insufficient exploration in this setting.

*(Figure 1. Regret comparison on simulated bandit problems: (a) linear bandit (fixed arm set), (b) linear bandit (changing arm sets), (c) logistic bandit, (d) quadratic bandit; the x-axis is the time step. The mean and standard error are reported over 10 runs.)*

### 5.2. Real-World Datasets

In this subsection, we consider UCI machine learning datasets (Dua & Graff, 2019), including Shuttle, Magic Telescope, Mushroom, and Covertype, and a high-dimensional image dataset, CIFAR10 (Krizhevsky et al., 2009). The specifications of these datasets are summarized in Table 1.

Table 1. Specifications of real-world datasets used in this paper.

| | Shuttle | Magic Telescope | Mushroom | Covertype | CIFAR10 |
|---|---|---|---|---|---|
| Number of attributes | 9 | 10 | 22 | 54 | 3 x 32 x 32 |
| Number of arms | 7 | 2 | 2 | 7 | 10 |
| Dimension of context feature | 63 | 20 | 48 | 378 | 30,720 |
| Number of instances | 58,000 | 19,020 | 8,124 | 581,012 | 10,000 |

To use these $N$-class classification datasets for contextual bandit problems, we follow Riquelme et al. (2018); Kveton et al. (2020) to construct context vectors for different arms in the following way: given a data feature $x \in \mathbb{R}^d$, we transform it into $N$ contextual vectors $x^{(1)} = (x, 0, \ldots, 0), \ldots, x^{(N)} = (0, \ldots, 0, x) \in \mathbb{R}^{Nd}$. Only the arm $x^{(j)}$ where $j$ matches the correct class of this data point has reward 1, and all other arms have reward 0.
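A sketch of this construction (ours, with illustrative names): each data point $x \in \mathbb{R}^d$ of an $N$-class dataset yields $N$ block-sparse contexts in $\mathbb{R}^{Nd}$, and only the arm matching the true label pays reward 1.

```python
import numpy as np

def make_contexts(x, n_classes):
    """Turn one feature vector x in R^d into N arm contexts x^(1), ..., x^(N) in R^{N*d}."""
    d = x.shape[0]
    contexts = np.zeros((n_classes, n_classes * d))
    for j in range(n_classes):
        contexts[j, j * d:(j + 1) * d] = x   # x placed in the j-th block, zeros elsewhere
    return contexts

def rewards_for(label, n_classes):
    r = np.zeros(n_classes)
    r[label] = 1.0                           # only the arm matching the true class has reward 1
    return r

# Example: a Magic Telescope-like instance with 10 attributes and 2 classes.
x = np.random.default_rng(0).normal(size=10)
X_t = make_contexts(x, n_classes=2)          # shape (2, 20), matching the context dimension in Table 1
```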
**Implementation details:** For all methods using neural networks, we choose the best architecture between a two-layer neural network with width 100 and a four-layer neural network with width 50. The activation function is also selected by grid search over ReLU and LeakyReLU. All baseline algorithms update the neural network by 100 gradient descent steps every round. The parameters of all methods are tuned in the same way as in Section 5.1 for quadratic bandits.

**Results:** We plot the average regret with standard deviation over 10 repeats for all algorithms on the UCI datasets in Figure 2. It can be seen that our algorithm, LMC-TS, consistently outperforms the other neural network based algorithms, and the linear bandit based approaches perform the worst on all datasets.

*(Figure 2. Regret comparison on UCI datasets: Shuttle, Magic Telescope, Mushroom, and Covertype; the x-axis is the time step. The mean and standard error are reported over 10 runs.)*

The results on CIFAR10 are displayed in Figure 3. Note that the dimension of this dataset is much higher than those of the UCI datasets used in existing neural contextual bandit papers (Riquelme et al., 2018; Zhou et al., 2020; Zhang et al., 2021; Xu et al., 2022). We plot the cumulative regret in Figure 3(a) and the average regret in Figure 3(b), which show that LMC-TS achieves a sublinear cumulative regret and significantly outperforms all baseline methods except Neural-LinUCB, which converges almost as fast as LMC-TS. This is due to the shallow exploration of Neural-LinUCB, which only performs UCB exploration on the last-layer parameters of the neural network. However, this also leads to a higher asymptotic regret than LMC-TS, which explores over the whole parameter space, according to Figure 3(b). This shows the potential of our algorithm in high-dimensional bandit problems that use deep neural networks.

*(Figure 3. Regret comparison on CIFAR10: (a) cumulative regret, (b) averaged regret; the x-axis is the time step. The mean and standard error are reported over 5 runs.)*

We also compare the runtime of our algorithm with NeuralTS, which is a Laplace approximation based Thompson sampling algorithm, to demonstrate the computational efficiency of LMC-TS. The runtime for running 1000 rounds is shown in Table 2. We present both the total runtime of these rounds and the time spent only on arm selection. It can be seen that LMC-TS is more computationally efficient than NeuralTS, which is due to the fact that we do not need to calculate the inverse of a high-dimensional matrix or sample from a non-isotropic Gaussian distribution.

Table 2. Runtime (seconds) on the CIFAR10 dataset for running 1000 rounds.

| | Time for arm selection | Total time |
|---|---|---|
| NeuralTS | 174.1 | 776.0 |
| LMC-TS | 9.0 | 662.7 |

## 6. Conclusions

In this paper, we proposed the Langevin Monte Carlo Thompson Sampling (LMC-TS) algorithm, which uses Langevin Monte Carlo to approximately sample the model parameter in contextual bandit problems from the posterior distribution. Unlike existing Thompson sampling based bandit algorithms that need to construct a Laplace approximation of the posterior, which has a fixed approximation error, our method can directly sample from the posterior distribution with an arbitrarily small approximation error. Moreover, our algorithm only requires performing noisy gradient descent updates, which is more computationally efficient than sampling from a high-dimensional non-isotropic Gaussian distribution. It is also worth noting that the proposed LMC-TS algorithm works for a large class of contextual bandit problems where the sampling distribution may not necessarily be a conjugate posterior and the reward models are general functions.
As a sanity check, we proved that LMC-TS enjoys the same regret upper bound as best Thompson sampling algorithms for linear contextual bandits, which theoretically demonstrates the competitiveness of LMC-TS in terms of regret minimization. Due to the efficiency and flexibility of our method, it would be an interesting and promising future direction to further investigate its theoretical performance in more general contextual bandits including generalized linear bandits and neural bandits. We also conducted numerical experiments on both synthetic datasets and real-world datasets to empirically evaluate the performance of our method. We compared our method LMC-TS with various baseline algorithms in linear contextual bandits, generalized linear bandits, and neural contextual bandits. We observed that our method consistently outperforms existing baseline algorithms in these settings respectively. Moreover, our method is much more scalable to high dimensional problems where deep neural networks are used owing to its simplicity and efficiency. Acknowledgements The authors would like to thank the anonymous reviewers for their invaluable comments. PX is supported by PIMCO Postdoctoral Fellowship. AA is partially supported by Bren Named Chair Professorship at Caltech. Abbasi-Yadkori, Y., P al, D., and Szepesv ari, C. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24:2312 2320, 2011. Abramowitz, M. and Stegun, I. A. Handbook of mathematical functions with formulas, graphs, and mathematical tables, volume 55. US Government printing office, 1964. Agrawal, S. and Goyal, N. Analysis of thompson sampling for the multi-armed bandit problem. In Conference on learning theory, pp. 39 1. JMLR Workshop and Conference Proceedings, 2012. Agrawal, S. and Goyal, N. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pp. 127 135. PMLR, 2013. Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235 256, 2002. Bakry, D., Gentil, I., Ledoux, M., et al. Analysis and geometry of Markov diffusion operators, volume 103. Springer, 2014. Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Machine Learning, 5(1):1 122, 2012. Chapelle, O. and Li, L. An empirical evaluation of thompson sampling. Advances in neural information processing systems, 24:2249 2257, 2011. Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 208 214. JMLR Workshop and Conference Proceedings, 2011. Dalalyan, A. Further and stronger analogy between sampling and optimization: Langevin monte carlo and gradient descent. In Conference on Learning Theory, pp. 678 689. PMLR, 2017a. Dalalyan, A. S. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651 676, 2017b. Langevin Monte Carlo for Contextual Bandits Dalalyan, A. S. and Karagulyan, A. User-friendly guarantees for the langevin monte carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12): 5278 5311, 2019. Ding, Q., Hsieh, C.-J., and Sharpnack, J. An efficient algorithm for generalized linear bandit: Online stochastic gradient descent and thompson sampling. 
In International Conference on Artificial Intelligence and Statistics, pp. 1585 1593. PMLR, 2021. Dong, S., Ma, T., and Van Roy, B. On the performance of thompson sampling on logistic bandits. In Conference on Learning Theory, pp. 1158 1160. PMLR, 2019. Dua, D. and Graff, C. UCI machine learning repository, 2019. URL http://archive.ics.uci.edu/ml. Dubey, K. A., J Reddi, S., Williamson, S. A., Poczos, B., Smola, A. J., and Xing, E. P. Variance reduction in stochastic gradient langevin dynamics. Advances in neural information processing systems, 29, 2016. Filippi, S., Cappe, O., Garivier, A., and Szepesv ari, C. Parametric bandits: The generalized linear case. In NIPS, volume 23, pp. 586 594, 2010. Hamidi, N. and Bayati, M. On worst-case regret of linear thompson sampling. ar Xiv preprint ar Xiv:2006.06790, 2020. Hoerl, A. E. and Kennard, R. W. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55 67, 1970. Jin, T., Xu, P., Shi, J., Xiao, X., and Gu, Q. Mots: Minimax optimal thompson sampling. In International Conference on Machine Learning, pp. 5074 5083. PMLR, 2021. Kaufmann, E., Korda, N., and Munos, R. Thompson sampling: An asymptotically optimal finite-time analysis. In International conference on algorithmic learning theory, pp. 199 213. Springer, 2012. Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009. Kveton, B., Zaheer, M., Szepesvari, C., Li, L., Ghavamzadeh, M., and Boutilier, C. Randomized exploration in generalized linear bandits. In International Conference on Artificial Intelligence and Statistics, pp. 2066 2076. PMLR, 2020. Lale, S., Azizzadenesheli, K., Anandkumar, A., and Hassibi, B. Stochastic linear bandits with hidden low rank structure. ar Xiv preprint ar Xiv:1901.09490, 2019. Langevin, P. On the theory of brownian motion. CR Acad. Sci. Paris, 146:530 533, 1908. Langford, J. and Zhang, T. The epoch-greedy algorithm for contextual multi-armed bandits. Advances in neural information processing systems, 20(1):96 1, 2007. Lattimore, T. and Szepesv ari, C. Bandit algorithms. Cambridge University Press, 2020. Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp. 661 670, 2010. Li, L., Lu, Y., and Zhou, D. Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning, pp. 2071 2080. PMLR, 2017. Mazumdar, E., Pacchiano, A., Ma, Y., Jordan, M., and Bartlett, P. On approximate thompson sampling with Langevin algorithms. In Proceedings of the 37th International Conference on Machine Learning, pp. 6797 6807. PMLR, 2020. URL https://proceedings.mlr. press/v119/mazumdar20a.html. Mc Cullagh, P. and Nelder, J. A. Generalized linear models. Routledge, 2019. M enard, P. and Garivier, A. A minimax and asymptotically optimal algorithm for stochastic bandits. In International Conference on Algorithmic Learning Theory, pp. 223 237. PMLR, 2017. Phan, M., Abbasi Yadkori, Y., and Domke, J. Thompson sampling and approximate inference. Advances in Neural Information Processing Systems, 32:8804 8813, 2019. Riquelme, C., Tucker, G., and Snoek, J. Deep bayesian bandits showdown: An empirical comparison of bayesian deep networks for thompson sampling. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum? id=Sy Ye6k-CW. Roberts, G. O. and Tweedie, R. L. 
Exponential convergence of langevin distributions and their discrete approximations. Bernoulli, pp. 341 363, 1996. Russo, D. and Van Roy, B. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39 (4):1221 1243, 2014. Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., and Wen, Z. A tutorial on thompson sampling. Foundations and Trends in Machine Learning, 11(1):1 96, 2018. Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285 294, 1933. Langevin Monte Carlo for Contextual Bandits Vempala, S. and Wibisono, A. Rapid convergence of the unadjusted langevin algorithm: Isoperimetry suffices. Advances in neural information processing systems, 32, 2019. Vershynin, R. Introduction to the non-asymptotic analysis of random matrices. ar Xiv preprint ar Xiv:1011.3027, 2010. Wang, Z. and Zhou, M. Thompson sampling via local uncertainty. In International Conference on Machine Learning, pp. 10115 10125. PMLR, 2020. Welling, M. and Teh, Y. W. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 681 688. Citeseer, 2011. Wu, W., Yang, J., and Shen, C. Stochastic linear contextual bandits with diverse contexts. In International Conference on Artificial Intelligence and Statistics, pp. 2392 2401. PMLR, 2020. Xu, P., Chen, J., Zou, D., and Gu, Q. Global convergence of langevin dynamics based algorithms for nonconvex optimization. Advances in Neural Information Processing Systems, 31, 2018. Xu, P., Wen, Z., Zhao, H., and Gu, Q. Neural contextual bandits with deep representation and shallow exploration. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum? id=xn YACQqua GV. Zhang, W., Zhou, D., Li, L., and Gu, Q. Neural thompson sampling. In International Conference on Learning Representations, 2021. URL https://openreview. net/forum?id=tk Ato Zkc Unm. Zhou, D., Li, L., and Gu, Q. Neural contextual bandits with ucb-based exploration. In International Conference on Machine Learning, pp. 11492 11502. PMLR, 2020. Zou, D., Xu, P., and Gu, Q. Faster convergence of stochastic gradient langevin dynamics for non-log-concave sampling. In Uncertainty in Artificial Intelligence, pp. 1152 1162. PMLR, 2021. Langevin Monte Carlo for Contextual Bandits A. Proof of the Regret Bound for Linear Contextual Bandits In this section, we study the regret of Algorithm 1. Before we analyze the regret of the proposed algorithm, define x t = argmaxx Xt x θ to be the optimal arm at time step t. Define t(x) = x t θ x θ to be the gap between the expected reward of the best arm and an arbitrary arm x Xt at time t. We follow a similar proof in Agrawal & Goyal (2013) based on the notion of saturated and unsaturated arm sets, which are defined as follows respectively Saturated arms: St = {x Xt : t(x) > gt(x)}, (A.1) Unsaturated arms: Ut = {x Xt : t(x) gt(x)}, (A.2) where gt(x) is a function of the time index t, arm x, and historical data up to time t, which will be determined in later proofs. Intuitively, saturated arms are already well estimated and the suboptimal gap of these arms are large enough such that they can be easily distinguished from the best arm. Define the filtration Ft = {x1, r1, . . . , xt}. The following proposition implies that the distribution of the iterate θt,Kt of Algorithm 1 is a Gaussian distribution. Proposition A.1. 
Under filtration Ft 1, the parameter θt,Kt used in the decision at round t of Algorithm 1 follows a Gaussian distribution N(µt,Kt, Σt,Kt), where the mean vector and the covariance matrix are defined as µt,Kt = AKt t . . . Ak1 1 θ1,0 + i=1 AKt t . . . Aki+1 i+1 I AKt i bθi, (A.3) 1 βi AKt t . . . Aki+1 i+1 I A2Ki i Vi(I + Ai) 1Aki+1 i+1 . . . AKt t , (A.4) where Ai = I 2ηi Vi for i = 1, 2, . . .. It can be seen that the variance term depends more on the recent noise than those noises added in the history. The following lemma upper bounds the distance between the solution bθt of the ridge regression in (3.3) and the true weight parameter θ of the bandit model. Lemma A.2 (Agrawal & Goyal (2013)). Let δ (0, 1). For any x Rd, it holds that P x (bθt θ ) R p d log(t3/δ) + 1 x V 1 t 1 δ For the simplicity of notations, we denote g R(t) = R p d log(t3/δ) + 1 in the rest of the paper. This can be bounded using the same bound of the least square solution in Agrawal & Goyal (2013). Lemma A.3. Let R > 0. For a R-sub Gaussian random variable ξ, it holds that P(|ξ| > x) exp(1 x2/R2). In other words, P(|ξ| > R 1 + 2 log t) 1/t2 for any t > 1. Define the following event ER,t = |x bθt x θ | g R(t) x V 1 t , x Xt {|ξt| R p 1 + 2 log t}. (A.6) Then it holds that P(ER,t) 1 2/t2. We define the event ES,t as follows, which provides an upper bound of the distance between the iterate θt,k and the mode bθt of density πt exp( βt Lt(θ)). |x θt,k x bθt| The following lemma shows that ES,t happens with high probability. Lemma A.4. Under event ER,t, for any x Xt, with probability at least 1 1/t2 we have P(ES,t) 1 1/T 2. Langevin Monte Carlo for Contextual Bandits Lemma A.5 (Optimistic estimation). Based on the parameter estimation at time step t, the best arm is over estimated. In particular, let Kj κj log(3R p 2d T log(T 3/δ)) and 1/βj = 4(R p d log(T 3/δ) + 2) for all j [T]. Then conditional on event ER,t, we have P x t θt,k > x t θ 1 2 2eπ . (A.8) Lemma A.6. The probability of playing unsaturated arms is at least a constant. That is, P(xt Ut|ER,t) 1 2 The following lemma is a combination of Lemma 10 and Lemma 11 in Abbasi-Yadkori et al. (2011), which is standard in the analysis of linear contextual bandits. Lemma A.7. Let {xt} t=1 be a sequence in Rd and λ > 0. Suppose xt 2 1 and λ 1. Define Vt = λI + Pt s=1 xtx t . Then we have det(Vt) (λ + t/d)d, and t=1 xt 2 V 1 t 1 2 log det(VT ) det(λI) 2d log(1 + T/(λd)). Recall the definition of in (A.1) and (A.2), where gt(x) was not immediately specified. Now based on the knowledge of Lemmas A.2 and A.4, let us choose gt(x) as follows: gt(x) = R p d log(t3/δ) + 10Rd p log T log(T 3/δ) + 5/2 xt V 1 t . (A.9) Proof of Theorem 4.2. By definition, the regret of Algorithm 1 is R(T) = PT t=1 t(xt). At each time step, define xt = argminx Ut gt(x) to be the arm in the unsaturated arm set that has the smallest value x V 1 t . Then conditional on events ER,t and ES,t, we have t(xt) = x t θ x t θ = x t θ x t θ + x t θ x t θ gt( xt) + x t θt,Kt + gt( xt) x t θt,Kt gt(xt) 2gt( xt) + gt(xt), (A.10) where the first inequality is due to xt Ut, Lemma A.2 and Lemma A.4 respectively, and the last inequality is due to the choice of xt in our algorithm. We denote p = 1/(2 2eπ). Note that gt(x) > 0. We have E[gt(xt)|Ft, ER,t] = E[gt(xt)|Ft, ER,t, xt Ut]P(xt Ut) + E[gt(xt)|Ft, ER,t, xt St]P(xt St) (p 1/t2)gt( xt), (A.11) where the inequality holds due to the definition of xt and Lemma A.6. 
Therefore, we have E[ t(xt)|Ft] = E[ t(xt)|Ft, ER,t]P(ER,t) + E[ t(xt)|Ft, Ec R(t)]P(Ec R(t)) E[2gt( xt) + gt(xt)|Ft, ER,t]P(ER,t) + P(Ec R(t)) 2 p 1/t2 + 1 E[gt(xt)|Ft, ER,t] + δ p E[gt(xt)|Ft, ER,t] + δ t2 , (A.12) where we assume t(x) 1 for any x X since both x and θ are bounded, and c > 0 is a constant. Define Yt = Pt s=1( s(xs) c/pgs(xs) δ/s2) and Y0 = 0. Then we obtain E[Yt Yt 1|Ft] 0, which implies that {Yt}t=0,1,... is a super martingale, corresponding a filtration Ft. Note that |Yt Yt 1| = | t(xt) c/pgt(xt) δ/t2| 3c/pgt(xt). (A.13) Langevin Monte Carlo for Contextual Bandits Let ϵ2 = 2 log(1/δ) PT t=1 9(c/p)2gt(xt)2. By Azuma-Hoeffding inequality in Lemma C.3, we know with probability at least 1 δ it holds that t=1 ( t(xt) c/pgt(xt) δ/t2) v u u t2 log(1/δ) t=1 9(c/p)2gt(xt)2, which immediately implies with probability at least 1 δ that R(T) (1 + 3 p 2 log(1/δ))c/p t=1 gt(xt) + Recall the definition of gt(xt) in (A.9), we further have with probability at least 1 δ it holds R(T) (1 + 3 p 2 log(1/δ))c/p d log(t3/δ) + 10Rd p log T log(T 3/δ) + 5/2 xt V 1 t + π2δ d T log T log(T 3/δ) log(1/δ) log(1 + T/(λd)) C0Rd log(1/δ) q d T log3(1 + T/(λd)), where in the first inequality we used the fact that P t=1 1/t2 = π2/6, the second inequality is due to Lemmas A.5 and A.7 and Cauchy inequality, and C0 is a constant independent of the problem. B. Proof of Technical Lemmas B.1. Proof of Proposition A.1 Proof of Proposition A.1. By the updating rule of θt,Kt in Algorithm 1, we have θt,Kt = θt,Kt 1 2ηt(Vtθt,Kt 1 bt) + q 2ηtβ 1 t ϵt,Kt = (I 2ηt Vt)θt,Kt 1 + 2ηtbt + q 2ηtβ 1 t ϵt,Kt = (I 2ηt Vt)Ktθt,0 + l=0 (I 2ηt Vt)l 2ηtbt + q 2ηtβ 1 t ϵt,k l = (I 2ηt Vt)Ktθt,0 + 2ηt l=0 (I 2ηt Vt)lbt + q l=0 (I 2ηt Vt)lϵt,Kt l. (B.1) Recall that in the inner loop of Algorithm 1, we use a warm start from previous iteration, namely θt,0 = θt 1,kt 1. To simplify the notation, we denote Ai = I 2ηi Vi for i = 1, 2, . . .. Note that Ai is symmetric and satisfies I I 2ηi Vi 0 if the step size is chosen such that 0 < ηi < 1/(2λmax(Vi)). Therefore, we further have θt,Kt = AKt t θt 1,kt 1 + I AKt t bθt + q l=0 Al tϵt,Kt l = AKt t . . . Ak1 1 θ1,0 + i=1 AKt t . . . Aki+1 i+1 I AKt i bθi + βi AKt t . . . Aki+1 i+1 l=0 Al iϵi,Ki l where in the first equality we used I + A + . . . + An 1 = (I An)(I A) 1 and V 1 t bt = bθt. Conditional on Ft 1 and the initialization θ1,0, we know that θt,Kt follows the Gaussian distribution N(µt,Kt, Σt,Kt). Based on the property of multivariate Gaussian distribution, if ϵ N(0, Id d), then we have Aϵ + µ N(µ, AA ) for any A Rd d and µ Rd. Then the mean vector is defined as µt,Kt = AKt t . . . Ak1 1 θ1,0 + i=1 AKt t . . . Aki+1 i+1 I AKt i bθi. (B.2) Langevin Monte Carlo for Contextual Bandits Similarly, the covariance matrix is defined as βi AKt t . . . Aki+1 i+1 l=0 A2l i Aki+1 i+1 . . . AKt t βi AKt t . . . Aki+1 i+1 I A2Ki i (I A2 i ) 1Aki+1 i+1 . . . AKt t (B.3) 1 βi AKt t . . . Aki+1 i+1 I A2Ki i Vi(I + Ai) 1Aki+1 i+1 . . . AKt t , (B.4) where we used the fact that I+A+A2 +. . .+An 1 = (I An)(I A) 1 = (I A) 1(I An) if A is symmetric. B.2. Proof of Lemma A.3 Proof of Lemma A.3. The first statement can be proved by standard concentration techniques (Vershynin, 2010). Note that P(Ec R,t) = P {|x bθt x θ | g R(t) x V 1 t , x Xt} {|ξt| R p 1 + 2 log t} P {|x bθt x θ | g R(t) x V 1 t , x Xt} + P {|ξt| R p 1 + 2 log t} . Substituting the first statement and the result in Lemma A.2 into the above inequality, we know that P(ER,t) 1 2/t2. B.3. Proof of Lemma A.4 Proof of Lemma A.4. 
B.3. Proof of Lemma A.4

Proof of Lemma A.4. For any $x\in\mathbb{R}^d$, we can decompose $x^\top(\theta_{t,k}-\widehat\theta_t)$ as follows:
$$x^\top(\theta_{t,k}-\widehat\theta_t)=x^\top(\theta_{t,k}-\mu_{t,k})+x^\top(\mu_{t,k}-\widehat\theta_t).\tag{B.5}$$
Bounding the term $x^\top(\theta_{t,k}-\mu_{t,k})$: we have
$$\big|x^\top\theta_{t,k}-x^\top\mu_{t,k}\big|\le\big\|x^\top\Sigma_{t,K_t}^{1/2}\big\|_2\,\big\|\Sigma_{t,K_t}^{-1/2}(\theta_{t,k}-\mu_{t,k})\big\|_2.$$
According to Proposition A.1, $\Sigma_{t,K_t}^{-1/2}(\theta_{t,k}-\mu_{t,k})$ is a standard Gaussian random vector in $\mathbb{R}^d$. Therefore, we have
$$\mathbb{P}\Big(\big\|\Sigma_{t,K_t}^{-1/2}(\theta_{t,k}-\mu_{t,k})\big\|_2\le 2\sqrt{d\log t}\Big)\ge 1-\frac{1}{t^2}.\tag{B.6}$$
Note that when we choose $\eta_i\le 1/(4\lambda_{\max}(V_i))$ for all $i$, we have
$$\tfrac{1}{2}I\preceq A_i=I-2\eta_iV_i\preceq\big(1-2\eta_i\lambda_{\min}(V_i)\big)I,\qquad \tfrac{3}{2}I\preceq I+A_i=2I-2\eta_iV_i\preceq 2I.\tag{B.7}$$
By definition $A_i=I-2\eta_iV_i$ and $V_i$ is symmetric. Therefore, $A_i$ and $V_i^{-1}$ commute, which implies
$$A_i^{2K_i}V_i^{-1}=(I-2\eta_iV_i)\cdots(I-2\eta_iV_i)(I-2\eta_iV_i)V_i^{-1}=(I-2\eta_iV_i)\cdots(I-2\eta_iV_i)V_i^{-1}(I-2\eta_iV_i)=A_i^{K_i}V_i^{-1}A_i^{K_i}.\tag{B.8}$$
Recalling the definition of $\Sigma_{t,K_t}$ in Proposition A.1, we have
$$\begin{aligned}x^\top\Sigma_{t,K_t}x&=\sum_{i=1}^{t}\frac{1}{\beta_i}\,x^\top A_t^{K_t}\cdots A_{i+1}^{K_{i+1}}\big(I-A_i^{2K_i}\big)V_i^{-1}(I+A_i)^{-1}A_{i+1}^{K_{i+1}}\cdots A_t^{K_t}x\\&\le\sum_{i=1}^{t}\frac{2}{3\beta_i}\,x^\top A_t^{K_t}\cdots A_{i+1}^{K_{i+1}}\big(V_i^{-1}-A_i^{K_i}V_i^{-1}A_i^{K_i}\big)A_{i+1}^{K_{i+1}}\cdots A_t^{K_t}x\\&=\frac{2}{3\beta_T}\sum_{i=1}^{t-1}x^\top A_t^{K_t}\cdots A_{i+1}^{K_{i+1}}\big(V_i^{-1}-V_{i+1}^{-1}\big)A_{i+1}^{K_{i+1}}\cdots A_t^{K_t}x\\&\qquad-\frac{2}{3\beta_T}\,x^\top A_t^{K_t}\cdots A_1^{K_1}V_1^{-1}A_1^{K_1}\cdots A_t^{K_t}x+\frac{2}{3\beta_T}\,x^\top V_t^{-1}x,\end{aligned}\tag{B.9}$$
where the first inequality is due to (B.7) and (B.8), and the last equality is due to the choice of $1/\beta_i=1/\beta_T$ for all $i$ and telescoping. By the definition in (2.2) and the Sherman-Morrison formula, we have
$$V_i^{-1}-V_{i+1}^{-1}=V_i^{-1}-\big(V_i+x_ix_i^\top\big)^{-1}=\frac{V_i^{-1}x_ix_i^\top V_i^{-1}}{1+\|x_i\|_{V_i^{-1}}^2},\tag{B.10}$$
which immediately implies
$$\begin{aligned}x^\top A_t^{K_t}\cdots A_{i+1}^{K_{i+1}}\big(V_i^{-1}-V_{i+1}^{-1}\big)A_{i+1}^{K_{i+1}}\cdots A_t^{K_t}x&=\frac{x^\top A_t^{K_t}\cdots A_{i+1}^{K_{i+1}}V_i^{-1}x_ix_i^\top V_i^{-1}A_{i+1}^{K_{i+1}}\cdots A_t^{K_t}x}{1+\|x_i\|_{V_i^{-1}}^2}\\&\le\big(x^\top A_t^{K_t}\cdots A_{i+1}^{K_{i+1}}V_i^{-1}x_i\big)^2\\&\le\big\|V_i^{-1/2}A_{i+1}^{K_{i+1}}\cdots A_t^{K_t}x\big\|_2^2\,\big\|V_i^{-1/2}x_i\big\|_2^2\\&\le\prod_{j=i+1}^{t}\big(1-2\eta_j\lambda_{\min}(V_j)\big)^{2K_j}\|x_i\|_{V_i^{-1}}^2\|x\|_{V_i^{-1}}^2,\tag{B.11}\end{aligned}$$
where we used $0<1/(1+\|x_i\|_{V_i^{-1}}^2)\le 1$ and the last inequality is due to (B.7). Therefore, we have
$$x^\top\Sigma_{t,K_t}x\le\frac{2}{3\beta_T}\|x\|_{V_t^{-1}}^2+\frac{2}{3\beta_T}\sum_{i=1}^{t-1}\prod_{j=i+1}^{t}\big(1-2\eta_j\lambda_{\min}(V_j)\big)^{2K_j}\|x_i\|_{V_i^{-1}}^2\|x\|_{V_i^{-1}}^2,\tag{B.12}$$
which immediately implies
$$\|x\|_{\Sigma_{t,K_t}}\le\sqrt{\frac{2}{3\beta_T}}\,\|x\|_{V_t^{-1}}+\sqrt{\frac{2}{3\beta_T}}\sum_{i=1}^{t-1}\prod_{j=i+1}^{t}\big(1-2\eta_j\lambda_{\min}(V_j)\big)^{K_j}\|x_i\|_{V_i^{-1}}\|x\|_{V_i^{-1}}.$$
Therefore, it holds that
$$\mathbb{P}\Big(\big|x^\top\theta_{t,k}-x^\top\mu_{t,K_t}\big|\ge 2\sqrt{d\log t}\,\|x\|_{\Sigma_{t,K_t}}\Big)\le\mathbb{P}\Big(\big\|x^\top\Sigma_{t,K_t}^{1/2}\big\|_2\,\big\|\Sigma_{t,K_t}^{-1/2}(\theta_{t,k}-\mu_{t,k})\big\|_2\ge 2\sqrt{d\log t}\,\|x\|_{\Sigma_{t,K_t}}\Big)\le\frac{1}{t^2},\tag{B.13}$$
where the last inequality follows from (B.6).
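Before turning to the second term in (B.5), note that the rank-one update identity invoked in (B.10) can be verified numerically. The snippet below is an illustrative aside with arbitrary data (not part of the proof): it checks that $V^{-1}-(V+xx^\top)^{-1}=V^{-1}xx^\top V^{-1}/(1+\|x\|_{V^{-1}}^2)$.

```python
import numpy as np

# Numerical check of the Sherman-Morrison step in (B.10):
# V^{-1} - (V + x x^T)^{-1} = V^{-1} x x^T V^{-1} / (1 + ||x||^2_{V^{-1}}).
# The matrix and vector below are arbitrary illustrative data.
rng = np.random.default_rng(2)
d = 6
B = rng.normal(size=(d, d))
V = np.eye(d) + B @ B.T          # a positive definite "design" matrix
x = rng.normal(size=d)

V_inv = np.linalg.inv(V)
lhs = V_inv - np.linalg.inv(V + np.outer(x, x))
rhs = (V_inv @ np.outer(x, x) @ V_inv) / (1 + x @ V_inv @ x)

print(np.max(np.abs(lhs - rhs)))  # ~ 1e-15, i.e., the identity holds
```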
Bounding the term $x^\top(\mu_{t,k}-\widehat\theta_t)$: recalling $\mu_{t,K_t}$ defined in (B.2), we have
$$\begin{aligned}\mu_{t,K_t}&=A_t^{K_t}\cdots A_1^{K_1}\theta_{1,0}+\sum_{i=1}^{t}A_t^{K_t}\cdots A_{i+1}^{K_{i+1}}\big(I-A_i^{K_i}\big)\widehat\theta_i\\&=A_t^{K_t}\cdots A_1^{K_1}\theta_{1,0}+\sum_{i=1}^{t-1}A_t^{K_t}\cdots A_{i+1}^{K_{i+1}}\big(\widehat\theta_i-\widehat\theta_{i+1}\big)-A_t^{K_t}\cdots A_1^{K_1}\widehat\theta_1+\widehat\theta_t\\&=A_t^{K_t}\cdots A_1^{K_1}\big(\theta_{1,0}-\widehat\theta_1\big)+\sum_{i=1}^{t-1}A_t^{K_t}\cdots A_{i+1}^{K_{i+1}}\big(\widehat\theta_i-\widehat\theta_{i+1}\big)+\widehat\theta_t,\end{aligned}\tag{B.14}$$
which immediately implies
$$x^\top\big(\mu_{t,K_t}-\widehat\theta_t\big)=\underbrace{x^\top A_t^{K_t}\cdots A_1^{K_1}\big(\theta_{1,0}-\widehat\theta_1\big)}_{I_1}+\underbrace{\sum_{i=1}^{t-1}x^\top A_t^{K_t}\cdots A_{i+1}^{K_{i+1}}\big(\widehat\theta_i-\widehat\theta_{i+1}\big)}_{I_2}.\tag{B.15}$$
For term $I_1$, recalling the inequalities in (B.7), we have
$$\big|x^\top A_t^{K_t}\cdots A_1^{K_1}\big(\theta_{1,0}-\widehat\theta_1\big)\big|\le\prod_{i=1}^{t}\big(1-2\eta_i\lambda_{\min}(V_i)\big)^{K_i}\|x\|_2\,\big\|\theta_{1,0}-\widehat\theta_1\big\|_2.\tag{B.16}$$
Recall that in Algorithm 1 we choose $\theta_{1,0}=0$ and $\widehat\theta_1=V_1^{-1}b_1=0$. Hence $I_1=0$. Now let us look at term $I_2$. Note that by the Sherman-Morrison formula we have
$$\widehat\theta_{t+1}-\widehat\theta_t=V_{t+1}^{-1}b_{t+1}-V_t^{-1}b_t=\frac{V_t^{-1}x_t\big(r_t-x_t^\top\widehat\theta_t\big)}{1+x_t^\top V_t^{-1}x_t}=\frac{V_t^{-1}x_t\big(x_t^\top\theta^*-x_t^\top\widehat\theta_t\big)}{1+x_t^\top V_t^{-1}x_t}+\frac{V_t^{-1}x_t\,\xi_t}{1+x_t^\top V_t^{-1}x_t}.$$
Therefore, $I_2$ can be bounded as follows:
$$\begin{aligned}|I_2|&=\bigg|\sum_{i=1}^{t-1}x^\top A_t^{K_t}\cdots A_{i+1}^{K_{i+1}}\frac{V_i^{-1}x_i\big(\xi_i+x_i^\top(\theta^*-\widehat\theta_i)\big)}{1+x_i^\top V_i^{-1}x_i}\bigg|\\&\le\sum_{i=1}^{t-1}\Big|x^\top A_t^{K_t}\cdots A_{i+1}^{K_{i+1}}V_i^{-1}x_i\Big|\Big(|\xi_i|+\big|x_i^\top(\theta^*-\widehat\theta_i)\big|\Big)\\&\le\sum_{i=1}^{t-1}\prod_{j=i+1}^{t}\big(1-2\eta_j\lambda_{\min}(V_j)\big)^{K_j}\|x\|_{V_i^{-1}}\|x_i\|_{V_i^{-1}}\Big(|\xi_i|+\big|x_i^\top(\theta^*-\widehat\theta_i)\big|\Big).\tag{B.17}\end{aligned}$$
Note that under event $\mathcal{E}_{R,t}$, by Lemma A.3 we have $\big|x_i^\top(\theta^*-\widehat\theta_i)\big|\le g_R(i)\|x_i\|_{V_i^{-1}}\le g_R(i)$ and $|\xi_i|\le R\sqrt{1+2\log t}$. Combining the above results, we have
$$\big|x^\top\big(\mu_{t,K_t}-\widehat\theta_t\big)\big|\le\sum_{i=1}^{t-1}\prod_{j=i+1}^{t}\big(1-2\eta_j\lambda_{\min}(V_j)\big)^{K_j}\|x_i\|_{V_i^{-1}}\|x\|_{V_i^{-1}}\Big(R\sqrt{1+2\log t}+g_R(i)\Big).\tag{B.18}$$
Substituting (B.13) and (B.18) into (B.5), we have with probability at least $1-1/t^2$ that
$$\begin{aligned}\big|x^\top\big(\theta_{t,K_t}-\widehat\theta_t\big)\big|&\le\sum_{i=1}^{t-1}\prod_{j=i+1}^{t}\big(1-2\eta_j\lambda_{\min}(V_j)\big)^{K_j}\Big(R\sqrt{1+2\log t}+g_R(i)\Big)\|x_i\|_{V_i^{-1}}\|x\|_{V_i^{-1}}\\&\qquad+2\sqrt{\frac{2d\log t}{3\beta_T}}\,\|x\|_{V_t^{-1}}+2\sqrt{\frac{2d\log t}{3\beta_T}}\sum_{i=1}^{t-1}\prod_{j=i+1}^{t}\big(1-2\eta_j\lambda_{\min}(V_j)\big)^{K_j}\|x_i\|_{V_i^{-1}}\|x\|_{V_i^{-1}}:=Q,\end{aligned}\tag{B.19}$$
where we denote the upper bound by $Q$. For any $j\in[t]$, recall that we require $\eta_j<1/(2\lambda_{\max}(V_j))$. Now we choose $\eta_j=1/(4\lambda_{\max}(V_j))$. Then it holds that $(1-2\eta_j\lambda_{\min}(V_j))^{K_j}=(1-1/(2\kappa_j))^{K_j}$, where $\kappa_j=\lambda_{\max}(V_j)/\lambda_{\min}(V_j)$ is the condition number of $V_j$. In order to ensure that this quantity is smaller than $\epsilon$, we need
$$K_j\ge\frac{\log(1/\epsilon)}{\log\big(1/(1-1/(2\kappa_j))\big)}.\tag{B.20}$$
Note that $e^{-x}>1-x$ for any $0<x<1$. Since $1/(2\kappa_j)\le 1/2$, we have $\log\big(1/(1-1/(2\kappa_j))\big)\ge 1/(2\kappa_j)$. Therefore, it suffices to set $K_j\ge 2\kappa_j\log(1/\epsilon)$ to ensure $(1-1/(2\kappa_j))^{K_j}\le\epsilon$.

Recall the definition $g_R(t)=R\sqrt{d\log(t^3/\delta)}+1$. Then we have
$$R\sqrt{1+2\log t}+g_R(i)\le R\sqrt{2\log t}+R\sqrt{d\log(t^3/\delta)}+R+1\le 3R\sqrt{2d\log(t^3/\delta)}.$$
By setting $\epsilon=\big(3R\sqrt{2dt\log(t^3/\delta)}\big)^{-1}$, we obtain
$$\begin{aligned}Q&\le\sum_{i=1}^{t-1}\epsilon^{t-i}\,3R\sqrt{2d\log(t^3/\delta)}\,\|x\|_2+2\sqrt{\frac{2d\log t}{3\beta_T}}\sum_{i=1}^{t-1}\epsilon^{t-i}\,\|x\|_2+2\sqrt{\frac{2d\log t}{3\beta_T}}\,\|x\|_{V_t^{-1}}\\&\le\sum_{i=1}^{t-1}\epsilon^{t-1-i}\,\|x\|_{V_t^{-1}}+2\sqrt{\frac{2d\log t}{3\beta_T}}\sum_{i=1}^{t-1}\epsilon^{t-1-i}\,\|x\|_{V_t^{-1}}+2\sqrt{\frac{2d\log t}{3\beta_T}}\,\|x\|_{V_t^{-1}}\\&\le\bigg(5\sqrt{\frac{2d\log t}{3\beta_T}}+\frac{3}{2}\bigg)\|x\|_{V_t^{-1}},\end{aligned}\tag{B.21}$$
where the first inequality uses $(1-2\eta_j\lambda_{\min}(V_j))^{K_j}\le\epsilon$, $\|x_i\|_{V_i^{-1}}\le 1$ and $\|x\|_{V_i^{-1}}\le\|x\|_2$, the second inequality is due to $\epsilon\cdot 3R\sqrt{2d\log(t^3/\delta)}=1/\sqrt{t}$ and $\|x\|_{V_t^{-1}}\ge\|x\|_2/\sqrt{t}$, and the third inequality is due to $\sum_{i=1}^{t-1}\epsilon^{t-1-i}=\sum_{i=0}^{t-2}\epsilon^{i}<1/(1-\epsilon)\le 3/2$. Note that with the choice of $\beta_T$ in Lemma A.5, the upper bound $Q$ is at most the threshold in the definition of $\mathcal{E}_{S,t}$ in (A.7). Therefore, it holds that
$$\mathbb{P}(\mathcal{E}_{S,t})\ge\mathbb{P}\Big(\big|x^\top(\theta_{t,k}-\widehat\theta_t)\big|\le Q,\ \forall x\in\mathcal{X}_t\Big)\ge 1-\frac{1}{t^2},$$
where the last inequality is due to (B.19).
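The inner-iteration requirement derived at the end of the proof above is easy to check numerically. The sketch below is an illustrative aside (the condition numbers and tolerances are arbitrary choices): it verifies that $K=\lceil 2\kappa\log(1/\epsilon)\rceil$ noisy gradient steps indeed make the contraction factor $(1-1/(2\kappa))^{K}$ fall below $\epsilon$.

```python
import math

# Check of the inner-iteration count used in the proof of Lemma A.4:
# K_j >= 2 * kappa_j * log(1/eps) guarantees (1 - 1/(2*kappa_j))**K_j <= eps.
# The condition numbers and tolerances below are arbitrary illustrative values.
for kappa in [1.0, 2.5, 10.0, 100.0, 1e4]:
    for eps in [1e-1, 1e-3, 1e-6]:
        K = math.ceil(2 * kappa * math.log(1 / eps))
        contraction = (1 - 1 / (2 * kappa)) ** K
        assert contraction <= eps, (kappa, eps, contraction)
        print(f"kappa={kappa:8.1f}  eps={eps:.0e}  K={K:7d}  "
              f"(1-1/(2*kappa))^K={contraction:.2e}")
```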
B.4. Proof of Lemma A.5

Proof of Lemma A.5. Based on the mean and covariance matrix defined in (B.2) and (B.3), we have that $x_t^{*\top}\theta_{t,k}$ follows the distribution $N\big(x_t^{*\top}\mu_{t,k},\,x_t^{*\top}\Sigma_{t,k}x_t^*\big)$. By Lemma C.1, we have
$$\mathbb{P}\big(x_t^{*\top}\theta_{t,k}>x_t^{*\top}\theta^*\big)=\mathbb{P}\bigg(\frac{x_t^{*\top}\theta_{t,k}-x_t^{*\top}\mu_{t,k}}{\sqrt{x_t^{*\top}\Sigma_{t,k}x_t^*}}>\frac{x_t^{*\top}\theta^*-x_t^{*\top}\mu_{t,k}}{\sqrt{x_t^{*\top}\Sigma_{t,k}x_t^*}}\bigg)\ge\frac{1}{2\sqrt{2\pi}}\,e^{-Z_t^2/2},\tag{B.22}$$
where the inequality holds when $|Z_t|\le 1$ and we define $Z_t=\big(x_t^{*\top}\theta^*-x_t^{*\top}\mu_{t,k}\big)/\sqrt{x_t^{*\top}\Sigma_{t,k}x_t^*}$. In the rest of the proof, we will show that $|Z_t|\le 1$ under event $\mathcal{E}_{R,t}$.

First note that by the triangle inequality we have
$$\begin{aligned}\big|x_t^{*\top}\theta^*-x_t^{*\top}\mu_{t,k}\big|&\le\big|x_t^{*\top}\theta^*-x_t^{*\top}\widehat\theta_t\big|+\big|x_t^{*\top}\widehat\theta_t-x_t^{*\top}\mu_{t,k}\big|\\&\le\sum_{i=1}^{t-1}\prod_{j=i+1}^{t}\big(1-2\eta_j\lambda_{\min}(V_j)\big)^{K_j}\|x_i\|_{V_i^{-1}}\|x_t^*\|_{V_i^{-1}}\Big(R\sqrt{1+2\log t}+g_R(i)\Big)+g_R(t)\|x_t^*\|_{V_t^{-1}},\end{aligned}\tag{B.23}$$
where in the second inequality we used the conclusion of Lemma A.3 (since event $\mathcal{E}_{R,t}$ holds) together with (B.18). When we choose $K_j\ge\kappa_j\log\big(3R\sqrt{2dt\log(t^3/\delta)}\big)$, we further have, with $\epsilon=\big(3R\sqrt{2dt\log(t^3/\delta)}\big)^{-1}$ as before,
$$\begin{aligned}\big|x_t^{*\top}\theta^*-x_t^{*\top}\mu_{t,k}\big|&\le\sum_{i=1}^{t-1}\epsilon^{t-i}\,\|x_t^*\|_2\Big(R\sqrt{1+2\log t}+g_R(t)\Big)+g_R(t)\|x_t^*\|_{V_t^{-1}}\\&\le\frac{\|x_t^*\|_2\big(R\sqrt{1+2\log t}+g_R(t)\big)}{3R\sqrt{2dt\log(t^3/\delta)}}+g_R(t)\|x_t^*\|_{V_t^{-1}}\\&\le\big(R\sqrt{d\log(t^3/\delta)}+2\big)\|x_t^*\|_{V_t^{-1}},\end{aligned}\tag{B.24}$$
where in the last inequality we used the fact that $g_R(t)=R\sqrt{d\log(t^3/\delta)}+1$.

On the other hand, recall the definition of $\Sigma_{t,K_t}$ in Proposition A.1. Following a similar argument as in the previous lemma, we have
$$x_t^{*\top}\Sigma_{t,k}x_t^*=\sum_{i=1}^{t}\frac{1}{\beta_i}\,x_t^{*\top}A_t^{K_t}\cdots A_{i+1}^{K_{i+1}}\big(I-A_i^{2K_i}\big)V_i^{-1}(I+A_i)^{-1}A_{i+1}^{K_{i+1}}\cdots A_t^{K_t}x_t^*\ge\sum_{i=1}^{t}\frac{1}{2\beta_i}\,x_t^{*\top}A_t^{K_t}\cdots A_{i+1}^{K_{i+1}}\big(I-A_i^{2K_i}\big)V_i^{-1}A_{i+1}^{K_{i+1}}\cdots A_t^{K_t}x_t^*.\tag{B.25}$$
Note that by definition $A_i=I-2\eta_iV_i$ and $V_i$ is symmetric. Therefore, $A_i$ and $V_i^{-1}$ commute, and it holds that
$$A_i^{2K_i}V_i^{-1}=(I-2\eta_iV_i)\cdots(I-2\eta_iV_i)(I-2\eta_iV_i)V_i^{-1}=(I-2\eta_iV_i)\cdots(I-2\eta_iV_i)V_i^{-1}(I-2\eta_iV_i)=A_i^{K_i}V_i^{-1}A_i^{K_i}.\tag{B.26}$$
Hence we have
$$\begin{aligned}x_t^{*\top}\Sigma_{t,k}x_t^*&\ge\sum_{i=1}^{t}\frac{1}{2\beta_i}\,x_t^{*\top}A_t^{K_t}\cdots A_{i+1}^{K_{i+1}}\big(V_i^{-1}-A_i^{K_i}V_i^{-1}A_i^{K_i}\big)A_{i+1}^{K_{i+1}}\cdots A_t^{K_t}x_t^*\\&=\frac{1}{2\beta_T}\sum_{i=1}^{t-1}x_t^{*\top}A_t^{K_t}\cdots A_{i+1}^{K_{i+1}}\big(V_i^{-1}-V_{i+1}^{-1}\big)A_{i+1}^{K_{i+1}}\cdots A_t^{K_t}x_t^*\\&\qquad-\frac{1}{2\beta_T}\,x_t^{*\top}A_t^{K_t}\cdots A_1^{K_1}V_1^{-1}A_1^{K_1}\cdots A_t^{K_t}x_t^*+\frac{1}{2\beta_T}\,x_t^{*\top}V_t^{-1}x_t^*,\end{aligned}\tag{B.27}$$
where we used the choice of $1/\beta_i=1/\beta_T$ for all $i$. By the definition in (2.2) and the Sherman-Morrison formula, we have
$$V_i^{-1}-V_{i+1}^{-1}=V_i^{-1}-\big(V_i+x_ix_i^\top\big)^{-1}=\frac{V_i^{-1}x_ix_i^\top V_i^{-1}}{1+\|x_i\|_{V_i^{-1}}^2},\tag{B.28}$$
which immediately implies
$$\begin{aligned}x_t^{*\top}A_t^{K_t}\cdots A_{i+1}^{K_{i+1}}\big(V_i^{-1}-V_{i+1}^{-1}\big)A_{i+1}^{K_{i+1}}\cdots A_t^{K_t}x_t^*&=\frac{x_t^{*\top}A_t^{K_t}\cdots A_{i+1}^{K_{i+1}}V_i^{-1}x_ix_i^\top V_i^{-1}A_{i+1}^{K_{i+1}}\cdots A_t^{K_t}x_t^*}{1+\|x_i\|_{V_i^{-1}}^2}\\&\le\big(x_t^{*\top}A_t^{K_t}\cdots A_{i+1}^{K_{i+1}}V_i^{-1}x_i\big)^2\\&\le\big\|V_i^{-1/2}A_{i+1}^{K_{i+1}}\cdots A_t^{K_t}x_t^*\big\|_2^2\,\big\|V_i^{-1/2}x_i\big\|_2^2\\&\le\prod_{j=i+1}^{t}\big(1-2\eta_j\lambda_{\min}(V_j)\big)^{2K_j}\|x_i\|_{V_i^{-1}}^2\|x_t^*\|_{V_i^{-1}}^2,\end{aligned}$$
where we used $0<1/(1+\|x_i\|_{V_i^{-1}}^2)\le 1$. Therefore, we have
$$x_t^{*\top}\Sigma_{t,K_t}x_t^*\ge\frac{1}{2\beta_T}\,x_t^{*\top}V_t^{-1}x_t^*-\frac{1}{2\beta_T}\prod_{i=1}^{t}\big(1-2\eta_i\lambda_{\min}(V_i)\big)^{2K_i}\|x_t^*\|_{V_1^{-1}}^2-\frac{1}{2\beta_T}\sum_{i=1}^{t-1}\prod_{j=i+1}^{t}\big(1-2\eta_j\lambda_{\min}(V_j)\big)^{2K_j}\|x_i\|_{V_i^{-1}}^2\|x_t^*\|_{V_i^{-1}}^2.\tag{B.29}$$
Similar to the proof of Lemma A.4, when we choose $K_j\ge\kappa_j\log(3t)$, we have $\big(1-2\eta_j\lambda_{\min}(V_j)\big)^{2K_j}\le(3t)^{-1}$, and hence
$$x_t^{*\top}\Sigma_{t,K_t}x_t^*\ge\frac{1}{2\beta_T}\,x_t^{*\top}V_t^{-1}x_t^*-\frac{1}{2\beta_T}\sum_{i=1}^{t}(3t)^{-(t-i+1)}\|x_t^*\|_2^2\ge\frac{1}{2\beta_T}\,x_t^{*\top}V_t^{-1}x_t^*-\frac{1}{4\beta_Tt}\,\|x_t^*\|_2^2\ge\frac{1}{4\beta_T}\,x_t^{*\top}V_t^{-1}x_t^*,\tag{B.30}$$
where we used $\|x_i\|_{V_i^{-1}}\le 1$, $\|x_t^*\|_{V_i^{-1}}\le\|x_t^*\|_2$, $\sum_{k\ge 1}(3t)^{-k}\le 1/(2t)$, and the fact that $\lambda_{\min}(V_t^{-1})\ge 1/t$. Therefore, according to (B.24) and (B.30), it holds that
$$|Z_t|=\frac{\big|x_t^{*\top}\theta^*-x_t^{*\top}\mu_{t,k}\big|}{\sqrt{x_t^{*\top}\Sigma_{t,k}x_t^*}}\le\frac{\big(R\sqrt{d\log(t^3/\delta)}+2\big)\|x_t^*\|_{V_t^{-1}}}{\sqrt{1/(4\beta_T)}\,\|x_t^*\|_{V_t^{-1}}}=2\sqrt{\beta_T}\big(R\sqrt{d\log(t^3/\delta)}+2\big),\tag{B.31}$$
which implies $|Z_t|\le 1$ with the choice $\beta_t^{-1}=4\big(R\sqrt{d\log(T^3/\delta)}+2\big)^2$. This completes our proof.

B.5. Proof of Lemma A.6

Proof of Lemma A.6. Since the algorithm chooses the arm to pull based on the estimated reward $x^\top\theta_{t,k}$, as long as we can find an arm in the unsaturated set that beats all arms in the saturated set, we will have $x_t\in U_t$. Recall from the definition in (A.2) that the best arm is in the unsaturated set, i.e., $x_t^*\in U_t$. Therefore, it holds that
$$\{x_t\in U_t\}\supseteq\big\{x_t^{*\top}\theta_{t,k}>x^\top\theta_{t,k},\ \forall x\in S_t\big\}.\tag{B.32}$$
Conditional on event $\mathcal{E}_{R,t}$, we have
$$\mathbb{P}\big(x_t^{*\top}\theta_{t,k}>x_t^{*\top}\theta^*\big)=\mathbb{P}\big(x_t^{*\top}\theta_{t,k}>x_t^{*\top}\theta^*\,\big|\,\mathcal{E}_{S,t}\big)\mathbb{P}(\mathcal{E}_{S,t})+\mathbb{P}\big(x_t^{*\top}\theta_{t,k}>x_t^{*\top}\theta^*\,\big|\,\mathcal{E}_{S,t}^c\big)\mathbb{P}(\mathcal{E}_{S,t}^c)\le\mathbb{P}\big(x_t^{*\top}\theta_{t,k}>x_t^{*\top}\theta^*\,\big|\,\mathcal{E}_{S,t}\big)+\mathbb{P}(\mathcal{E}_{S,t}^c).\tag{B.33}$$
Recalling the definitions of the gap and the saturated set in (A.1) and (A.2), we have $x_t^{*\top}\theta^*=x^\top\theta^*+\Delta_t(x)\ge x^\top\theta^*+g_t(x)$ for any $x\in S_t$, where $g_t(x)$ is defined as in (A.9). Then it holds that
$$\mathbb{P}\big(x_t^{*\top}\theta_{t,k}>x_t^{*\top}\theta^*\,\big|\,\mathcal{E}_{S,t}\big)\le\mathbb{P}\big(x_t^{*\top}\theta_{t,k}>x^\top\theta^*+g_t(x),\ \forall x\in S_t\,\big|\,\mathcal{E}_{S,t}\big)\le\mathbb{P}\big(x_t^{*\top}\theta_{t,k}>x^\top\theta_{t,k},\ \forall x\in S_t\,\big|\,\mathcal{E}_{S,t}\big),\tag{B.34}$$
where the second inequality is true since $|x^\top(\theta_{t,k}-\theta^*)|\le g_t(x)$ based on the definition of events $\mathcal{E}_{R,t}$ and $\mathcal{E}_{S,t}$ in (A.6) and (A.7) respectively. Therefore, we have
$$\mathbb{P}(x_t\in U_t)\ge\mathbb{P}\big(x_t^{*\top}\theta_{t,k}>x^\top\theta_{t,k},\ \forall x\in S_t\,\big|\,\mathcal{E}_{S,t}\big)\ge\mathbb{P}\big(x_t^{*\top}\theta_{t,k}>x_t^{*\top}\theta^*\big)-\mathbb{P}(\mathcal{E}_{S,t}^c)\ge\frac{1}{2\sqrt{2e\pi}}-\frac{1}{t^2},$$
where the last inequality holds due to Lemma A.4 and Lemma A.5.

C. Auxiliary Lemmas

Lemma C.1 (Abramowitz & Stegun, 1964). Suppose $Z$ is a Gaussian random variable $Z\sim N(\mu,\sigma^2)$, where $\sigma>0$. For $0\le z\le 1$, we have
$$\mathbb{P}(Z>\mu+z\sigma)\ge\frac{1}{2\sqrt{2\pi}}e^{-z^2/2},\qquad\mathbb{P}(Z<\mu-z\sigma)\ge\frac{1}{2\sqrt{2\pi}}e^{-z^2/2}.$$
And for $z\ge 1$, we have
$$\frac{1}{2\sqrt{\pi}\,z}e^{-z^2/2}\le\mathbb{P}\big(|Z-\mu|>z\sigma\big)\le\frac{1}{\sqrt{\pi}\,z}e^{-z^2/2}.$$

Lemma C.2 (Vershynin, 2010). Let $X$ be a $\sigma^2$-sub-Gaussian random vector in $\mathbb{R}^d$. Then we have $\mathbb{E}[\|X\|_2]\le 4\sigma\sqrt{d}$. Moreover, for any $\delta\in(0,1)$, with probability at least $1-\delta$ it holds that $\|X\|_2\le 4\sigma\big(\sqrt{d}+\sqrt{\log(1/\delta)}\big)$.

The following lemma introduces the Azuma-Hoeffding inequality for super-martingales.

Lemma C.3. Suppose $\{X_k\}_{k=0,1,\ldots}$ is a super-martingale and satisfies $|X_{k+1}-X_k|<c_{k+1}$ for all $k\ge 0$. Then for any $\epsilon>0$, we have
$$\mathbb{P}(X_T-X_0\ge\epsilon)\le\exp\bigg(-\frac{\epsilon^2}{2\sum_{t=1}^{T}c_t^2}\bigg).$$
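The Gaussian anti-concentration bound of Lemma C.1, which drives the constant $1/(2\sqrt{2e\pi})$ in Lemma A.5 via (B.22), can be checked directly against the exact normal tail. The snippet below is an illustrative check using SciPy's normal survival function; the grid of $z$ values is an arbitrary choice.

```python
import numpy as np
from scipy.stats import norm

# Check the anti-concentration bound used in (B.22) and Lemma C.1:
# for 0 <= z <= 1, P(Z > mu + z*sigma) >= exp(-z^2/2) / (2*sqrt(2*pi)).
# The grid of z values below is an arbitrary illustrative choice.
z = np.linspace(0.0, 1.0, 101)
exact_tail = norm.sf(z)                              # P(N(0,1) > z)
lower_bound = np.exp(-z**2 / 2) / (2 * np.sqrt(2 * np.pi))

assert np.all(exact_tail >= lower_bound)
print(f"min slack over [0,1]: {np.min(exact_tail - lower_bound):.4f}")
print(f"bound at z=1: {lower_bound[-1]:.4f} "
      f"(= 1/(2*sqrt(2*e*pi)) = {1 / (2 * np.sqrt(2 * np.e * np.pi)):.4f})")
```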