Ensemble Sampling

Xiuyuan Lu, Stanford University, lxy@stanford.edu
Benjamin Van Roy, Stanford University, bvr@stanford.edu

Abstract

Thompson sampling has emerged as an effective heuristic for a broad range of online decision problems. In its basic form, the algorithm requires computing and sampling from a posterior distribution over models, which is tractable only for simple special cases. This paper develops ensemble sampling, which aims to approximate Thompson sampling while maintaining tractability even in the face of complex models such as neural networks. Ensemble sampling dramatically expands on the range of applications for which Thompson sampling is viable. We establish a theoretical basis that supports the approach and present computational results that offer further insight.

1 Introduction

Thompson sampling [8] has emerged as an effective heuristic for trading off between exploration and exploitation in a broad range of online decision problems. To select an action, the algorithm samples a model of the system from the prevailing posterior distribution and then determines which action maximizes expected immediate reward according to the sampled model. In its basic form, the algorithm requires computing and sampling from a posterior distribution over models, which is tractable only for simple special cases. With complex models such as neural networks, exact computation of posterior distributions becomes intractable. One can resort to the Laplace approximation, as discussed, for example, in [2, 5], but this approach is suitable only when posterior distributions are unimodal, and computations become an obstacle with complex models like neural networks because compute time requirements grow quadratically with the number of parameters. An alternative is to leverage Markov chain Monte Carlo methods, but those are computationally onerous, especially when the model is complex.

A practical approximation to Thompson sampling that can address complex models and problems requiring frequent decisions should facilitate fast incremental updating. That is, the time required per time period to learn from new data and generate a new sample model should be small and should not grow with time. Such a fast incremental method that builds on the Laplace approximation concept is presented in [5]. In this paper, we study a fast incremental method that applies more broadly, without relying on unimodality. As a sanity check, we offer theoretical assurances that apply to the special case of linear bandits. We also present computational results involving simple bandit problems as well as complex neural network models that demonstrate the efficacy of the approach.

Our approach is inspired by [6], which applies a similar concept to the more complex context of deep reinforcement learning, but without any theoretical analysis. The essential idea is to maintain and incrementally update an ensemble of statistically plausible models, and to sample uniformly from this set in each time period as an approximation to sampling from the posterior distribution. Each model is initially sampled from the prior, and then updated in a manner that incorporates data and random perturbations that diversify the models. The intention is for the ensemble to approximate the posterior distribution and for the variance among models to diminish as the posterior concentrates.
We refine this methodology and bound the incremental regret relative to exact Thompson sampling for a broad class of online decision problems. Our bound indicates that it suffices to maintain a number of models that grows only logarithmically with the horizon of the decision problem, ensuring computational tractability of the approach.

2 Problem formulation

We consider a broad class of online decision problems to which Thompson sampling could, in principle, be applied, though that would typically be hindered by intractable computational requirements. We will define random variables with respect to a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ endowed with a filtration $(\mathcal{F}_t : t = 0, \ldots, T)$. As a convention, random variables we index by $t$ will be $\mathcal{F}_t$-measurable, and we use $\mathbb{P}_t$ and $\mathbb{E}_t$ to denote probabilities and expectations conditioned on $\mathcal{F}_t$. The decision-maker chooses actions $A_0, \ldots, A_{T-1} \in \mathcal{A}$ and observes outcomes $Y_1, \ldots, Y_T \in \mathcal{Y}$. There is a random variable $\theta$, which represents a model index. Conditioned on $(\theta, A_{t-1})$, $Y_t$ is independent of $\mathcal{F}_{t-1}$. Further, $\mathbb{P}(Y_t = y \mid \theta, A_{t-1})$ does not depend on $t$. This can be thought of as a Bayesian formulation, where randomness in $\theta$ reflects prior uncertainty about which model corresponds to the true nature of the system.

We assume that $\mathcal{A}$ is finite and that each action $A_t$ is chosen by a randomized policy $\pi = (\pi_0, \ldots, \pi_{T-1})$. Each $\pi_t$ is $\mathcal{F}_t$-measurable, and each realization is a probability mass function over actions $\mathcal{A}$; $A_t$ is sampled independently from $\pi_t$. The agent associates a reward $R(y)$ with each outcome $y \in \mathcal{Y}$, where the reward function $R$ is fixed and known. Let $R_t = R(Y_t)$ denote the reward realized at time $t$. Let $R_\theta(a) = \mathbb{E}[R(Y_t) \mid \theta, A_{t-1} = a]$. Uncertainty about $\theta$ induces uncertainty about the true optimal action, which we denote by $A^* \in \arg\max_{a \in \mathcal{A}} R_\theta(a)$. Let $R^* = R_\theta(A^*)$. The $T$-period conditional regret when the actions $(A_0, \ldots, A_{T-1})$ are chosen according to $\pi$ is defined by
$$\mathrm{Regret}(T, \pi, \theta) = \mathbb{E}\left[\sum_{t=1}^{T} (R^* - R_t) \,\middle|\, \theta\right],$$
where the expectation is taken over the randomness in actions $A_t$ and outcomes $Y_t$, conditioned on $\theta$. We illustrate with a couple of examples that fit our formulation.

Example 1. (linear bandit) Let $\theta$ be drawn from $\Re^N$ and distributed according to a $N(\mu_0, \Sigma_0)$ prior. There is a set of $K$ actions $\mathcal{A} \subseteq \Re^N$. At each time $t = 0, 1, \ldots, T-1$, an action $A_t \in \mathcal{A}$ is selected, after which a reward $R_{t+1} = Y_{t+1} = \theta^\top A_t + W_{t+1}$ is observed, where $W_{t+1} \sim N(0, \sigma_w^2)$.

Example 2. (neural network) Let $g_\theta : \Re^N \mapsto \Re^K$ denote a mapping induced by a neural network with weights $\theta$. Suppose there are $K$ actions $\mathcal{A} \subseteq \Re^N$, which serve as inputs to the neural network, and the goal is to select inputs that yield desirable outputs. At each time $t = 0, 1, \ldots, T-1$, an action $A_t \in \mathcal{A}$ is selected, after which $Y_{t+1} = g_\theta(A_t) + W_{t+1}$ is observed, where $W_{t+1} \sim N(0, \sigma_w^2 I)$. A reward $R_{t+1} = R(Y_{t+1})$ is associated with each observation. Let $\theta$ be distributed according to a $N(\mu_0, \Sigma_0)$ prior. The idea here is that data pairs $(A_t, Y_{t+1})$ can be used to fit a neural network model, while actions are selected to trade off between generating data pairs that reduce uncertainty in neural network weights and those that offer desirable immediate outcomes.

3 Algorithms

Thompson sampling offers a heuristic policy for selecting actions. In each time period, the algorithm samples an action from the posterior distribution $p_t(a) = \mathbb{P}_t(A^* = a)$ of the optimal action. In other words, Thompson sampling uses a policy $\pi_t = p_t$.
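For concreteness, the following is a minimal Python sketch of Thompson sampling for the linear bandit of Example 1, using the conjugate Gaussian posterior updates discussed next. The function and variable names are illustrative and not from the paper; this is a sketch, not the authors' implementation.

```python
import numpy as np

def thompson_sampling_linear_bandit(actions, theta_true, mu0, Sigma0, sigma_w, T, rng=None):
    """Minimal sketch of Thompson sampling for Example 1 (linear bandit).

    actions: (K, N) array of feasible actions; theta_true: (N,) parameter drawn from N(mu0, Sigma0).
    Returns the per-period regret R* - theta^T A_t for one realization.
    """
    rng = np.random.default_rng() if rng is None else rng
    mu, Sigma = mu0.copy(), Sigma0.copy()
    best_mean = np.max(actions @ theta_true)               # R* = max_a theta^T a
    regret = []
    for t in range(T):
        theta_hat = rng.multivariate_normal(mu, Sigma)     # sample a model from the posterior
        a = actions[np.argmax(actions @ theta_hat)]        # act greedily under the sampled model
        reward = theta_true @ a + sigma_w * rng.standard_normal()
        # Conjugate Gaussian update of the posterior N(mu_{t+1}, Sigma_{t+1}).
        Sigma_next = np.linalg.inv(np.linalg.inv(Sigma) + np.outer(a, a) / sigma_w**2)
        mu = Sigma_next @ (np.linalg.solve(Sigma, mu) + a * reward / sigma_w**2)
        Sigma = Sigma_next
        regret.append(best_mean - theta_true @ a)
    return np.array(regret)
```

A call such as `thompson_sampling_linear_bandit(actions, theta, mu0, Sigma0, sigma_w=1.0, T=1000)` returns one realization's regret trace, which is the quantity averaged in the experiments of Section 5.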
It is easy to see that this is equivalent to sampling a model index $\hat{\theta}_t$ from the posterior distribution of models and then selecting an action $A_t = \arg\max_{a \in \mathcal{A}} R_{\hat{\theta}_t}(a)$ that optimizes the sampled model. Thompson sampling is computationally tractable for some problem classes, like the linear bandit problem, where the posterior distribution is Gaussian with parameters $(\mu_t, \Sigma_t)$ that can be updated incrementally and efficiently via Kalman filtering as outcomes are observed. However, when dealing with complex models, like neural networks, computing the posterior distribution becomes intractable. Ensemble sampling serves as an approximation to Thompson sampling for such contexts.

Algorithm 1 Ensemble Sampling
1: Sample: $\theta_{0,1}, \ldots, \theta_{0,M} \sim p_0$
2: for $t = 0, \ldots, T-1$ do
3:   Sample: $m \sim \mathrm{unif}(\{1, \ldots, M\})$
4:   Act: $A_t = \arg\max_{a \in \mathcal{A}} R_{\theta_{t,m}}(a)$
5:   Observe: $Y_{t+1}$
6:   Update: $\theta_{t+1,1}, \ldots, \theta_{t+1,M}$
7: end for

The posterior can be interpreted as a distribution of statistically plausible models, by which we mean models that are sufficiently consistent with prior beliefs and the history of observations. With this interpretation in mind, Thompson sampling can be thought of as randomly drawing from the range of statistically plausible models. Ensemble sampling aims to maintain, incrementally update, and sample from a finite set of such models. In the spirit of particle filtering, this set of models approximates the posterior distribution. The workings of ensemble sampling are in some ways more intricate than conventional uses of particle filtering, however, because interactions between the ensemble of models and selected actions can skew the distribution.

While elements of ensemble sampling require customization, a general template is presented as Algorithm 1. The algorithm begins by sampling M models from the prior distribution. Then, over each time period, a model is sampled uniformly from the ensemble, an action is selected to maximize expected reward under the sampled model, the resulting outcome is observed, and each of the M models is updated. To produce an explicit algorithm, we must specify a model class, prior distribution, and algorithms for sampling from the prior and updating models.

For a concrete illustration, let us consider the linear bandit (Example 1). Though ensemble sampling is unwarranted in this case, since Thompson sampling is efficient, the linear bandit serves as a useful context for understanding the approach. Standard algorithms can be used to sample models from the $N(\mu_0, \Sigma_0)$ prior. One possible procedure for updating models maintains a covariance matrix, updating it according to
$$\Sigma_{t+1} = \left(\Sigma_t^{-1} + A_t A_t^\top / \sigma_w^2\right)^{-1},$$
and generates model parameters incrementally according to
$$\theta_{t+1,m} = \Sigma_{t+1}\left(\Sigma_t^{-1} \theta_{t,m} + A_t (R_{t+1} + \tilde{W}_{t+1,m}) / \sigma_w^2\right),$$
for $m = 1, \ldots, M$, where $(\tilde{W}_{t,m} : t = 1, \ldots, T,\ m = 1, \ldots, M)$ are independent $N(0, \sigma_w^2)$ random samples drawn by the updating algorithm. It is easy to show that the resulting parameter vectors satisfy
$$\theta_{t,m} = \arg\min_\nu \frac{1}{\sigma_w^2} \sum_{\tau=0}^{t-1} \left(R_{\tau+1} + \tilde{W}_{\tau+1,m} - A_\tau^\top \nu\right)^2 + (\nu - \theta_{0,m})^\top \Sigma_0^{-1} (\nu - \theta_{0,m}),$$
which admits an intuitive interpretation: each $\theta_{t,m}$ is a model fit to a randomly perturbed prior and randomly perturbed observations. As we establish in the appendix, for any deterministic sequence $A_0, \ldots, A_{t-1}$, conditioned on $\mathcal{F}_t$, the models $\theta_{t,1}, \ldots, \theta_{t,M}$ are independent and identically distributed according to the posterior distribution of $\theta$. In this sense, the ensemble approximates the posterior.
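To make the updates above concrete, here is a minimal sketch of Algorithm 1 specialized to the linear bandit, with the covariance and perturbed-parameter recursions written out. The environment loop, variable names, and the bookkeeping via precision-weighted parameters are our own illustrative choices, not the authors' code.

```python
import numpy as np

def ensemble_sampling_linear_bandit(actions, theta_true, mu0, Sigma0, sigma_w, M, T, rng=None):
    """Minimal sketch of Algorithm 1 for the linear bandit of Example 1.

    actions: (K, N) array of feasible actions; theta_true: (N,) true parameter.
    Returns the per-period regret for one realization.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Sample M models theta_{0,1}, ..., theta_{0,M} from the N(mu0, Sigma0) prior.
    models = rng.multivariate_normal(mu0, Sigma0, size=M)        # (M, N)
    Sigma = Sigma0.copy()                                        # shared covariance Sigma_t
    # Keep b_m = Sigma_t^{-1} theta_{t,m} per model; it updates additively.
    b = np.linalg.solve(Sigma0, models.T).T                      # (M, N)
    best_mean = np.max(actions @ theta_true)
    regret = []
    for t in range(T):
        m = rng.integers(M)                                      # sample a model uniformly
        a = actions[np.argmax(actions @ models[m])]              # greedy action under that model
        reward = float(theta_true @ a + sigma_w * rng.standard_normal())
        regret.append(best_mean - theta_true @ a)
        # Covariance update: Sigma_{t+1} = (Sigma_t^{-1} + a a^T / sigma_w^2)^{-1}
        Sigma = np.linalg.inv(np.linalg.inv(Sigma) + np.outer(a, a) / sigma_w**2)
        # Perturbed update: theta_{t+1,m} = Sigma_{t+1}(Sigma_t^{-1} theta_{t,m} + a (R + W~_m)/sigma_w^2)
        W_tilde = sigma_w * rng.standard_normal(M)               # per-model perturbations
        b = b + np.outer(reward + W_tilde, a) / sigma_w**2
        models = b @ Sigma.T                                     # theta_{t+1,m} = Sigma_{t+1} b_m
    return np.array(regret)
```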
It is not a new observation that, for deterministic action sequences, such a scheme generates exact samples of the posterior distribution (see, e.g., [7]). However, for stochastic action sequences selected by Algorithm 1, it is not immediately clear how well the ensemble approximates the posterior distribution. We will provide a bound in the next section which establishes that, as the number of models M increases, the regret of ensemble sampling quickly approaches that of Thompson sampling.

The ensemble sampling algorithm we have described for the linear bandit problem motivates an analogous approach for the neural network model of Example 2. This approach would again begin with M models, with connection weights $\theta_{0,1}, \ldots, \theta_{0,M}$ sampled from a $N(\mu_0, \Sigma_0)$ prior. It could be natural here to let $\mu_0 = 0$ and $\Sigma_0 = \sigma_0^2 I$ for some variance $\sigma_0^2$ chosen so that the range of probable models spans plausible outcomes. To incrementally update parameters, at each time $t$, each model $m$ applies some number of stochastic gradient descent iterations to reduce a loss function of the form
$$\mathcal{L}(\nu) = \frac{1}{\sigma_w^2} \sum_{\tau=0}^{t-1} \left\|Y_{\tau+1} + \tilde{W}_{\tau+1,m} - g_\nu(A_\tau)\right\|^2 + (\nu - \theta_{0,m})^\top \Sigma_0^{-1} (\nu - \theta_{0,m}).$$
We present computational results in Section 5.2 that demonstrate viability of this approach.

4 Analysis of ensemble sampling for the linear bandit

Past analyses of Thompson sampling have relied on independence between models sampled over time periods. Ensemble sampling introduces dependencies that may adversely impact performance. It is not immediately clear whether the degree of degradation should be tolerable and how that depends on the number of models in the ensemble. In this section, we establish a bound for the linear bandit context. Our result serves as a sanity check for ensemble sampling and offers insight that should extend to broader model classes, though we leave formal analysis beyond the linear bandit for future work.

Consider the linear bandit problem described in Example 1. Let $\pi^{TS}$ and $\pi^{ES}$ denote the Thompson and ensemble sampling policies for this problem, with the latter based on an ensemble of M models, generated and updated according to the procedure described in Section 3. Let $R_{\min} = \min_{a \in \mathcal{A}} \theta^\top a$ denote the worst mean reward and let $\Delta(\theta) = R^* - R_{\min}$ denote the gap between maximal and minimal mean rewards. The following result bounds the difference in regret as a function of the gap, ensemble size, and number of actions.

Theorem 3. For all $\epsilon > 0$, if $M \ge \frac{4|\mathcal{A}|}{\epsilon^2} \log\left(4|\mathcal{A}|T\right)$, then
$$\mathrm{Regret}(T, \pi^{ES}, \theta) \le \mathrm{Regret}(T, \pi^{TS}, \theta) + \epsilon\, \Delta(\theta)\, T.$$

This inequality bounds the regret realized by ensemble sampling by a sum of the regret realized by Thompson sampling and an error term $\epsilon \Delta(\theta) T$. Since we are talking about cumulative regret, the error term bounds the per-period degradation relative to Thompson sampling by $\epsilon \Delta(\theta)$. The value of $\epsilon$ can be made arbitrarily small by increasing M. Hence, with a sufficiently large ensemble, the per-period loss will be small. This supports the viability of ensemble sampling.

An important implication of this result is that it suffices for the ensemble size to grow logarithmically in the horizon T. Since Thompson sampling requires independence between models sampled over time, in a sense, it relies on T models, one per time period. So to be useful, ensemble sampling should operate effectively with a much smaller number, and the logarithmic dependence is suitable. The bound also grows with $|\mathcal{A}| \log |\mathcal{A}|$, which is manageable when there are a modest number of actions.
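To make these growth rates explicit, a sufficient condition of the form appearing in Theorem 3, $M \ge \frac{c\,|\mathcal{A}|}{\epsilon^2}\log(4|\mathcal{A}|T)$ with $c$ a constant, expands as
$$\frac{c\,|\mathcal{A}|}{\epsilon^2}\log\bigl(4|\mathcal{A}|T\bigr) = \frac{c\,|\mathcal{A}|}{\epsilon^2}\bigl(\log 4 + \log|\mathcal{A}| + \log T\bigr),$$
so for a fixed tolerance $\epsilon$ the required ensemble size is $O\bigl(|\mathcal{A}|\log|\mathcal{A}| + |\mathcal{A}|\log T\bigr)$.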
We conjecture that a similar bound holds that depends instead on a multiple of $N \log N$, where $N$ is the linear dimension, which would offer a stronger guarantee when the number of actions becomes large or infinite, though we leave proof of this alternative bound for future work.

The bound of Theorem 3 is on a notion of regret conditioned on the realization of $\theta$. A Bayesian regret bound that removes dependence on this realization can be obtained by taking an expectation, integrating over $\theta$:
$$\mathbb{E}\left[\mathrm{Regret}(T, \pi^{ES}, \theta)\right] \le \mathbb{E}\left[\mathrm{Regret}(T, \pi^{TS}, \theta)\right] + \epsilon\, \mathbb{E}[\Delta(\theta)]\, T.$$
We provide a complete proof of Theorem 3 in the appendix. Due to space constraints, we only offer a sketch here.

Sketch of Proof. Let $A$ denote an $\mathcal{F}_t$-adapted action process $(A_0, \ldots, A_{T-1})$. Our procedure for generating and updating models with ensemble sampling is designed so that, for any deterministic $A$, conditioned on the history of rewards $(R_1, \ldots, R_t)$, the models $\theta_{t,1}, \ldots, \theta_{t,M}$ that comprise the ensemble are independent and identically distributed according to the posterior distribution of $\theta$. This can be verified via some algebra, as is done in the appendix.

Recall that $p_t(a)$ denotes the posterior probability $\mathbb{P}_t(A^* = a) = \mathbb{P}(A^* = a \mid A_0, R_1, \ldots, A_{t-1}, R_t)$. To explicitly indicate dependence on the action process, we will use a superscript: $p_t(a) = p_t^A(a)$. Let $\hat{p}_t^A$ denote an approximation to $p_t^A$, given by
$$\hat{p}_t^A(a) = \frac{1}{M} \sum_{m=1}^{M} \mathbb{I}\left(a = \arg\max_{a'} \theta_{t,m}^\top a'\right).$$
Note that given an action process $A$, at time $t$ Thompson sampling would sample the next action from $p_t^A$, while ensemble sampling would sample the next action from $\hat{p}_t^A$. If $A$ is deterministic then, since $\theta_{t,1}, \ldots, \theta_{t,M}$, conditioned on the history of rewards, are i.i.d. and distributed as $\theta$, $\hat{p}_t^A$ represents an empirical distribution of samples drawn from $p_t^A$. It follows from this and Sanov's Theorem that, for any deterministic $A$,
$$\mathbb{P}\left(d_{KL}\left(\hat{p}_t^A \,\middle\|\, p_t^A\right) \ge \epsilon \,\middle|\, \theta\right) \le (M+1)^{|\mathcal{A}|} e^{-M\epsilon}.$$
A naive application of the union bound over all deterministic action sequences would establish that, for any $A$ (deterministic or stochastic),
$$\mathbb{P}\left(d_{KL}\left(\hat{p}_t^A \,\middle\|\, p_t^A\right) \ge \epsilon \,\middle|\, \theta\right) \le \mathbb{P}\left(\max_{a \in \mathcal{A}^t} d_{KL}\left(\hat{p}_t^a \,\middle\|\, p_t^a\right) \ge \epsilon \,\middle|\, \theta\right) \le |\mathcal{A}|^t (M+1)^{|\mathcal{A}|} e^{-M\epsilon}.$$
However, our proof takes advantage of the fact that, for any deterministic $A$, $p_t^A$ and $\hat{p}_t^A$ do not depend on the ordering of past actions and observations. To make this precise, we encode the sequence of actions in terms of action counts $c_0, \ldots, c_{T-1}$. In particular, let $c_{t,a} = |\{\tau \le t : A_\tau = a\}|$ be the number of times that action $a$ has been selected by time $t$. We apply a coupling argument that introduces dependencies between the noise terms $W_t$ and action counts, without changing the distributions of any observable variables. We let $(Z_{n,a} : n \in \mathbb{N},\ a \in \mathcal{A})$ be i.i.d. $N(0,1)$ random variables, and let $W_{t+1} = Z_{c_{t,A_t}, A_t}$. Similarly, we let $(\tilde{Z}_{n,a,m} : n \in \mathbb{N},\ a \in \mathcal{A},\ m = 1, \ldots, M)$ be i.i.d. $N(0,1)$ random variables, and let $\tilde{W}_{t+1,m} = \tilde{Z}_{c_{t,A_t}, A_t, m}$. To make explicit the dependence on $A$, we will use a superscript and write $c_t^A$ to denote the action counts at time $t$ when the action process is given by $A$. It is not hard to verify, as is done in the appendix, that if $a, a' \in \mathcal{A}^T$ are two deterministic action sequences such that $c_{t-1}^a = c_{t-1}^{a'}$, then $p_t^a = p_t^{a'}$ and $\hat{p}_t^a = \hat{p}_t^{a'}$. This allows us to apply the union bound over action counts, instead of action sequences; since each of the $|\mathcal{A}|$ action counts takes a value in $\{0, \ldots, t\}$, there are at most $(t+1)^{|\mathcal{A}|}$ distinct count vectors, and we get that for any $A$ (deterministic or stochastic),
$$\mathbb{P}\left(d_{KL}\left(\hat{p}_t^A \,\middle\|\, p_t^A\right) \ge \epsilon \,\middle|\, \theta\right) \le \mathbb{P}\left(\max_{c_{t-1}^a:\, a \in \mathcal{A}^t} d_{KL}\left(\hat{p}_t^a \,\middle\|\, p_t^a\right) \ge \epsilon \,\middle|\, \theta\right) \le (t+1)^{|\mathcal{A}|} (M+1)^{|\mathcal{A}|} e^{-M\epsilon}.$$
Now, we specialize the action process $A$ to the action sequence $A_t = A_t^{ES}$ selected by ensemble sampling, and we will omit the superscripts in $p_t^A$ and $\hat{p}_t^A$. We can decompose the per-period regret of ensemble sampling as
$$\mathbb{E}\left[R^* - \theta^\top A_t \,\middle|\, \theta\right] = \mathbb{E}\left[(R^* - \theta^\top A_t)\, \mathbb{I}\left(d_{KL}(\hat{p}_t \,\|\, p_t) \ge \epsilon\right) \,\middle|\, \theta\right] + \mathbb{E}\left[(R^* - \theta^\top A_t)\, \mathbb{I}\left(d_{KL}(\hat{p}_t \,\|\, p_t) < \epsilon\right) \,\middle|\, \theta\right]. \quad (2)$$
The first term can be bounded by
$$\mathbb{E}\left[(R^* - \theta^\top A_t)\, \mathbb{I}\left(d_{KL}(\hat{p}_t \,\|\, p_t) \ge \epsilon\right) \,\middle|\, \theta\right] \le \Delta(\theta)\, \mathbb{P}\left(d_{KL}(\hat{p}_t \,\|\, p_t) \ge \epsilon \,\middle|\, \theta\right) \le \Delta(\theta)\, (t+1)^{|\mathcal{A}|} (M+1)^{|\mathcal{A}|} e^{-M\epsilon}.$$
To bound the second term, we will use another coupling argument that couples the actions that would be selected by ensemble sampling with those that would be selected by Thompson sampling. Let $A_t^{TS}$ denote the action that Thompson sampling would select at time $t$. On $\{d_{KL}(\hat{p}_t \,\|\, p_t) < \epsilon\}$, we have $\|\hat{p}_t - p_t\|_{TV} \le \sqrt{2\epsilon}$ by Pinsker's inequality. Conditioning on $\hat{p}_t$ and $p_t$, if $d_{KL}(\hat{p}_t \,\|\, p_t) < \epsilon$, we can construct random variables $\tilde{A}_t^{ES}$ and $\tilde{A}_t^{TS}$ such that they have the same distributions as $A_t^{ES}$ and $A_t^{TS}$, respectively. Using maximal coupling, we can make $\tilde{A}_t^{ES} = \tilde{A}_t^{TS}$ with probability at least $1 - \frac{1}{2}\|\hat{p}_t - p_t\|_{TV} \ge 1 - \sqrt{\epsilon/2}$. Then, the second term of the sum in (2) can be decomposed into
$$\mathbb{E}\left[(R^* - \theta^\top A_t)\, \mathbb{I}\left(d_{KL}(\hat{p}_t \,\|\, p_t) < \epsilon\right) \,\middle|\, \theta\right] = \mathbb{E}\left[\mathbb{E}\left[(R^* - \theta^\top \tilde{A}_t^{ES})\, \mathbb{I}\left(d_{KL}(\hat{p}_t \,\|\, p_t) < \epsilon,\ \tilde{A}_t^{ES} = \tilde{A}_t^{TS}\right) \,\middle|\, \hat{p}_t, p_t, \theta\right] \,\middle|\, \theta\right] + \mathbb{E}\left[\mathbb{E}\left[(R^* - \theta^\top \tilde{A}_t^{ES})\, \mathbb{I}\left(d_{KL}(\hat{p}_t \,\|\, p_t) < \epsilon,\ \tilde{A}_t^{ES} \ne \tilde{A}_t^{TS}\right) \,\middle|\, \hat{p}_t, p_t, \theta\right] \,\middle|\, \theta\right],$$
which, after some algebraic manipulations, leads to
$$\mathbb{E}\left[(R^* - \theta^\top A_t)\, \mathbb{I}\left(d_{KL}(\hat{p}_t \,\|\, p_t) < \epsilon\right) \,\middle|\, \theta\right] \le \mathbb{E}\left[R^* - \theta^\top A_t^{TS} \,\middle|\, \theta\right] + \sqrt{\epsilon/2}\, \Delta(\theta).$$
The result then follows from some straightforward algebra.

5 Computational results

In this section, we present computational results that demonstrate the viability of ensemble sampling. We start with the simple case of independent Gaussian bandits in Section 5.1 and move on to more complex neural network models in Section 5.2. Section 5.1 serves as a sanity check for the empirical performance of ensemble sampling, as Thompson sampling can be applied efficiently in this case and we are able to compare the performance of the two algorithms. In addition, we provide simulation results that demonstrate how the ensemble size grows with the number of actions. Section 5.2 goes beyond our theoretical analysis in Section 4 and gives computational evidence of the efficacy of ensemble sampling when applied to more complex models such as neural networks. We show that ensemble sampling, even with a few models, achieves efficient learning and outperforms ϵ-greedy and dropout on the example neural networks.

5.1 Gaussian bandits with independent arms

We consider a Gaussian bandit with $K$ actions, where action $k$ has mean reward $\theta_k$. Each $\theta_k$ is drawn i.i.d. from $N(0, 1)$. During each time step $t = 0, \ldots, T-1$, we select an action $k \in \{1, \ldots, K\}$ and observe reward $R_{t+1} = \theta_k + W_{t+1}$, where $W_{t+1} \sim N(0, 1)$. Note that this is a special case of Example 1. Since the posterior distribution of $\theta$ can be explicitly computed in this case, we use it as a sanity check for the performance of ensemble sampling.

Figure 1a shows the per-period regret of Thompson sampling and ensemble sampling applied to a Gaussian bandit with 50 independent arms. We see that as the number of models increases, ensemble sampling better approximates Thompson sampling. The results were averaged over 2,000 realizations. Figure 1b shows the minimum number of models required so that the expected per-period regret of ensemble sampling is no more than ϵ plus the expected per-period regret of Thompson sampling at some large time horizon T, across different numbers of actions. All results are averaged over 10,000 realizations. We chose T = 2000 and ϵ = 0.03.
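As a rough illustration of how such an experiment can be set up, here is a minimal sketch for one realization of the independent Gaussian bandit, with ensemble sampling specialized to independent arms (prior variance 1, noise variance 1, as above). The structure, function name, and defaults are our own; the figures report the paper's own averaged experiments, not this sketch.

```python
import numpy as np

def independent_gaussian_bandit_experiment(K=50, M=10, T=2000, seed=0):
    """One realization of ensemble sampling on a Gaussian bandit with K independent arms.

    Arms have theta_k ~ N(0, 1); observed rewards are theta_k + N(0, 1).  The Section 3
    update specializes to per-arm averages of perturbed rewards plus a perturbed prior.
    Returns the per-period regret.
    """
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal(K)                  # true mean rewards
    best = theta.max()
    perturbed_sums = rng.standard_normal((M, K))    # theta_{0,m} ~ N(0, I), accumulates perturbed rewards
    counts = np.zeros(K)
    regret = []
    for t in range(T):
        m = rng.integers(M)                         # sample a model uniformly
        models = perturbed_sums / (1.0 + counts)    # per-model, per-arm posterior-mean analogue
        k = int(np.argmax(models[m]))               # greedy arm under the sampled model
        reward = theta[k] + rng.standard_normal()
        counts[k] += 1
        # Each model sees the reward plus its own independent N(0, 1) perturbation.
        perturbed_sums[:, k] += reward + rng.standard_normal(M)
        regret.append(best - theta[k])
    return np.array(regret)
```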
The plot shows that the number of models needed seems to grow sublinearly with the number of actions, which is stronger than the bound proved in Section 4.

Figure 1: (a) Ensemble sampling compared with Thompson sampling on a Gaussian bandit with 50 independent arms. (b) Minimum number of models required so that the expected per-period regret of ensemble sampling is no more than ϵ = 0.03 plus the expected per-period regret of Thompson sampling at T = 2000 for Gaussian bandits across different numbers of arms.

5.2 Neural networks

In this section, we follow Example 2 and show computational results of ensemble sampling applied to neural networks.

Figure 2 shows ϵ-greedy and ensemble sampling applied to a bandit problem where the mapping from actions to expected rewards is represented by a neuron. More specifically, we have a set of $K$ actions $\mathcal{A} \subseteq \Re^N$. The mean reward of selecting an action $a \in \mathcal{A}$ is given by $g_\theta(a) = \max(0, \theta^\top a)$, where the weights $\theta \in \Re^N$ are drawn from $N(0, \lambda I)$. During each time period, we select an action $A_t \in \mathcal{A}$ and observe reward $R_{t+1} = g_\theta(A_t) + Z_{t+1}$, where $Z_{t+1} \sim N(0, \sigma_z^2)$. We set the input dimension $N = 100$, number of actions $K = 100$, prior variance $\lambda = 10$, and noise variance $\sigma_z^2 = 100$. Each dimension of each action was sampled uniformly from $[-1, 1]$, except for the last dimension, which was set to 1.

In Figure 3, we consider a bandit problem where the mapping from actions to expected rewards is represented by a two-layer neural network with weights $\theta \equiv (W_1, W_2)$, where $W_1 \in \Re^{D \times N}$ and $W_2 \in \Re^D$. Each entry of the weight matrices is drawn independently from $N(0, \lambda)$. There is a set of $K$ actions $\mathcal{A} \subseteq \Re^N$. The mean reward of choosing an action $a \in \mathcal{A}$ is $g_\theta(a) = W_2^\top \max(0, W_1 a)$. During each time period, we select an action $A_t \in \mathcal{A}$ and observe reward $R_{t+1} = g_\theta(A_t) + Z_{t+1}$, where $Z_{t+1} \sim N(0, \sigma_z^2)$. We used $N = 100$ for the input dimension, $D = 50$ for the dimension of the hidden layer, number of actions $K = 100$, prior variance $\lambda = 1$, and noise variance $\sigma_z^2 = 100$. Each dimension of each action was sampled uniformly from $[-1, 1]$, except for the last dimension, which was set to 1.

Ensemble sampling with $M$ models starts by sampling $\tilde{\theta}_m$ from the prior distribution independently for each model $m$. At each time step, we pick a model $m$ uniformly at random and apply the greedy action with respect to that model. We update the ensemble incrementally. During each time period, we apply a few steps of stochastic gradient descent for each model $m$ with respect to the loss function
$$\mathcal{L}(\theta) = \sum_{\tau=0}^{t-1} \left(R_{\tau+1} + \tilde{Z}_{\tau+1,m} - g_\theta(A_\tau)\right)^2 + \frac{1}{\lambda} \left\|\theta - \tilde{\theta}_m\right\|_2^2,$$
where the perturbations $(\tilde{Z}_{t,m} : t = 1, \ldots, T,\ m = 1, \ldots, M)$ are drawn i.i.d. from $N(0, \sigma_z^2)$.

Besides ensemble sampling, there are other heuristics for sampling from an approximate posterior distribution over neural networks, which may be used to develop approximate Thompson sampling. Gal and Ghahramani proposed an approach based on dropout [4] to approximately sample from a posterior over neural networks. In Figure 3, we include results from using dropout to approximate Thompson sampling on the two-layer neural network bandit. To facilitate gradient flow, we used leaky ReLUs of the form $\max(0.01x, x)$ internally in all agents, while the target neural nets still use regular ReLUs as described above.
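For the single-neuron bandit, one perturbed-loss model update might look like the following sketch. It uses plain NumPy with an explicit subgradient for the ReLU; a practical implementation would use an autodiff framework, and the function names, argument layout, and hyperparameter defaults here are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def perturbed_loss_and_grad(theta, theta_prior, A_hist, R_hist, Z_hist, lam):
    """Perturbed regularized loss for one ensemble member on the single-neuron bandit.

    A_hist: (n, N) past actions; R_hist: (n,) observed rewards;
    Z_hist: (n,) this model's fixed N(0, sigma_z^2) reward perturbations;
    theta_prior: this model's prior sample theta~_m; lam: prior variance.
    """
    preact = A_hist @ theta
    pred = np.maximum(0.0, preact)                      # g_theta(a) = max(0, theta^T a)
    resid = R_hist + Z_hist - pred
    loss = np.sum(resid**2) + np.sum((theta - theta_prior)**2) / lam
    relu_grad = (preact > 0).astype(float)              # subgradient of the ReLU
    grad = -2.0 * A_hist.T @ (resid * relu_grad) + 2.0 * (theta - theta_prior) / lam
    return loss, grad

def sgd_update(theta, theta_prior, A_hist, R_hist, Z_hist, lam, lr=0.1, steps=3, batch=64, rng=None):
    """A few minibatch gradient steps per time period, as in the experiments."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(R_hist)
    if n == 0:
        return theta
    for _ in range(steps):
        idx = rng.integers(n, size=min(batch, n))
        _, grad = perturbed_loss_and_grad(theta, theta_prior,
                                          A_hist[idx], R_hist[idx], Z_hist[idx], lam)
        theta = theta - lr * grad
    return theta
```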
We took 3 stochastic gradient steps with a minibatch size of 64 for each model update. We used a learning rate of 1e-1 for ϵ-greedy and ensemble sampling, and learning rates of 1e-2, 1e-2, 2e-2, and 5e-2 for dropout with dropping probabilities 0.25, 0.5, 0.75, and 0.9, respectively. All results were averaged over around 1,000 realizations.

Figure 2 plots the per-period regret of ϵ-greedy and ensemble sampling on the single neuron bandit. We see that ensemble sampling, even with 10 models, performs better than ϵ-greedy with the best tuned parameters. Increasing the size of the ensemble further improves the performance. An ensemble of size 50 achieves orders of magnitude lower regret than ϵ-greedy.

Figures 3a and 3b show different versions of ϵ-greedy applied to the two-layer neural network model. We see that ϵ-greedy with an annealing schedule tends to perform better than a fixed ϵ. Figure 3c plots the per-period regret of the dropout approach with different dropping probabilities, which seems to perform worse than ϵ-greedy. Figure 3d plots the per-period regret of ensemble sampling on the neural net bandit. Again, we see that ensemble sampling, with a moderate number of models, outperforms the other approaches by a significant amount.

Figure 2: (a) ϵ-greedy and (b) ensemble sampling applied to a single neuron bandit.

Figure 3: (a) Fixed ϵ-greedy, (b) annealing ϵ-greedy, (c) dropout, and (d) ensemble sampling applied to a two-layer neural network bandit.

6 Conclusion

Ensemble sampling offers a potentially efficient means to approximate Thompson sampling when using complex models such as neural networks. We have provided an analysis that offers theoretical assurances for the case of linear bandit models and computational results that demonstrate efficacy with complex neural network models.

We are motivated largely by the need for effective exploration methods that can efficiently be applied in conjunction with complex models such as neural networks. Ensemble sampling offers one approach to representing uncertainty in neural network models, and there are others that might also be brought to bear in developing approximate versions of Thompson sampling [1, 4]. The analysis of various other forms of approximate Thompson sampling remains open.

Ensemble sampling loosely relates to ensemble learning methods [3], though an important difference in motivation lies in the fact that the latter learns multiple models for the purpose of generating a more accurate model through their combination, while the former learns multiple models to reflect uncertainty in the posterior distribution over models. That said, combining the two related approaches may be fruitful. In particular, there may be practical benefit to learning many forms of models (neural networks, tree-based models, etc.) and viewing the ensemble as representing uncertainty from which one can sample.
Acknowledgments

This work was generously supported by a research grant from Boeing and a Marketing Research Award from Adobe.

References

[1] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML'15), pages 1613-1622. JMLR.org, 2015.

[2] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 2249-2257. Curran Associates, Inc., 2011.

[3] Thomas G. Dietterich. Ensemble learning. The Handbook of Brain Theory and Neural Networks, 2:110-125, 2002.

[4] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Maria-Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1050-1059, New York, NY, USA, 20-22 Jun 2016. PMLR.

[5] Carlos Gómez-Uribe. Online algorithms for parameter mean and variance estimation in dynamic regression. arXiv preprint arXiv:1605.05697, 2016.

[6] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4026-4034. Curran Associates, Inc., 2016.

[7] George Papandreou and Alan L. Yuille. Gaussian sampling by local perturbations. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 1858-1866. Curran Associates, Inc., 2010.

[8] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285-294, 1933.