# Deep Exploration via Bootstrapped DQN

Ian Osband (1,2), Charles Blundell (2), Alexander Pritzel (2), Benjamin Van Roy (1)
(1) Stanford University, (2) Google DeepMind
{iosband, cblundell, apritzel}@google.com, bvr@stanford.edu

Efficient exploration remains a major challenge for reinforcement learning (RL). Common dithering strategies for exploration, such as ε-greedy, do not carry out temporally-extended (or deep) exploration; this can lead to exponentially larger data requirements. However, most algorithms for statistically efficient RL are not computationally tractable in complex environments. Randomized value functions offer a promising approach to efficient exploration with generalization, but existing algorithms are not compatible with nonlinearly parameterized value functions. As a first step towards addressing such contexts we develop bootstrapped DQN. We demonstrate that bootstrapped DQN can combine deep exploration with deep neural networks for exponentially faster learning than any dithering strategy. In the Arcade Learning Environment bootstrapped DQN substantially improves learning speed and cumulative performance across most games.

1 Introduction

We study the reinforcement learning (RL) problem where an agent interacts with an unknown environment. The agent takes a sequence of actions in order to maximize cumulative rewards. Unlike standard planning problems, an RL agent does not begin with perfect knowledge of the environment, but learns through experience. This leads to a fundamental trade-off of exploration versus exploitation: the agent may improve its future rewards by exploring poorly understood states and actions, but this may require sacrificing immediate rewards. To learn efficiently an agent should explore only when there are valuable learning opportunities. Further, since any action may have long-term consequences, the agent should reason about the informational value of possible observation sequences. Without this sort of temporally extended (deep) exploration, learning times can worsen by an exponential factor.

The theoretical RL literature offers a variety of provably-efficient approaches to deep exploration [9]. However, most of these are designed for Markov decision processes (MDPs) with small finite state spaces, while others require solving computationally intractable planning tasks [8]. These algorithms are not practical in complex environments where an agent must generalize to operate effectively. For this reason, large-scale applications of RL have relied upon statistically inefficient strategies for exploration [12] or even no exploration at all [23]. We review related literature in more detail in Section 4.

Common dithering strategies, such as ε-greedy, approximate the value of an action by a single number. Most of the time they pick the action with the highest estimate, but sometimes they choose another action at random. In this paper, we consider an alternative approach to efficient exploration inspired by Thompson sampling. These algorithms maintain some notion of uncertainty: rather than a single point estimate, they maintain a distribution over possible values. They explore by randomly selecting a policy according to the probability that it is the optimal policy. Recent work has shown that randomized value functions can implement something similar to Thompson sampling without the need for an intractable exact posterior update. However, this work is restricted to linearly-parameterized value functions [16].
We present a natural extension of this approach that enables use of complex non-linear generalization methods such as deep neural networks. We show that the bootstrap with random initialization can produce reasonable uncertainty estimates for neural networks at low computational cost. Bootstrapped DQN leverages these uncertainty estimates for efficient (and deep) exploration. We demonstrate that these benefits can extend to large-scale problems that are not designed to highlight deep exploration. Bootstrapped DQN substantially reduces learning times and improves performance across most games. This algorithm is computationally efficient and parallelizable; on a single machine our implementation runs roughly 20% slower than DQN.

2 Uncertainty for neural networks

Deep neural networks (DNNs) represent the state of the art in many supervised and reinforcement learning domains [12]. We want an exploration strategy that is statistically and computationally efficient together with a DNN representation of the value function. To explore efficiently, the first step is to quantify uncertainty in value estimates so that the agent can judge the potential benefits of exploratory actions. The neural network literature presents a sizable body of work on uncertainty quantification founded on parametric Bayesian inference [3, 7]. We actually found the simple non-parametric bootstrap with random initialization [5] more effective in our experiments, but the main ideas of this paper would apply with any other approach to uncertainty in DNNs.

The bootstrap principle is to approximate a population distribution by a sample distribution [6]. In its most common form, the bootstrap takes as input a data set $D$ and an estimator $\psi$. To generate a sample from the bootstrapped distribution, a data set $\tilde{D}$ of cardinality equal to that of $D$ is sampled uniformly with replacement from $D$. The bootstrap sample estimate is then taken to be $\psi(\tilde{D})$. The bootstrap is widely hailed as a great advance of 20th-century applied statistics and even comes with theoretical guarantees [2].

In Figure 1a we present an efficient and scalable method for generating bootstrap samples from a large and deep neural network. The network consists of a shared architecture with $K$ bootstrapped heads branching off independently. Each head is trained only on its bootstrapped sub-sample of the data and represents a single bootstrap sample $\psi(\tilde{D})$. The shared network learns a joint feature representation across all the data, which can provide significant computational advantages at the cost of lower diversity between heads. This type of bootstrap can be trained efficiently in a single forward/backward pass; it can be thought of as a data-dependent dropout, where the dropout mask for each head is fixed for each data point [19].

Figure 1: Bootstrapped neural nets can produce reasonable posterior estimates for regression. (a) Shared network architecture; (b) Gaussian process posterior; (c) Bootstrapped neural nets.

Figure 1 presents an example of uncertainty estimates from bootstrapped neural networks on a regression task with noisy data. We trained fully-connected 2-layer neural networks with 50 rectified linear units (ReLU) in each layer on 50 bootstrapped samples from the data. As is standard, we initialize these networks with random parameter values; this induces an important initial diversity in the models. We were unable to generate effective uncertainty estimates for this problem using the dropout approach in prior literature [7]. Further details are provided in Appendix A.
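To make the bootstrap principle used in this section concrete, the sketch below resamples a data set with replacement and applies an estimator to each resample; the spread of the resulting estimates approximates the sampling uncertainty of $\psi$. This is an illustrative numpy sketch with our own function names and toy data, not code from the paper.

```python
import numpy as np

def bootstrap_estimates(data, estimator, n_samples=1000, seed=0):
    """Non-parametric bootstrap: resample `data` with replacement and apply
    `estimator` to each resampled data set (the psi(D-tilde) of the text)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    return np.array([estimator(data[rng.integers(0, n, size=n)])
                     for _ in range(n_samples)])

# Toy example: uncertainty in the mean of 50 noisy observations.
data = np.random.default_rng(1).normal(loc=2.0, scale=1.0, size=50)
estimates = bootstrap_estimates(data, np.mean)
print(estimates.mean(), estimates.std())  # spread approximates the estimator's uncertainty
```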
3 Bootstrapped DQN

For a policy $\pi$ we define the value of an action $a$ in state $s$ as

$$Q^\pi(s, a) := \mathbb{E}_{s,a,\pi}\Big[\textstyle\sum_{t=1}^{\infty} \gamma^t r_t\Big],$$

where $\gamma \in (0, 1)$ is a discount factor that balances immediate versus future rewards $r_t$. This expectation indicates that the initial state is $s$, the initial action is $a$, and thereafter actions are selected by the policy $\pi$. The optimal value is $Q^*(s, a) := \max_\pi Q^\pi(s, a)$. To scale to large problems, we learn a parameterized estimate of the Q-value function $Q(s, a; \theta)$ rather than a tabular encoding. We use a neural network to estimate this value.

The Q-learning update from state $s_t$, action $a_t$, reward $r_t$ and new state $s_{t+1}$ is given by

$$\theta_{t+1} \leftarrow \theta_t + \alpha \big(y_t^Q - Q(s_t, a_t; \theta_t)\big)\, \nabla_\theta Q(s_t, a_t; \theta_t), \qquad (1)$$

where $\alpha$ is the scalar learning rate and $y_t^Q$ is the target value $r_t + \gamma \max_a Q(s_{t+1}, a; \theta^-)$; here $\theta^-$ are the target network parameters, held fixed at a recent copy of $\theta$. Several important modifications to the Q-learning update improve stability for DQN [12]. First, the algorithm learns from sampled transitions from an experience buffer, rather than learning fully online. Second, the algorithm uses a target network with parameters $\theta^-$ that are copied from the learning network ($\theta^- \leftarrow \theta_t$) only every $\tau$ time steps and then kept fixed in between updates. Double DQN [25] modifies the target $y_t^Q$ and helps further (in this paper we use the DDQN update for all DQN variants unless explicitly stated):

$$y_t^Q \leftarrow r_t + \gamma\, Q\big(s_{t+1}, \arg\max_a Q(s_{t+1}, a; \theta_t);\ \theta^-\big).$$

Bootstrapped DQN modifies DQN to approximate a distribution over Q-values via the bootstrap. At the start of each episode, bootstrapped DQN samples a single Q-value function from its approximate posterior. The agent then follows the policy which is optimal for that sample for the duration of the episode. This is a natural adaptation of the Thompson sampling heuristic to RL that allows for temporally extended (or deep) exploration [21, 13].

We implement this algorithm efficiently by building up $K \in \mathbb{N}$ bootstrapped estimates of the Q-value function in parallel as in Figure 1a. Importantly, each of these value function heads $Q_k(s, a; \theta)$ is trained against its own target network $Q_k(s, a; \theta^-)$. This means that each of $Q_1, \dots, Q_K$ provides a temporally extended (and consistent) estimate of the value uncertainty via TD estimates. In order to keep track of which data belongs to which bootstrap head we store flags $w_1, \dots, w_K \in \{0, 1\}$ indicating which heads are privy to which data. We approximate a bootstrap sample by selecting $k \in \{1, \dots, K\}$ uniformly at random and following $Q_k$ for the duration of that episode. We present a detailed algorithm for our implementation of bootstrapped DQN in Appendix B.
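As a sketch of how per-head targets of the Double-DQN form above can be computed in practice, the numpy function below selects the greedy next action with each head's online values and evaluates it with that head's target values. The names, shapes and the terminal-state masking via `dones` (a standard detail not spelled out in the text) are our own assumptions, not the paper's released code.

```python
import numpy as np

def double_dqn_targets(rewards, dones, q_next_online, q_next_target, gamma=0.99):
    """Per-head Double-DQN targets y_k = r + gamma * Q_k(s', argmax_a Q_k(s', a; theta); theta^-).

    q_next_online, q_next_target: arrays of shape [K, batch, n_actions] holding each
    head's online and target Q-values at the next state s'.
    rewards, dones: arrays of shape [batch].
    Returns an array of shape [K, batch] with one TD target per head and transition.
    """
    greedy = q_next_online.argmax(axis=2)                    # [K, batch] greedy action per head
    K, B = greedy.shape
    q_eval = q_next_target[np.arange(K)[:, None],            # evaluate the chosen action
                           np.arange(B)[None, :], greedy]    # with the head's target network
    return rewards[None, :] + gamma * (1.0 - dones[None, :]) * q_eval

# Tiny usage example with random values: 10 heads, batch of 4, 3 actions.
rng = np.random.default_rng(0)
y = double_dqn_targets(rng.random(4), np.zeros(4),
                       rng.normal(size=(10, 4, 3)), rng.normal(size=(10, 4, 3)))
```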
4 Related work

The observation that temporally extended exploration is necessary for efficient reinforcement learning is not new. For any prior distribution over MDPs, the optimal exploration strategy is available through dynamic programming in the Bayesian belief state space. However, the exact solution is intractable even for very simple systems [8]. Many successful RL applications focus on generalization and planning, but address exploration only through statistically inefficient dithering [12] or no exploration at all [23]; such strategies can be highly inefficient.

Many exploration strategies are guided by the principle of optimism in the face of uncertainty (OFU). These algorithms add an exploration bonus to values of state-action pairs that may lead to useful learning and select actions to maximize these adjusted values. This approach was first proposed for finite-armed bandits [11], but the principle has been extended successfully across bandits with generalization and tabular RL [9]. Except for particular deterministic contexts [27], OFU methods that lead to efficient RL in complex domains have been computationally intractable. The work of [20] aims to add an effective bonus through a variation of DQN. The resulting algorithm relies on a large number of hand-tuned parameters and is only suitable for application to deterministic problems. We compare our results on Atari to theirs in Appendix D and find that bootstrapped DQN offers a significant improvement over previous methods.

Perhaps the oldest heuristic for balancing exploration with exploitation is given by Thompson sampling [24]. This bandit algorithm takes a single sample from the posterior at every time step and chooses the action which is optimal for that sample. To apply the Thompson sampling principle to RL, an agent should sample a value function from its posterior. Naive applications of Thompson sampling to RL which resample every timestep can be extremely inefficient. The agent must also commit to this sample for several time steps in order to achieve deep exploration [21, 8]. The algorithm PSRL does exactly this, with state-of-the-art guarantees [13, 14]. However, this algorithm still requires solving a single known MDP, which will usually be intractable for large systems.

Our new algorithm, bootstrapped DQN, approximates this approach to exploration via randomized value functions sampled from an approximate posterior. Recently, authors have proposed the RLSVI algorithm, which accomplishes this for linearly parameterized value functions. Surprisingly, RLSVI recovers state-of-the-art guarantees in the setting with tabular basis functions, but its performance is crucially dependent upon a suitable linear representation of the value function [16]. We extend these ideas to produce an algorithm that can simultaneously perform generalization and exploration with a flexible nonlinear value function representation. Our method is simple, general and compatible with almost all advances in deep RL, at low computational cost and with few tuning parameters.
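For intuition, here is a minimal sketch of the Thompson sampling heuristic in its original bandit setting: sample once from the posterior over each arm's success probability, act greedily with respect to that sample, observe a reward, and update the posterior. This is a standard Beta-Bernoulli example for illustration only; it is not part of the paper's method.

```python
import numpy as np

def thompson_bernoulli(true_probs, n_steps=1000, seed=0):
    """Thompson sampling for a Bernoulli bandit with independent Beta(1, 1) priors."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_probs)
    alpha, beta = np.ones(n_arms), np.ones(n_arms)   # Beta posterior parameters per arm
    total_reward = 0.0
    for _ in range(n_steps):
        theta = rng.beta(alpha, beta)                # one posterior sample per arm
        a = int(np.argmax(theta))                    # act greedily w.r.t. the sample
        r = float(rng.random() < true_probs[a])      # observe a Bernoulli reward
        alpha[a] += r                                # conjugate posterior update
        beta[a] += 1.0 - r
        total_reward += r
    return total_reward

print(thompson_bernoulli([0.1, 0.5, 0.6]))           # play concentrates on the best arm
```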
5 Deep Exploration

Uncertainty estimates allow an agent to direct its exploration at potentially informative states and actions. In bandits, this choice of directed exploration rather than dithering is what generally distinguishes efficient algorithms. The story in RL is not as simple: directed exploration is not enough to guarantee efficiency; the exploration must also be deep. Deep exploration means exploration which is directed over multiple time steps; it can also be called "planning to learn" or far-sighted exploration. Unlike bandit problems, which balance actions that are immediately rewarding against actions that are immediately informative, RL settings require planning over several time steps [10]. For exploitation, this means that an efficient agent must consider the future rewards over several time steps and not simply the myopic rewards. In exactly the same way, efficient exploration may require taking actions which are neither immediately rewarding nor immediately informative.

To illustrate this distinction, consider a simple deterministic chain $\{s_{-3}, \dots, s_{+3}\}$ with a three-step horizon starting from state $s_0$. This MDP is known to the agent a priori, with deterministic actions left and right. All states have zero reward, except for the leftmost state $s_{-3}$, which has known reward $\epsilon > 0$, and the rightmost state $s_{+3}$, which is unknown. In order to reach either a rewarding state or an informative state within three steps from $s_0$ the agent must plan a consistent strategy over several time steps. Figure 2 depicts the planning and look-ahead trees for several algorithmic approaches in this example MDP. The action left is gray, the action right is black. Rewarding states are depicted in red, informative states in blue. Dashed lines indicate that the agent can plan ahead for either rewards or information. Unlike bandit algorithms, an RL agent can plan to exploit future rewards. Only an RL agent with deep exploration can plan to learn.

Figure 2: Planning, learning and exploration in RL. (a) Bandit algorithm; (b) RL + dithering; (c) RL + shallow explore; (d) RL + deep explore.

5.1 Testing for deep exploration

We now present a series of didactic computational experiments designed to highlight the need for deep exploration. These environments can be described by chains of length $N > 3$ as in Figure 3. Each episode of interaction lasts $N + 9$ steps, after which point the agent resets to the initial state $s_2$. These are toy problems intended to be expository rather than entirely realistic. Balancing a well-known and mildly successful strategy against an unknown, but potentially more rewarding, approach can emerge in many practical applications.

Figure 3: Scalable environments that require deep exploration.

These environments may be described by a finite tabular MDP. However, we consider algorithms which interact with the MDP only through raw pixel features. We consider two feature mappings $\phi_{1hot}(s_t) := (\mathbb{1}\{x = s_t\})$ and $\phi_{therm}(s_t) := (\mathbb{1}\{x \le s_t\})$ in $\{0, 1\}^N$. We present results for $\phi_{therm}$, which worked better for all DQN variants due to better generalization, but the difference was relatively small; see Appendix C. Thompson DQN is the same as bootstrapped DQN, but resamples every timestep. Ensemble DQN uses the same architecture as bootstrapped DQN, but with an ensemble policy.

We say that an algorithm has successfully learned the optimal policy when it has completed one hundred episodes with the optimal reward of 10. For each chain length, we ran each learning algorithm for 2000 episodes across three seeds. We plot the median time to learn in Figure 4, together with a conservative lower bound of $99 + 2^{N-11}$ on the expected time to learn for any shallow exploration strategy [16]. Only bootstrapped DQN demonstrates a graceful scaling to long chains which require deep exploration.

Figure 4: Only bootstrapped DQN demonstrates deep exploration.
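As a concrete reference for the environments in Section 5.1, here is a minimal sketch of a chain of length $N$ with the one-hot and thermometer feature encodings. The exact reward magnitudes and dynamics details below are our assumptions for illustration, chosen so that a small known reward near the start competes with a larger reward at the far end of the chain; they are not the paper's exact specification.

```python
import numpy as np

class ChainEnv:
    """Deterministic chain of length N, starting each episode at s_2.

    A small known reward sits at s_1 and a larger reward at s_N; every episode
    lasts N + 9 steps. Reward magnitudes (0.001 and 1.0) are illustrative only.
    """
    def __init__(self, N=20):
        self.N, self.horizon = N, N + 9

    def reset(self):
        self.s, self.t = 2, 0
        return self.s

    def step(self, action):                  # action 1 = left, action 2 = right
        self.s = max(1, self.s - 1) if action == 1 else min(self.N, self.s + 1)
        self.t += 1
        reward = 0.001 if self.s == 1 else (1.0 if self.s == self.N else 0.0)
        return self.s, reward, self.t >= self.horizon

def phi_1hot(s, N):                          # (1{x = s}) in {0, 1}^N
    return (np.arange(1, N + 1) == s).astype(np.float32)

def phi_therm(s, N):                         # (1{x <= s}) in {0, 1}^N
    return (np.arange(1, N + 1) <= s).astype(np.float32)

env = ChainEnv(N=10)
print(phi_therm(env.reset(), env.N))         # [1. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
```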
5.2 How does bootstrapped DQN drive deep exploration?

Bootstrapped DQN explores in a manner similar to the provably-efficient algorithm PSRL [13], but it uses a bootstrapped neural network to approximate a posterior sample for the value. Unlike PSRL, bootstrapped DQN directly samples a value function and so does not require further planning steps. This algorithm is similar to RLSVI, which is also provably-efficient [16], but with a neural network instead of a linear value function and a bootstrap instead of Gaussian sampling. The analysis for the linear setting suggests that this nonlinear approach will work well so long as the distribution $\{Q_1, \dots, Q_K\}$ remains stochastically optimistic [16], or at least as spread out as the correct posterior.

Bootstrapped DQN relies upon random initialization of the network weights as a prior to induce diversity. Surprisingly, we found this initial diversity was enough to maintain diverse generalization to new and unseen states for large and deep neural networks. This is effective for our experimental setting, but will not work in all situations. In general it may be necessary to maintain some more rigorous notion of "prior", potentially through the use of artificial prior data to maintain diversity [15]. One potential explanation for the efficacy of simple random initialization is that, unlike supervised learning or bandits, where all networks fit the same data, each of our $Q_k$ heads has a unique target network. This, together with stochastic minibatches and flexible nonlinear representations, means that even small differences at initialization may grow as the heads refit to their own TD errors.

Bootstrapped DQN does not require that any single network $Q_k$ is initialized to the correct policy of right at every step, which would be exponentially unlikely for large chains of length $N$. For the algorithm to be successful in this example we only require that the networks generalize in a diverse way to the actions they have never chosen in the states they have not visited very often. Imagine that, in the example above, the network has made it as far as state $\tilde{n} < N$, but never observed the action right, $a = 2$. As long as one head $k$ imagines $Q(\tilde{n}, 2) > Q(\tilde{n}, 1)$, then TD bootstrapping can propagate this signal back to $s = 1$ through the target network to drive deep exploration. The expected time for these estimates at $\tilde{n}$ to propagate to at least one head grows gracefully in $\tilde{n}$, even for relatively small $K$, as our experiments show. We expand upon this intuition with a video designed to highlight how bootstrapped DQN demonstrates deep exploration: https://youtu.be/e3KuV_d0EMk. We present further evaluation on a difficult stochastic MDP in Appendix C.
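The propagation argument above can be seen in a tiny tabular sketch: seed one head with an optimistic value for right at a frontier state, then repeatedly replay the transitions it has actually seen against a periodically refreshed target table. After enough sweeps the optimism has travelled back to the start of the chain. The chain length, learning rate and refresh schedule below are arbitrary illustrative choices, not the paper's settings.

```python
import numpy as np

# One head's tabular Q-values on a length-N chain; actions: 1 = left, 2 = right.
N, gamma, lr = 10, 0.99, 0.5
q = np.zeros((N + 1, 3))            # q[s, a]; index 0 and action 0 unused for clarity
q[N - 1, 2] = 1.0                   # optimistic guess at the frontier state it has reached
target = q.copy()                   # this head's own target table

for sweep in range(60):
    # Replay the transitions the head has actually seen: (s, right, reward 0, s + 1).
    for s in range(1, N - 1):
        td_target = 0.0 + gamma * target[s + 1].max()
        q[s, 2] += lr * (td_target - q[s, 2])
    if sweep % 5 == 0:
        target = q.copy()           # periodic target refresh, as in DQN

print(q[1, 2] > 0)                  # True: optimism has propagated back to s = 1
```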
6 Arcade Learning Environment

We now evaluate our algorithm across 49 Atari games on the Arcade Learning Environment [1]. Importantly, and unlike the experiments in Section 5, these domains are not specifically designed to showcase our algorithm. In fact, many Atari games are structured so that small rewards always indicate part of an optimal policy, which may be crucial for the strong performance observed by dithering strategies. (By contrast, imagine that the agent received a small immediate reward for dying; dithering strategies would be hopeless at solving this problem, just like Section 5.) We find that exploration via bootstrapped DQN produces significant gains versus ε-greedy in this setting. Bootstrapped DQN reaches peak performance roughly similar to DQN. However, our improved exploration means we reach human performance on average 30% faster across all games. This translates to significantly improved cumulative rewards through learning.

We follow the setup of [25] for our network architecture and benchmark our performance against their algorithm. Our network structure is identical to the convolutional structure of DQN [12] except we split into 10 separate bootstrap heads after the convolutional layer, as per Figure 1a. Recently, several authors have provided architectural and algorithmic improvements to DDQN [26, 18]. We do not compare our results to these since their advances are orthogonal to our concern and could easily be incorporated into our bootstrapped DQN design. Full details of our experimental set-up are available in Appendix D.

6.1 Implementing bootstrapped DQN at scale

We now examine how to generate online bootstrap samples for DQN in a computationally efficient manner. We focus on three key questions: how many heads do we need, how should we pass gradients to the shared network, and how should we bootstrap data online? We make significant compromises in order to maintain a computational cost comparable to DQN.

Figure 5a presents the cumulative reward of bootstrapped DQN on the game Breakout for different numbers of heads $K$. More heads lead to faster learning, but even a small number of heads captures most of the benefits of bootstrapped DQN. We choose $K = 10$.

Figure 5: Examining the sensitivities of bootstrapped DQN. (a) Number of bootstrap heads K; (b) Probability of data sharing p.

The shared network architecture allows us to train this combined network via backpropagation. Feeding $K$ network heads to the shared convolutional network effectively increases the learning rate for this portion of the network. In some games, this leads to premature and sub-optimal convergence. We found the best final scores by normalizing the gradients by $1/K$, but this also leads to slower early learning. See Appendix D for more details.

To implement an online bootstrap we use an independent Bernoulli mask $w_1, \dots, w_K \sim \mathrm{Ber}(p)$ for each head in each episode ($p = 0.5$ is a double-or-nothing bootstrap [17]; $p = 1$ is an ensemble with no bootstrapping at all). These flags are stored in the memory replay buffer and identify which heads are trained on which data. However, when trained using a shared minibatch the algorithm will also require effectively $1/p$ more iterations; this is undesirable computationally. Surprisingly, we found the algorithm performed similarly irrespective of $p$, and all settings outperformed DQN, as shown in Figure 5b. This is strange and we discuss this phenomenon in Appendix D. However, in light of this empirical observation for Atari, we chose $p = 1$ to save on minibatch passes. As a result bootstrapped DQN runs at a similar computational speed to vanilla DQN on identical hardware (our implementation with $K = 10$ and $p = 1$ ran with less than a 20% increase in wall-time versus DQN).
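As a sketch of how the bootstrap masks and the $1/K$ normalization described above can enter the loss, the function below gates each head's TD error with its stored mask and rescales the summed loss by $1/K$. In the paper the $1/K$ factor is applied to gradients flowing into the shared trunk; folding it into the total loss, as done here for simplicity, also rescales the head-specific gradients, so treat this as an approximation for illustration rather than the exact implementation.

```python
import numpy as np

def masked_bootstrap_loss(q_pred, q_target, masks, K_normalize=True):
    """Per-head TD loss gated by Bernoulli bootstrap masks.

    q_pred, q_target: [K, batch] predicted and target Q-values for the actions taken.
    masks: [K, batch] flags w_k stored with each transition; head k only trains on
    transitions where its flag is 1.
    """
    per_head = masks * (q_pred - q_target) ** 2          # squared TD error, masked per head
    denom = np.maximum(masks.sum(axis=1), 1.0)           # average over each head's active data
    loss_per_head = per_head.sum(axis=1) / denom
    scale = 1.0 / q_pred.shape[0] if K_normalize else 1.0
    return scale * loss_per_head.sum()

# Masks for a new episode: one Bernoulli(p) flag per head, stored with its transitions.
rng = np.random.default_rng(0)
K, p, batch = 10, 0.5, 32
episode_masks = rng.binomial(1, p, size=K)
loss = masked_bootstrap_loss(rng.normal(size=(K, batch)), rng.normal(size=(K, batch)),
                             np.repeat(episode_masks[:, None], batch, axis=1).astype(float))
```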
6.2 Efficient exploration in Atari

We find that bootstrapped DQN drives efficient exploration in several Atari games. For the same amount of game experience, bootstrapped DQN generally outperforms DQN with ε-greedy exploration. Figure 6 demonstrates this effect for a diverse selection of games.

Figure 6: Bootstrapped DQN drives more efficient exploration.

On games where DQN performs well, bootstrapped DQN typically performs better. Bootstrapped DQN does not reach human performance on Amidar (DQN does) but does on Beam Rider and Battle Zone (DQN does not). To summarize this improvement in learning time we consider the number of frames required to reach human performance. If bootstrapped DQN reaches human performance in $1/x$ of the frames DQN requires, we say it has improved by a factor of $x$. Figure 7 shows that bootstrapped DQN typically reaches human performance significantly faster.

Figure 7: Bootstrapped DQN reaches human performance faster than DQN.

On most games where DQN does not reach human performance, bootstrapped DQN does not solve the problem by itself. On some challenging Atari games where deep exploration is conjectured to be important [25], our results are not entirely successful, but still promising. In Frostbite, bootstrapped DQN reaches the second level much faster than DQN, but network instabilities cause the performance to crash. In Montezuma's Revenge, bootstrapped DQN reaches the first key after 20m frames (DQN never observes a reward even after 200m frames) but does not properly learn from this experience; an improved training method, such as prioritized replay [18], may help solve this problem. Our results suggest that improved exploration may help to solve these remaining games, but they also highlight the importance of other problems like network instability, reward clipping and temporally extended rewards.

6.3 Overall performance

Bootstrapped DQN is able to learn much faster than DQN. Figure 8 shows that bootstrapped DQN also improves upon the final score across most games. However, the real benefits of efficient exploration mean that bootstrapped DQN outperforms DQN by orders of magnitude in terms of the cumulative rewards through learning (Figure 9). In both figures we normalize performance relative to a fully random policy. The most similar work to ours presents several other approaches to improved exploration in Atari [20]; they optimize for AUC-20, a normalized version of the cumulative returns after 20m frames. According to their metric, averaged across the 14 games they consider, we improve upon both base DQN (0.29) and their best method (0.37) to obtain 0.62 via bootstrapped DQN. We present these results together with results tables across all 49 games in Appendix D.4.

Figure 8: Bootstrapped DQN typically improves upon the best policy.

Figure 9: Bootstrapped DQN improves cumulative rewards by orders of magnitude.

6.4 Visualizing bootstrapped DQN

We now present some more insight into how bootstrapped DQN drives deep exploration in Atari. In each game, although each head $Q_1, \dots, Q_{10}$ learns a high-scoring policy, the policies they find are quite distinct. In the video https://youtu.be/Zm2KoT82O_M we show the evolution of these policies simultaneously for several games. Although each head performs well, they each follow a unique policy. By contrast, ε-greedy strategies are almost indistinguishable for small values of ε and totally ineffectual for larger values. We believe that this deep exploration is key to improved learning, since diverse experiences allow for better generalization.

Disregarding exploration, bootstrapped DQN may be beneficial as a purely exploitative policy. We can combine all the heads into a single ensemble policy, for example by choosing the action with the most votes across heads. This approach might have several benefits. First, we find that the ensemble policy can often outperform any individual policy. Second, the distribution of votes across heads gives a measure of the uncertainty in the optimal policy. Unlike vanilla DQN, bootstrapped DQN can know what it doesn't know. In an application where executing a poorly-understood action is dangerous, this could be crucial. In the video https://youtu.be/0jvEcC5JvGY we visualize this ensemble policy across several games. We find that the uncertainty in this policy is surprisingly interpretable: all heads agree at clearly crucial decision points, but remain diverse at other, less important steps.
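One possible implementation of the voting ensemble described above: each head casts a vote for its greedy action, the ensemble plays the most-voted action, and the vote histogram serves as a rough uncertainty signal. This is a small illustrative sketch under our own naming, not the paper's code.

```python
import numpy as np

def ensemble_vote(q_values):
    """Majority-vote policy over bootstrap heads.

    q_values: [K, n_actions] Q-values of every head at the current state.
    Returns the most-voted greedy action and the vote counts; a flat vote
    distribution signals disagreement (high uncertainty) between heads.
    """
    votes = np.bincount(q_values.argmax(axis=1), minlength=q_values.shape[1])
    return int(votes.argmax()), votes

q = np.random.default_rng(1).normal(size=(10, 4))   # e.g. 10 heads, 4 actions
action, votes = ensemble_vote(q)
print(action, votes)                                 # agreeing heads => peaked vote counts
```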
7 Closing remarks

In this paper we present bootstrapped DQN as an algorithm for efficient reinforcement learning in complex environments. We demonstrate that the bootstrap can produce useful uncertainty estimates for deep neural networks. Bootstrapped DQN is computationally tractable and also naturally scalable to massively parallel systems. We believe that, beyond our specific implementation, randomized value functions represent a promising alternative to dithering for exploration. Bootstrapped DQN practically combines efficient generalization with exploration for complex nonlinear value functions.

References

[1] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: An evaluation platform for general agents. arXiv preprint arXiv:1207.4708, 2012.
[2] Peter J. Bickel and David A. Freedman. Some asymptotic theory for the bootstrap. The Annals of Statistics, pages 1196-1217, 1981.
[3] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. ICML, 2015.
[4] Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2800-2808, 2015.
[5] Bradley Efron. The jackknife, the bootstrap and other resampling plans, volume 38. SIAM, 1982.
[6] Bradley Efron and Robert J. Tibshirani. An introduction to the bootstrap. CRC Press, 1994.
[7] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. arXiv preprint arXiv:1506.02142, 2015.
[8] Arthur Guez, David Silver, and Peter Dayan. Efficient Bayes-adaptive reinforcement learning using sample-based search. In Advances in Neural Information Processing Systems, pages 1025-1033, 2012.
[9] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563-1600, 2010.
[10] Sham Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, University College London, 2003.
[11] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.
[12] Volodymyr Mnih et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.
[13] Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In NIPS, pages 3003-3011. Curran Associates, Inc., 2013.
[14] Ian Osband and Benjamin Van Roy. Model-based reinforcement learning and the eluder dimension. In Advances in Neural Information Processing Systems, pages 1466-1474, 2014.
[15] Ian Osband and Benjamin Van Roy. Bootstrapped Thompson sampling and deep exploration. arXiv preprint arXiv:1507.00300, 2015.
[16] Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635, 2014.
[17] Art B. Owen, Dean Eckles, et al. Bootstrapping data arrays of arbitrary order. The Annals of Applied Statistics, 6(3):895-927, 2012.
[18] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
[19] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958, 2014.
[20] Bradly C. Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.
[21] Malcolm J. A. Strens. A Bayesian framework for reinforcement learning. In ICML, pages 943-950, 2000.
[22] Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, March 1998.
[23] Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58-68, 1995.
[24] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285-294, 1933.
[25] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. arXiv preprint arXiv:1509.06461, 2015.
[26] Ziyu Wang, Nando de Freitas, and Marc Lanctot. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.
[27] Zheng Wen and Benjamin Van Roy. Efficient exploration and value function generalization in deterministic systems. In NIPS, pages 3021-3029, 2013.