# Lookahead-Bounded Q-Learning

Ibrahim El Shar¹, Daniel R. Jiang¹

¹Department of Industrial Engineering, University of Pittsburgh, PA, USA. Correspondence to: Ibrahim El Shar.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

**Abstract.** We introduce the lookahead-bounded Q-learning (LBQL) algorithm, a new, provably convergent variant of Q-learning that seeks to improve the performance of standard Q-learning in stochastic environments through the use of lookahead upper and lower bounds. To do this, LBQL employs previously collected experience and each iteration's state-action values as dual feasible penalties to construct a sequence of sampled information relaxation problems. The solutions to these problems provide estimated upper and lower bounds on the optimal value, which we track via stochastic approximation. These quantities are then used to constrain the iterates to stay within the bounds at every iteration. Numerical experiments on benchmark problems show that LBQL exhibits faster convergence and more robustness to hyperparameters when compared to standard Q-learning and several related techniques. Our approach is particularly appealing in problems that require expensive simulations or real-world interactions.

## 1 Introduction

Since its introduction by Watkins (1989), Q-learning has become one of the most widely-used reinforcement learning (RL) algorithms (Sutton & Barto, 2018), due to its conceptual simplicity, ease of implementation, and convergence guarantees (Jaakkola et al., 1994; Tsitsiklis, 1994; Bertsekas & Tsitsiklis, 1996). However, practical, real-world applications of Q-learning are difficult due to the often high cost of obtaining data and experience from real environments, along with other issues such as overestimation bias (Szepesvári, 1998; Even-Dar & Mansour, 2003; Lee & Powell, 2019). In this paper, we address these challenges for a specific class of problems with partially known transition models. We write the system dynamics as $s_{t+1} = f(s_t, a_t, w_{t+1})$, where $s_t$ and $a_t$ are the current state and action, $s_{t+1}$ is the next state, $w_{t+1}$ is random noise, and $f$ is the transition function. We focus on problems where $f$ is known, but the noise $w_{t+1}$ can only be observed through interactions with the environment. This type of model is the norm in the control (Bertsekas & Tsitsiklis, 1996) and operations research (Powell, 2007) communities. In this work, we propose and analyze a new RL algorithm called lookahead-bounded Q-learning (LBQL), which exploits knowledge of the transition function $f$ to improve the efficiency of Q-learning and address overestimation bias. It does so by making better use of the observed data through estimating upper and lower bounds using a technique called information relaxation (IR) (Brown et al., 2010).

Indeed, there are abundant real-world examples that fall into this subclass of problems, as we now illustrate with a few examples. In inventory control, the transition from one inventory state to the next is a well-specified function $f$ given knowledge of a stochastic demand $w_{t+1}$ (Kunnumkal & Topaloglu, 2008). For vehicle routing, $f$ is often simply the routing decision itself, while $w_{t+1}$ are exogenous demands that require observation (Secomandi, 2001). In energy operations, a typical setting is to optimize storage that behaves through linear transitions $f$ together with unpredictable renewable supply, $w_{t+1}$ (Kim & Powell, 2011).
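As one concrete illustration of this structure (a minimal sketch, not taken from the paper; the environment name, cost parameters, and Poisson demand are illustrative assumptions), the inventory example can be written with a closed-form transition function $f$, while the demand noise is only revealed by stepping the environment:

```python
import numpy as np

def f(state, action, noise):
    """Known inventory transition: current stock plus the order, minus realized demand."""
    return max(state + action - noise, 0)

class InventoryEnv:
    """Environment whose demand noise can only be observed through interaction."""

    def __init__(self, price=4.0, order_cost=2.0, holding_cost=0.1, mean_demand=5, seed=0):
        self.price = price
        self.order_cost = order_cost
        self.holding_cost = holding_cost
        self.mean_demand = mean_demand
        self.rng = np.random.default_rng(seed)  # the demand distribution is hidden from the agent

    def step(self, state, action):
        demand = self.rng.poisson(self.mean_demand)   # w_{t+1}: observed only by interacting
        next_state = f(state, action, demand)         # the transition function f is known in closed form
        reward = (self.price * min(state + action, demand)
                  - self.order_cost * action
                  - self.holding_cost * next_state)
        return next_state, reward, demand             # returning the realized noise allows later reuse
```

Returning the realized noise alongside the reward is the kind of bookkeeping that later makes it possible to reuse observed disturbances when estimating bounds.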
In portfolio optimization, $f$ is the next portfolio, and $w_{t+1}$ represents random prices (Rogers, 2002). In Section 5 of this paper, we discuss in detail another application domain that follows this paradigm: repositioning and spatial dynamic pricing for car sharing (He et al., 2019; Bimpikis et al., 2019). Although we specialize to problems with partially known transition dynamics, this should not be considered restrictive: in fact, our proposed algorithm can be integrated with the framework of model-based RL to handle the standard model-free RL setting, where $f$ is constantly being learned. We leave this extension to future work.

**Main Contributions.** We make the following methodological and empirical contributions in this paper.

1. We propose a novel algorithm that takes advantage of IR theory and Q-learning to generate upper and lower bounds on the optimal value. This allows our algorithm to mitigate the effects of maximization bias, while making better use of the collected experience and the transition function $f$. A variant of the algorithm based on experience replay is also given.
2. We prove that our method converges almost surely to the optimal action-value function. The proof requires a careful analysis of several interrelated stochastic processes (upper bounds, lower bounds, and the Q-factors themselves).
3. Numerical experiments on five test problems show superior empirical performance of LBQL compared to Q-learning and other widely-used variants. Moreover, sensitivity analysis shows that LBQL is more robust to learning rate and exploration parameters.

The rest of the paper is organized as follows. In the next section, we review the related literature. In Section 3, we introduce the notation and review the basic theory of IR. In Section 4, we present our algorithm along with its theoretical results. In Section 5, we show the numerical results, where LBQL is compared to other Q-learning variants. Finally, we state conclusions and future work in Section 6.

## 2 Related Literature

Upper and lower bounds on the optimal value have recently been used by optimism-based algorithms, e.g., Dann et al. (2018) and Zanette & Brunskill (2019). These papers focus on finite horizon problems, while we consider the infinite horizon case. Their primary use of the lower and upper bounds is to achieve better exploration, while our work is focused on improving the action-value estimates by mitigating overestimation and enabling data re-use. In the context of real-time dynamic programming (RTDP) (Barto et al., 1995), Bounded RTDP (McMahan et al., 2005), Bayesian RTDP (Sanner et al., 2009), and Focused RTDP (Smith & Simmons, 2006) propose extensions of RTDP where a lower bound heuristic and an upper bound are maintained on the value function. These papers largely use heuristic approaches to obtain bounds, while we use the principled idea of IR duality.

More closely related to our paper is the work of He et al. (2016), which exploits multistep returns to construct bounds on the optimal action-value function, before utilizing constrained optimization to enforce those bounds. However, unlike our work, no theoretical guarantees are provided. To the best of our knowledge, we provide the first asymptotic proof of convergence for the general approach of enforcing dynamically computed (noisy) bounds.

There are also two papers that utilize IR bounds in the related setting of finite horizon dynamic programming. Jiang et al.
(2020) use IR dual bounds in a tree search algorithm in order to ignore parts of the tree. Recent work by Chen et al. (2020) uses IR duality in a duality-based dynamic programming algorithm that converges monotonically to the optimal value function through a series of subsolutions, under more restrictive assumptions (e.g., knowledge of probability distributions).

## 3 Background

In this section, we first introduce some definitions and concepts from Markov decision process (MDP) theory. Then, we describe the basic theory of information relaxations and duality, which is the main tool used in our LBQL approach.

### 3.1 MDP Model

Consider a discounted, infinite horizon MDP with a finite state space $\mathcal{S}$, a finite action space $\mathcal{A}$, and a disturbance space $\mathcal{W}$. Let $\{w_t\}$ be a sequence of independent and identically distributed (i.i.d.) random variables defined on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, where each $w_t$ is supported on the set $\mathcal{W}$. Let $s_t \in \mathcal{S}$ be the state of the system at time $t$. We also define a state transition function $f : \mathcal{S} \times \mathcal{A} \times \mathcal{W} \to \mathcal{S}$, such that if action $a_t$ is taken at time $t$, then the next state is governed by $s_{t+1} = f(s_t, a_t, w_{t+1})$. This transition function model of the MDP is more convenient for our purposes, but we note that it can easily be converted to the standard model used in RL, where the transition probabilities $p(s_{t+1} \mid s_t, a_t)$ are modeled directly. For simplicity and ease of notation, we assume that $w$ is independent¹ from $(s, a)$. Let $r(s_t, a_t)$ be the expected reward when taking action $a_t \in \mathcal{A}$ in state $s_t \in \mathcal{S}$. We assume that the rewards $r(s_t, a_t)$ are uniformly bounded by $R_{\max}$ and, for simplicity in notation, that the feasible action set $\mathcal{A}$ does not depend on the current state.

As usual, a deterministic Markov policy $\pi \in \Pi$ is a mapping from states to actions, such that $a_t = \pi(s_t)$ whenever we are following policy $\pi$. We let $\Pi$ be the set of all possible policies (or the set of all admissible policies). Given a discount factor $\gamma \in (0, 1)$ and a policy $\pi \in \Pi$, the value and the action-value functions are denoted respectively by

$$V^\pi(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \,\Big|\, \pi,\; s_0 = s\Big] \quad \text{and} \quad Q^\pi(s, a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \,\Big|\, \pi,\; s_0 = s,\; a_0 = a\Big],$$

where the notation of conditioning on $\pi$ refers to actions $a_t$ selected by $\pi(s_t)$. The expectation $\mathbb{E}$, here and throughout the paper, is taken with respect to $\mathbb{P}$. Our objective is to find a policy $\pi \in \Pi$ such that, from any initial state $s$, it achieves the optimal expected discounted cumulative reward. The value of an optimal policy $\pi^*$ for a state $s$ is called the optimal value function and is denoted by $V^*(s) = \max_\pi V^\pi(s)$. Specifically, it is well-known that an optimal policy selects actions according to $\pi^*(s) = \arg\max_{a \in \mathcal{A}} Q^*(s, a)$, where $Q^*(s, a) = \max_\pi Q^\pi(s, a)$ is the optimal action-value function (Puterman, 2014). The Bellman optimality equation gives the following recursion:

$$Q^*(s_t, a_t) = r(s_t, a_t) + \gamma\, \mathbb{E}\Big[\max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1})\Big].$$

The goal in many RL algorithms, including Q-learning (Watkins, 1989), is to approximate $Q^*$.

¹However, we can also allow for $(s, a)$-dependent $w$ with essentially no fundamental changes to our approach.

### 3.2 Information Relaxation Duality

Now let us give a brief review of the theory behind information relaxation duality from Brown et al. (2010), which is a way of computing an upper bound on the value and action-value functions. This generalizes work by Rogers (2002), Haugh & Kogan (2004), and Andersen & Broadie (2004) on pricing American options.
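Before turning to the relaxation itself, it may help to see the Bellman recursion of Section 3.1 written out. The following is a minimal sketch (not part of the paper's algorithm; all names are illustrative) of exact tabular value iteration on $Q$ for the transition-function model, which assumes, unlike the learning setting studied here, that the noise distribution is available:

```python
import itertools

def q_value_iteration(states, actions, noise_vals, noise_probs, f, r, gamma=0.95,
                      tol=1e-8, max_iters=10_000):
    """Tabular value iteration on Q for the transition-function MDP model:
    Q(s, a) = r(s, a) + gamma * E_w[ max_a' Q(f(s, a, w), a') ]."""
    Q = {(s, a): 0.0 for s, a in itertools.product(states, actions)}
    for _ in range(max_iters):
        delta = 0.0
        for s, a in itertools.product(states, actions):
            # Expectation over the (here assumed known) noise distribution.
            expected_max = sum(p * max(Q[f(s, a, w), b] for b in actions)
                               for w, p in zip(noise_vals, noise_probs))
            new_value = r(s, a) + gamma * expected_max
            delta = max(delta, abs(new_value - Q[s, a]))
            Q[s, a] = new_value
        if delta < tol:
            break
    return Q
```

Q-learning replaces the explicit expectation above with sampled transitions; the information relaxation machinery described next is what LBQL uses to bound those sampled estimates.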
Note that any feasible policy provides a lower bound on the optimal value, but computing an upper bound is less straightforward. The information relaxation approach proposes to relax the non-anticipativity constraints on policies, i.e., it allows them to depend on realizations of future uncertainties when making decisions. Naturally, optimizing over the class of policies that can see the future provides an upper bound on the best admissible policy. We focus on the special case of perfect information relaxation, where full knowledge of the future uncertainties, i.e., the sample path $(w_1, w_2, \ldots)$, is used to create upper bounds. The naive version of the perfect information bound is simply given by

$$V^*(s_0) \le \mathbb{E}\Big[\max_{\mathbf{a}} \Big\{ \sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t) \Big\}\Big],$$

which, in essence, is an interchange of the expectation and max operators; the interpretation here is that an agent who is allowed to adjust her actions after the uncertainties are realized achieves higher reward than an agent who acts sequentially. As one might expect, perfect information can provide upper bounds that are quite loose.

The central idea of the information relaxation approach to strengthen these upper bounds is to simultaneously (1) allow the use of future information but (2) also penalize the agent for doing so by assessing a penalty on the reward function in each period. A penalty function is said to be dual feasible if it does not penalize any admissible policy $\pi \in \Pi$ in expectation. Let $s_{t+1} = f(s_t, a_t, w_{t+1})$ be the next state, $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ be a bounded function, and $w$ have the same distribution as $w_{t+1}$. Then, penalties involving terms of the form

$$z_t^\pi(s_t, a_t, w_{t+1} \mid \phi) := \gamma^{t+1} \Big( \phi\big(s_{t+1}, \pi(s_{t+1})\big) - \mathbb{E}\big[\phi\big(f(s_t, a_t, w), \pi(f(s_t, a_t, w))\big)\big] \Big) \tag{1}$$

are dual feasible because

$$\mathbb{E}\Big[\sum_{t=0}^{\infty} z_t^\pi(s_t, a_t, w_{t+1} \mid \phi)\Big] = 0.$$

This is a variant of the types of penalties introduced in Brown & Haugh (2017), extended to the case of action-value functions. Intuitively, if $\phi$ is understood to be an estimate of the optimal action-value function $Q^*$ and $\pi$ an estimate of the optimal policy $\pi^*$, then $z_t^\pi$ can be thought of as the one-step value of future information (i.e., knowing $w_{t+1}$ versus taking an expectation over its distribution). These terms, however, may have negative expected value for policies that violate non-anticipativity constraints.

Let $\pi_\phi$ be the policy that is greedy with respect to the bounded function $\phi$ (considered to be an approximate action-value function). Consider the problem constructed by subtracting the penalty term from the reward in each period and relaxing non-anticipativity constraints by interchanging maximization and expectation:

$$Q^U(s_0, a_0) = \mathbb{E}\Big[\max_{\mathbf{a}} \Big\{ \sum_{t=0}^{\infty} \big( \gamma^t\, r(s_t, a_t) - z_t^{\pi_\phi}(s_t, a_t, w_{t+1} \mid \phi) \big) \Big\}\Big], \tag{2}$$

where $\mathbf{a} := (a_0, a_1, \ldots)$ is an infinite sequence of actions. Brown & Haugh (2017) shows that the objective value of this problem, $Q^U(s_0, a_0)$, is an upper bound on $Q^*(s_0, a_0)$. Our new approach to Q-learning will take advantage of this idea, with $\phi$ and $\pi$ being continuously updated. Notice that, in principle, it is possible to estimate the problem in (2) using Monte Carlo simulation. To do this, we generate infinitely long sample paths of the form $\mathbf{w} = (w_1, w_2, \ldots)$, and for each fixed $\mathbf{w}$, we solve the inner deterministic dynamic programming (DP) problem. Averaging over the results produces an estimate of the upper bound $Q^U(s_0, a_0)$.

### 3.3 Absorption Time Formulation

In practice, however, we cannot simulate infinitely long sample paths $\mathbf{w}$. One solution is to use an equivalent formulation with a finite, but random, horizon (see, e.g.,
Proposition 5.3.1 of Puterman 2014), where instead of discounting, a new absorbing state $\bar{s}$ with zero reward is added to the state space $\mathcal{S}$. This new state $\bar{s}$ can be reached from every state and for any feasible action with probability $1 - \gamma$. We define a new state transition function $h$, which transitions to $\bar{s}$ with probability $1 - \gamma$ from every $(s, a)$, but conditioned on not absorbing (i.e., with probability $\gamma$), $h$ is identical to $f$. We refer to this as the absorption time formulation, where the horizon length $\tau := \min\{t : s_t = \bar{s}\}$ has a geometric distribution with parameter $1 - \gamma$ and the state transitions are governed by the state transition function $h$ instead of $f$. Let $\mathcal{Q}$ be the set of bounded functions $\phi$ such that $\phi(\bar{s}, a) = 0$ for all $a \in \mathcal{A}$. The penalty terms for the absorption time formulation are defined in a similar way as (1), except we now consider $\phi \in \mathcal{Q}$:

$$\zeta_t^\pi(s_t, a_t, w_{t+1} \mid \phi) := \phi\big(s_{t+1}, \pi(s_{t+1})\big) - \mathbb{E}\big[\phi\big(h(s_t, a_t, w), \pi(h(s_t, a_t, w))\big)\big], \tag{3}$$

where $s_{t+1} = h(s_t, a_t, w_{t+1})$. We now state a proposition that summarizes the information relaxation duality results, which is a slight variant of results in Proposition 2.2 of Brown & Haugh (2017).

**Proposition 1** (Duality Results, Proposition 2.2 in Brown & Haugh (2017)). The following duality results are stated for the absorption time formulation of the problem.

(i) Weak Duality: For any $\pi \in \Pi$ and $\phi \in \mathcal{Q}$,

$$Q^\pi(s_0, a_0) \le \mathbb{E}\Big[\max_{\mathbf{a}} \sum_{t=0}^{\tau-1} \big( r(s_t, a_t) - \zeta_t^{\pi_\phi}(s_t, a_t, w_{t+1} \mid \phi) \big)\Big]. \tag{4}$$

(ii) Strong Duality: It holds that

$$Q^*(s_0, a_0) = \inf_{\phi \in \mathcal{Q}} \mathbb{E}\Big[\max_{\mathbf{a}} \sum_{t=0}^{\tau-1} \big( r(s_t, a_t) - \zeta_t^{\pi_\phi}(s_t, a_t, w_{t+1} \mid \phi) \big)\Big], \tag{5}$$

with the infimum attained at $\phi = Q^*$.

The DP inside the expectation on the right-hand side of (4) is called the inner DP problem. Weak duality tells us that by using a dual feasible penalty, we can get an estimated upper bound on the optimal action-value function $Q^*(s_0, a_0)$ by simulating multiple sample paths and averaging the optimal value of the resulting inner problems. Strong duality suggests that the information gained from accessing the future is perfectly cancelled out by the optimal dual feasible penalty.

For a given sample path $\mathbf{w} = (w_1, w_2, \ldots, w_\tau)$, each of the inner DP problems can be solved via the backward recursion

$$Q_t^U(s_t, a_t) = r(s_t, a_t) - \zeta_t^{\pi_\phi}(s_t, a_t, w_{t+1} \mid \phi) + \max_{a} Q_{t+1}^U(s_{t+1}, a), \tag{6}$$

for $t = \tau-1, \tau-2, \ldots, 0$, with $s_{t+1} = h(s_t, a_t, w_{t+1})$ and $Q_\tau^U \equiv 0$ (as there is no additional reward after entering the absorbing state $\bar{s}$). The optimal value of the inner problem is given by $Q_0^U(s_0, a_0)$.

### 3.4 Lower Bounds using IR

The penalty function approach also allows for using a feasible policy to estimate a lower bound on the optimal value, such that when using a common sample path, this lower bound is guaranteed to be less than the corresponding estimated upper bound, a crucial aspect of our theoretical analysis. Specifically, given a sample path $(w_1, w_2, \ldots, w_\tau)$, the inner problem used to evaluate a feasible policy $\pi \in \Pi$ is given by

$$Q_t^L(s_t, a_t) = r(s_t, a_t) - \zeta_t^\pi(s_t, a_t, w_{t+1} \mid \phi) + Q_{t+1}^L\big(s_{t+1}, \pi(s_{t+1})\big), \tag{7}$$

for $t = 0, \ldots, \tau-1$, with $s_{t+1} = h(s_t, a_t, w_{t+1})$ and $Q_\tau^L \equiv 0$. It follows that $\mathbb{E}\big[Q_0^L(s_0, a_0)\big] = Q^\pi(s_0, a_0)$, as the penalty terms $\zeta_t^\pi(s_t, a_t, w_{t+1} \mid \phi)$ have zero mean.

## 4 QL with Lookahead Bounds

We now introduce our proposed approach, which integrates the machinery of IR duality with Q-learning in a unique way. An outline of the essential steps is given below.

1. On a given iteration, we first experience a realization of the exogenous information $w_{t+1}$ and make a standard Q-learning update.
2. We then set $\phi$ to be the newly updated Q-iterate and compute noisy upper and lower bounds on the true $Q^*$, which are then tracked and averaged using a stochastic approximation step.

3. Finally, we project the Q-iterate so that it satisfies the averaged upper and lower bounds and return to Step 1.

Figure 1 shows an illustration of each of these steps at a given iteration of the algorithm. Since we are setting $\phi$ to be the current Q-iterate at every iteration, the information relaxation bounds are computed using a dynamic sequence of penalty functions and averaged together using stochastic approximation. The idea is that as our approximation of $Q^*$ improves, our upper and lower bounds also improve. As the upper and lower bounds improve, the projection step further improves the Q-iterates. It is this back-and-forth feedback between the two processes that has the potential to yield rapid convergence toward the optimal $Q^*$.

Figure 1: Illustration of the LBQL algorithm at iteration $n$ (lookahead samples feed the bound estimates, which are enforced via the projection $\Pi_{[L_{n+1}, U_{n+1}]}[Q_{n+1}]$).

The primary drawback of our approach is that in the computation of the information relaxation dual bounds, expectations need to be computed. We first show an idealized version of the algorithm where these expectations are estimated using unbiased samples of $w_{t+1}$ from a black-box simulator. Later, we relax the need for a black-box simulator and show how our algorithm can be implemented with a replay buffer. Both versions are analyzed theoretically and convergence results are provided.

### 4.1 An Idealized Algorithm

Let $\{w_{t+1}^1, w_{t+1}^2, \ldots, w_{t+1}^K\}$ be a batch (as opposed to a sample path) of $K$ samples from the distribution of the exogenous information $w_{t+1}$ (i.e., from a black-box simulator). An empirical version of (3) is simply given by

$$\hat{\zeta}_t^\pi(s_t, a_t, w_{t+1} \mid \phi) := \phi\big(s_{t+1}, \pi(s_{t+1})\big) - \frac{1}{K} \sum_{k=1}^{K} \phi\big(h(s_t, a_t, w_{t+1}^k), \pi(h(s_t, a_t, w_{t+1}^k))\big), \tag{8}$$

where $s_{t+1} = h(s_t, a_t, w_{t+1})$. Given a sample path $\mathbf{w} = (w_1, w_2, \ldots, w_\tau)$ of the absorption time formulation of the problem, the analogues of (6) and (7) using $\hat{\zeta}_t^\pi$, where in (7) we set $\pi = \pi_\phi$ (i.e., the lower bound on the optimal value is constructed by approximately evaluating the feasible policy $\pi_\phi$), are given by

$$\hat{Q}_t^U(s_t, a_t) = r(s_t, a_t) - \hat{\zeta}_t^{\pi_\phi}(s_t, a_t, w_{t+1} \mid \phi) + \max_{a} \hat{Q}_{t+1}^U(s_{t+1}, a), \tag{9}$$

$$\hat{Q}_t^L(s_t, a_t) = r(s_t, a_t) - \hat{\zeta}_t^{\pi_\phi}(s_t, a_t, w_{t+1} \mid \phi) + \hat{Q}_{t+1}^L\big(s_{t+1}, \pi_\phi(s_{t+1})\big), \tag{10}$$

for $t = 0, 1, \ldots, \tau-1$, where $s_{t+1} = h(s_t, a_t, w_{t+1})$, $\hat{Q}_\tau^U \equiv \hat{Q}_\tau^L \equiv 0$, and we assume that each call to $\hat{\zeta}_t^\pi$ uses a fresh batch of $K$ samples.

**Proposition 2.** The valid upper and lower bound properties continue to hold in the empirical case:
$$\mathbb{E}\big[\hat{Q}_0^L(s, a)\big] \le Q^*(s, a) \le \mathbb{E}\big[\hat{Q}_0^U(s, a)\big]$$
for any state-action pair $(s, a)$.

We include the proof in Appendix A.1. The proof is similar to that of Proposition 2.3(iv) of Brown et al. (2010), except extended to the infinite horizon setting with the absorption time formulation. A detailed description of the LBQL algorithm is given in Algorithm 1, where we use $n$ for the iteration index in order to avoid confusion with the $t$ used in the inner DP problems. We use $\Pi_{[a,b]}[x]$ to denote $x$ projected onto $[a, b]$, i.e., $\Pi_{[a,b]}[x] = \max\{\min\{x, b\}, a\}$, where either $a$ or $b$ could be $\pm\infty$. Let $\rho = R_{\max}/(1-\gamma)$; the initial lower and upper bound estimates are set such that $L_0(s, a) = -\rho$ and $U_0(s, a) = \rho$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$. The initial action-value $Q_0$ is set arbitrarily such that $L_0(s, a) \le Q_0(s, a) \le U_0(s, a)$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$.

**Algorithm 1** Lookahead-Bounded Q-Learning

Input: Initial estimates $L_0 \le Q_0 \le U_0$, batch size $K$, and stepsize rules $\alpha_n(s, a)$, $\beta_n(s, a)$.
Output: Approximations $\{L_n\}$, $\{Q'_n\}$, and $\{U_n\}$.

Set $Q'_0 = Q_0$ and choose an initial state $s_0$.

for $n = 0, 1, 2, \ldots$ do

Choose an action $a_n$ via some behavior policy (e.g., $\epsilon$-greedy). Observe $w_{n+1}$. Let

$$Q_{n+1}(s_n, a_n) = Q'_n(s_n, a_n) + \alpha_n(s_n, a_n)\Big[ r_n(s_n, a_n) + \gamma \max_{a} Q'_n(s_{n+1}, a) - Q'_n(s_n, a_n) \Big]. \tag{11}$$

Set $\phi = Q_{n+1}$. Using one sample path $\mathbf{w}$, compute $\hat{Q}_0^U(s_n, a_n)$ and $\hat{Q}_0^L(s_n, a_n)$ using (9) and (10).

Update and enforce upper and lower bounds:

$$U_{n+1}(s_n, a_n) = \Pi_{[-\rho,\,\infty]}\Big[ U_n(s_n, a_n) + \beta_n(s_n, a_n)\big( \hat{Q}_0^U(s_n, a_n) - U_n(s_n, a_n) \big) \Big], \tag{12}$$

$$L_{n+1}(s_n, a_n) = \Pi_{[-\infty,\,\rho]}\Big[ L_n(s_n, a_n) + \beta_n(s_n, a_n)\big( \hat{Q}_0^L(s_n, a_n) - L_n(s_n, a_n) \big) \Big], \tag{13}$$

$$Q'_{n+1}(s_n, a_n) = \Pi_{[L_{n+1}(s_n, a_n),\, U_{n+1}(s_n, a_n)]}\big[ Q_{n+1}(s_n, a_n) \big]. \tag{14}$$

Figure 2: A simple stochastic MDP.

**Example 1.** We demonstrate the idealized LBQL algorithm using the simple MDP shown in Figure 2. The MDP has two non-terminal states $A$ and $B$. Each episode starts in state $A$, with a choice of two actions: right and left, denoted by R and L respectively. The rewards and transition probabilities of taking an action in each state are shown on the edges in the figure. Assume that the transitions are governed by the outcome of a fair coin: if the outcome is Heads, then we transition in the direction of our chosen action, and in the opposite direction for a Tails outcome. For a discount factor $\gamma = 0.95$, the optimal policy is to go right at both $A$ and $B$. The optimal action-values are given by $Q^*(A, \text{R}) = Q^*(B, \text{R}) = 0$ and $Q^*(A, \text{L}) = Q^*(B, \text{L}) = -1$. Consider applying the idealized version of LBQL described in Algorithm 1. We let $\alpha_n = 0.1$, $\beta_n = 0.05$ for all $n$.

Figure 3: An illustration of LBQL iterates ($Q_n$, $L_n$, $Q'_n$, $U_n$) for Example 1.

Figure 3 illustrates two iterations from the first and the 58th episodes. Initially, $Q_0(s, a) = 0$ and $\rho = 20$. After one episode the bounds are still loose, so we have $Q_1(A, \text{R}) = Q'_1(A, \text{R}) = 0.1$. At episode 58 (281 iterations), learning has occurred for the lower and upper bound values for the right action at $A$ and $B$. We see that the bounds are already effective in keeping the Q-iterate close to $Q^*$. Interestingly, the upper bound is enforced at $A$, while the lower bound is enforced at $B$. Note that these are the results of a real simulation.

### 4.2 Analysis of Convergence

In this section, we analyze the convergence of the idealized version of the LBQL algorithm to the optimal action-value function $Q^*$. We start by summarizing and developing some important technical results that will be used in our analysis. All proofs are presented in Appendix A.

The following proposition establishes the boundedness of the action-value iterates and asymptotic bounds on the $L_n$ and $U_n$ iterates of Algorithm 1, which are needed in our proof of convergence. The proof of this proposition is presented in Section A.2 of the Appendix.

**Proposition 3** (Boundedness). For all $(s, a) \in \mathcal{S} \times \mathcal{A}$, we have the following:

(i) The iterates $Q_n(s, a)$ and $Q'_n(s, a)$ remain bounded for all $(s, a) \in \mathcal{S} \times \mathcal{A}$ and for all $n$.

(ii) For every $\eta > 0$, and with probability one, there exists some finite iteration index $n_0$ such that $L_n(s, a) \le Q^*(s, a) + \eta$ and $Q^*(s, a) - \eta \le U_n(s, a)$ for all iterations $n \ge n_0$.

Proposition 3(i) ensures that at each iteration $n$ the action-value iterates $Q_n$ and $Q'_n$ are bounded.
This allows us to set $\phi = Q_{n+1}$ at each iteration of Algorithm 1 and is required to establish convergence in general. The proof is based on showing an inductive relationship that connects $Q_n$ and $Q'_n$ to the previous lower and upper bound iterates. Specifically, we show that both action-value iterates are bounded below by the preceding lower bound iterates and above by the preceding upper bound iterates. Proposition 3(ii) ensures that there exists a finite iteration after which the lower and upper bound iterates $L_n$ and $U_n$ are lower and upper bounds on the optimal action-value function $Q^*$ with an error margin of at most an arbitrary amount $\eta > 0$. In the proof of Proposition 3(ii), we bound the lower and upper bound iterates by a noise process and another sequence that converges to $Q^*$. We show that the noise process possesses some properties that help to eliminate the effect of the noise asymptotically. With the effects of the noise terms vanishing, the boundedness of the lower and upper bound iterates by $Q^*$ is achieved.

Examining the update equations (12) and (13) for $U_{n+1}$ and $L_{n+1}$ in Algorithm 1, we remark that they are not standard stochastic approximation or stochastic gradient updates, because $\hat{Q}_0^U$ and $\hat{Q}_0^L$ are computed with iteration-dependent penalty functions generated by $\phi = Q_{n+1}$. In other words, the noiseless function itself is changing over time. The proof of Proposition 3(ii) essentially uses the fact that even though these updates are being performed with respect to different underlying functions, as long as we can apply Proposition 2 in every case, then after the noise is accounted for, the averaged values $U_{n+1}$ and $L_{n+1}$ are eventually bounded below and above by $Q^*$, respectively. The following lemma derives some guarantees on the lower and upper bound iterates of Algorithm 1, whose proof appears in Section A.3 of the Appendix.

**Lemma 1** (Consistency of Bounds). If $L_0(s, a) \le U_0(s, a)$, then $L_n(s, a) \le U_n(s, a)$ for all iterations $n$ and for all $(s, a) \in \mathcal{S} \times \mathcal{A}$.

In particular, Lemma 1 shows that the upper and lower bound iterates do not interchange roles and become inconsistent. This is an important property; otherwise, the projection step of Algorithm 1 loses its meaning and would require additional logic to handle inconsistent bounds. The result of Lemma 1 follows mainly from the fact that we are using the same sample path to solve the upper and lower bound inner problems, (9) and (10), respectively.

Before stating our convergence results, we first state a typical assumption on the stepsizes and the state visits.

**Assumption 1.** We assume that:

(i) $\sum_{n=0}^{\infty} \alpha_n(s, a) = \infty$, $\sum_{n=0}^{\infty} \alpha_n^2(s, a) < \infty$, $\sum_{n=0}^{\infty} \beta_n(s, a) = \infty$, $\sum_{n=0}^{\infty} \beta_n^2(s, a) < \infty$;

(ii) each state $s \in \mathcal{S}$ is visited infinitely often with probability one.

We now state one of our main theoretical results.

**Theorem 1** (Convergence of LBQL). Under Assumption 1, the following hold with probability 1:

(i) $Q'_n(s, a)$ in Algorithm 1 converges to the optimal action-value function $Q^*(s, a)$ for all state-action pairs $(s, a)$.

(ii) If the penalty terms are computed exactly, i.e., as per (3), then the iterates $L_n(s, a)$, $Q'_n(s, a)$, $U_n(s, a)$ in Algorithm 1 converge to the optimal action-value function $Q^*(s, a)$ for all state-action pairs $(s, a)$.

Due to the interdependent feedback between $Q$, $U$, and $L$, it is not immediately obvious that the proposed scheme does not diverge. The primary challenge in the analysis of this theorem is to handle this unique aspect of the algorithm.
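To make the update that this analysis covers concrete, here is a minimal sketch (array names and the function signature are illustrative, not the paper's implementation) of one visited state-action pair's update: the Q-learning step (11), the stochastic-approximation averaging of the noisy bounds (12)-(13), and the projection (14). Here `Q` stores the projected iterates $Q'_n$, and `q_hat_U`, `q_hat_L` stand for the inner-problem solutions $\hat{Q}_0^U(s_n, a_n)$ and $\hat{Q}_0^L(s_n, a_n)$.

```python
import numpy as np

def clip_interval(x, lo, hi):
    """Projection onto [lo, hi]; lo may be -inf and hi may be +inf."""
    return max(min(x, hi), lo)

def bound_and_project(Q, L, U, s, a, q_target, q_hat_U, q_hat_L, alpha, beta, rho):
    """One (s, a) update: Q-learning step toward q_target = r + gamma * max_a' Q'(s', a'),
    stochastic-approximation averaging of the noisy bounds, then projection of the Q-iterate."""
    q_new = Q[s, a] + alpha * (q_target - Q[s, a])                               # eq. (11)
    U[s, a] = clip_interval(U[s, a] + beta * (q_hat_U - U[s, a]), -rho, np.inf)  # eq. (12)
    L[s, a] = clip_interval(L[s, a] + beta * (q_hat_L - L[s, a]), -np.inf, rho)  # eq. (13)
    Q[s, a] = clip_interval(q_new, L[s, a], U[s, a])                             # eq. (14)
```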
### 4.3 LBQL with Experience Replay

We now introduce a more practical version of LBQL that uses experience replay in lieu of a black-box simulator. Here, we use a noise buffer $\mathcal{B}$ to record the unique noise values $w$ that are observed at every iteration. We further assume that the noise space $\mathcal{W}$ is finite, a reasonable assumption for a finite MDP. The buffer $\mathcal{B}$ is used in two ways: (1) to generate the sample path $\mathbf{w}$ and (2) to estimate the expectation in the penalty function. Here, we track and update the distribution of the noise $w$ after every iteration and directly compute the expectation under this distribution instead of sampling a batch of size $K$, as we did previously. To illustrate how this can be done, suppose $\mathcal{W} = \{w^a, w^b, w^c, w^d\}$ and that at iteration $n$ we observe $w_{n+1} = w^a$. Let $p^a$ denote the probability of observing $w^a$, and $N_n(w^a)$ the number of times $w^a$ is observed in the first $n$ iterations; then the empirical estimate of $p^a$ is given by $\hat{p}_n(w^a) = N_n(w^a)/n$.² We denote by $\hat{\mathbb{E}}_n[\,\cdot\,]$ the expectation computed using the empirical distribution $\hat{p}_n$.

²Note that LBQL could, in principle, be adapted to the case of continuous noise (i.e., where $w$ is a continuous random variable) using methods like kernel density estimation (KDE).

To differentiate the penalty and the action-values (solutions to the inner problems) that are computed from the buffer from those defined in the idealized version of the algorithm, we define

$$\bar{\zeta}_t^\pi(s_t, a_t, w_{t+1} \mid \phi) := \phi\big(s_{t+1}, \pi(s_{t+1})\big) - \hat{\mathbb{E}}_n\big[\phi\big(h(s_t, a_t, w), \pi(h(s_t, a_t, w))\big)\big], \tag{15}$$

and, given a sample path $\mathbf{w} = (w_1, w_2, \ldots, w_\tau)$, the inner problems analogous to (9) and (10) are given by

$$\bar{Q}_t^U(s_t, a_t) = r(s_t, a_t) - \bar{\zeta}_t^{\pi_\phi}(s_t, a_t, w_{t+1} \mid \phi) + \max_{a} \bar{Q}_{t+1}^U(s_{t+1}, a), \tag{16}$$

$$\bar{Q}_t^L(s_t, a_t) = r(s_t, a_t) - \bar{\zeta}_t^{\pi_\phi}(s_t, a_t, w_{t+1} \mid \phi) + \bar{Q}_{t+1}^L\big(s_{t+1}, \pi_\phi(s_{t+1})\big), \tag{17}$$

for $t = 0, 1, \ldots, \tau-1$, where $s_{t+1} = h(s_t, a_t, w_{t+1})$ and $\bar{Q}_\tau^U \equiv \bar{Q}_\tau^L \equiv 0$. The pseudo-code of LBQL with experience replay is shown in Algorithm 2 in Appendix B.

### 4.4 Convergence of LBQL with Experience Replay

In this section, we prove that the version of LBQL with experience replay also converges to the optimal action-value function. We start by stating a lemma that confirms Proposition 3 and Lemma 1 still hold when the penalty terms are computed using (15).

**Lemma 2.** If at any iteration $n$ the penalty terms are computed using the estimated distribution $\hat{p}_n$, i.e., as per (15), then Proposition 3 and Lemma 1 still hold.

**Theorem 2** (Convergence of LBQL with experience replay). Under Assumption 1, the following hold with probability 1:

(i) $Q'_n(s, a)$ in Algorithm 2 converges to the optimal action-value function $Q^*(s, a)$ for all state-action pairs $(s, a)$.

(ii) The iterates $L_n(s, a)$, $Q'_n(s, a)$, $U_n(s, a)$ in Algorithm 2 converge to the optimal action-value function $Q^*(s, a)$ for all state-action pairs $(s, a)$.

The proof is similar to that of Theorem 1, but using the observations collected in the buffer naturally results in an additional bias term in our analysis. The proof of Lemma 2 shows that as we learn the distribution of the noise, this bias term goes to zero and our original analysis in the unbiased case continues to hold. Notice that the results in part (ii) of the theorem are, in a sense, stronger than those of Theorem 1(ii). While both achieve asymptotic convergence of the lower and upper bounds to the optimal action-value function, Theorem 2(ii) does not require computing the penalty with the true distribution, i.e., using (3).
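The main bookkeeping behind this is the empirical noise distribution. A minimal sketch of a buffer that tracks $\hat{p}_n$ and of the penalty in (15) is shown below (class and function names are illustrative, not the paper's implementation):

```python
from collections import Counter

class NoiseBuffer:
    """Tracks the empirical distribution p_hat_n of observed noise values."""

    def __init__(self):
        self.counts = Counter()
        self.n = 0

    def record(self, w):
        self.counts[w] += 1
        self.n += 1

    def expectation(self, g):
        """Empirical expectation E_hat_n[g(w)] = sum_w p_hat_n(w) * g(w)."""
        return sum((count / self.n) * g(w) for w, count in self.counts.items())

def empirical_penalty(phi, pi, h, s, a, w_next, buffer):
    """Penalty of eq. (15): phi at the realized next state minus its empirical expectation."""
    s_next = h(s, a, w_next)
    realized = phi(s_next, pi(s_next))
    expected = buffer.expectation(lambda w: phi(h(s, a, w), pi(h(s, a, w))))
    return realized - expected
```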
The reason is that, in the experience replay version, the distribution of the noise random variables is itself learned.

## 5 Numerical Experiments

In our numerical experiments we make slight modifications to Algorithm 2, which help to reduce its computational requirements. A detailed description of all changes is included in Appendix C. We also open-source a Python package³ for LBQL that reproduces all experiments and figures presented in this paper. We compare LBQL with experience replay against several algorithms: Q-learning (QL), double Q-learning (Double-QL), speedy Q-learning (SQL), and bias-corrected Q-learning (BCQL) (van Hasselt, 2010; Azar et al., 2011; Lee & Powell, 2019). The environments that we consider are summarized below. Detailed descriptions of the environments, the parameters used for the five algorithms, and sensitivity analysis are deferred to Appendix D.

³https://github.com/ibrahim-elshar/LBQL ICML2020.

**Windy Gridworld (WG).** This is a well-known variant of the standard gridworld problem discussed in Sutton & Barto (2018). There is an upward wind with a random intensity. The agent moves extra steps in the wind direction whenever it reaches an affected square. The reward is -1 until the goal state is reached, and the reward is 0 thereafter.

**Stormy Gridworld (SG).** We then consider a new domain that adds the additional complexity of rain and multidirectional wind to windy gridworld. The location of the rain is random and, when it occurs, puddles that provide negative rewards are created. The reward is similar to that of WG, except that puddle states provide a reward of -10.

**Repositioning in Two-Location Car-sharing (2-CS-R).** Our next benchmark is a synthetic problem of balancing an inventory of cars by repositioning them in a car-sharing platform with two stations (He et al., 2019). The actions are to decide on the number of cars to be repositioned from one station to the other before random demand is realized. All rentals are one-way (i.e., rentals from station A end up at B, and vice-versa). The goal is to maximize revenue for a fixed rental price subject to lost sales and repositioning costs.

**Pricing in Two-Location Car-sharing (2-CS).** Here, we consider the benchmark problem of spatial dynamic pricing on a car-sharing platform with two stations, motivated partially by Bimpikis et al. (2019). The actions are to set a price at each station, which influences the station's (stochastic) demand for rentals. Rentals are one-way and the goal is to maximize revenue under a lost sales cost.

**Pricing in Four-Location Car-sharing (4-CS).** The final benchmark that we consider is a variant of the above pricing problem with four stations. Now, however, we consider both one-way and return trips at each station. In this case, we have two sources of randomness: the noise due to stochastic demand and the noise due to the random distribution of fulfilled rentals between the stations.
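These benchmarks all fit the known-transition framework of Section 1. As a purely illustrative example (a hypothetical sketch consistent with the 2-CS-R description above, not the paper's implementation; the fleet size, prices, and costs are assumptions), the two-station repositioning dynamics can be written in the $s_{t+1} = f(s_t, a_t, w_{t+1})$ form:

```python
import numpy as np

def reposition_step(cars_at_A, action, demand, fleet_size=10,
                    price=2.0, lost_sale_cost=1.0, reposition_cost=0.5):
    """Hypothetical two-station repositioning dynamics in the s' = f(s, a, w) form.
    State: cars at station A (station B holds fleet_size - cars_at_A).
    Action: cars moved from A to B (negative values move B to A) before demand w = (d_A, d_B)."""
    d_A, d_B = demand
    avail_A = int(np.clip(cars_at_A - action, 0, fleet_size))
    avail_B = fleet_size - avail_A
    rentals_A, rentals_B = min(avail_A, d_A), min(avail_B, d_B)  # one-way rentals: A -> B and B -> A
    next_cars_at_A = avail_A - rentals_A + rentals_B             # known transition function f
    reward = (price * (rentals_A + rentals_B)
              - lost_sale_cost * ((d_A - rentals_A) + (d_B - rentals_B))
              - reposition_cost * abs(action))
    return next_cars_at_A, reward
```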
Figure 4: Illustration of LBQL upper and lower bounds for (A) WG, (B) SG, (C) 2-CS-R, (D) 2-CS, and (E) 4-CS.

First, we illustrate conceptually in Figure 4 how the upper and lower bounds of LBQL can squeeze the Q-learning results toward the optimal value (the plots show a particular state-action pair $(s, a)$ for illustrative reasons). For example, in Figure 4A, we observe that the LBQL iterates (orange) match the Q-learning iterates (solid black) initially, but as the upper bound (green) becomes better estimated, the LBQL iterates are pushed toward the optimal value (dotted black). We see that even though the same hyperparameters are used between LBQL and QL, the new approach is able to quickly converge. In the 4-CS example, Figure 4E, $Q^*$ is not shown since it is computationally difficult to obtain, but the gap between the upper and lower bounds, along with Theorem 2(ii), suggests that LBQL is converging faster than standard Q-learning.

The full results (with 95% confidence intervals) of the numerical experiments are shown in Figure 5. LBQL drastically outperforms the other algorithms in terms of the performance curve on the gridworld domains, but for the car-sharing problems, double Q-learning is superior in the first 20,000 steps. Afterwards, LBQL catches up and remains the best performing algorithm. From the relative error plots (which measure the percent error, in $\ell_2$-norm, of the approximate value function with respect to the true optimal value function, i.e., $\|V_n - V^*\|_2 / \|V^*\|_2$), we see that LBQL has the steepest initial decline. In windy gridworld and car-sharing, LBQL outperforms the other algorithms in terms of relative error, but BCQL and SQL achieve slightly lower relative error than LBQL for stormy gridworld.

Figure 5: Results from the numerical experiments: performance and relative error curves for WG, SG, 2-CS-R, and 2-CS, and the performance curve for 4-CS.

We also conducted a set of sensitivity analysis experiments, where we varied the learning rate and exploration hyperparameters across all algorithms (results are given in Appendix D.3). We examine the number of iterations and CPU time needed to reach 50%, 20%, 5%, and 1% relative error. The results show that LBQL outperforms BCQL, SQL, and QL in terms of both iterations and CPU time in reaching 20%, 5%, and 1% relative error across all 15 hyperparameter settings that we tested. For the case of 50% relative error, BCQL outperforms LBQL in five out of 15 cases. This indicates that LBQL is significantly more robust to hyperparameter settings than the other algorithms. Roughly speaking, this robustness might be attributed to the approximate planning aspect of the algorithm, where lower and upper bounds are computed.

## 6 Conclusion

In this paper, we present the LBQL algorithm and prove its almost sure convergence to the optimal action-value function. We also propose a practical extension of the algorithm that uses an experience replay buffer. Numerical results illustrate the rapid convergence of our algorithm empirically when compared to a number of well-known variants of Q-learning on five test problems. LBQL is shown to have superior performance, robustness against learning rate and exploration strategies, and an ability to mitigate maximization bias.

Interesting future work is the extension of our new framework to the model-based RL setting, where the transition function $f$ is learned while the policy is optimized. Other interesting future work includes looking beyond the tabular case and adapting our algorithm to the setting of value function approximations, such as DQN (Mnih et al., 2013).

## References

Andersen, L. and Broadie, M. Primal-dual simulation algorithm for pricing multidimensional American options. Management Science, 50(9):1222-1234, 2004.

Azar, M. G., Munos, R., Ghavamzadeh, M., and Kappen, H. J. Speedy Q-learning. In Advances in Neural Information Processing Systems 24, 2011.

Barto, A. G., Bradtke, S. J., and Singh, S. P.
Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1-2):81-138, 1995.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming, volume 5. Athena Scientific, Belmont, MA, 1996.

Bimpikis, K., Candogan, O., and Saban, D. Spatial pricing in ride-sharing networks. Operations Research, 2019.

Brown, D. B. and Haugh, M. B. Information relaxation bounds for infinite horizon Markov decision processes. Operations Research, 65(5):1355-1379, 2017.

Brown, D. B., Smith, J. E., and Sun, P. Information relaxations and duality in stochastic dynamic programs. Operations Research, 58(4-part-1):785-801, 2010.

Chen, N., Ma, X., Liu, Y., and Yu, W. Information relaxation and a duality-driven algorithm for stochastic dynamic programs. arXiv preprint arXiv:2007.14295, 2020.

Dann, C., Li, L., Wei, W., and Brunskill, E. Policy certificates: Towards accountable reinforcement learning. arXiv preprint arXiv:1811.03056, 2018.

Even-Dar, E. and Mansour, Y. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1-25, 2003.

Haugh, M. B. and Kogan, L. Pricing American options: a duality approach. Operations Research, 52(2):258-270, 2004.

He, F. S., Liu, Y., Schwing, A. G., and Peng, J. Learning to play in a day: Faster deep reinforcement learning by optimality tightening. arXiv preprint arXiv:1611.01606, 2016.

He, L., Hu, Z., and Zhang, M. Robust repositioning for vehicle sharing. Manufacturing & Service Operations Management, 2019.

Jaakkola, T., Jordan, M. I., and Singh, S. P. Convergence of stochastic iterative dynamic programming algorithms. In Advances in Neural Information Processing Systems, pp. 703-710, 1994.

Jiang, D. R., Al-Kanj, L., and Powell, W. B. Optimistic Monte Carlo tree search with sampled information relaxation dual bounds. Operations Research (forthcoming), 2020.

Kim, J. H. and Powell, W. B. Optimal energy commitments with storage and intermittent supply. Operations Research, 59(6):1347-1360, 2011.

Kunnumkal, S. and Topaloglu, H. Using stochastic approximation methods to compute optimal base-stock levels in inventory control problems. Operations Research, 56(3):646-664, 2008.

Kushner, H. and Yin, G. G. Stochastic Approximation and Recursive Algorithms and Applications, volume 35. Springer Science & Business Media, 2003.

Lee, D. and Powell, W. B. Bias-corrected Q-learning with multistate extension. IEEE Transactions on Automatic Control, 2019.

McMahan, H. B., Likhachev, M., and Gordon, G. J. Bounded real-time dynamic programming: RTDP with monotone upper bounds and performance guarantees. In Proceedings of the 22nd International Conference on Machine Learning, pp. 569-576. ACM, 2005.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Powell, W. B. Approximate Dynamic Programming: Solving the Curses of Dimensionality, volume 703. John Wiley & Sons, 2007.

Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

Rogers, L. C. G. Monte Carlo valuation of American options. Mathematical Finance, 12(3):271-286, 2002.

Sanner, S., Goetschalckx, R., Driessens, K., and Shani, G. Bayesian real-time dynamic programming. In Twenty-First International Joint Conference on Artificial Intelligence, 2009.

Secomandi, N. A rollout policy for the vehicle routing problem with stochastic demands. Operations Research, 49(5):796-802, 2001.

Smith, T. and Simmons, R.
Focused real-time dynamic programming for MDPs: Squeezing more out of a heuristic. In AAAI, pp. 1227-1232, 2006.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Szepesvári, C. The asymptotic convergence-rate of Q-learning. In Advances in Neural Information Processing Systems, pp. 1064-1070, 1998.

Tsitsiklis, J. N. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185-202, 1994.

van Hasselt, H. Double Q-learning. In Advances in Neural Information Processing Systems, pp. 2613-2621, 2010.

Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, 1989.

Zanette, A. and Brunskill, E. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. arXiv preprint arXiv:1901.00210, 2019.