# q-Learning in Continuous Time

Journal of Machine Learning Research 24 (2023) 1-61. Submitted 7/22; Revised 4/23; Published 5/23.

**Yanwei Jia** (jiayw1993@gmail.com), Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, NT, Hong Kong

**Xun Yu Zhou** (xz2574@columbia.edu), Department of Industrial Engineering and Operations Research & The Data Science Institute, Columbia University, New York, NY 10027, USA

**Editor:** Marc Bellemare

**Abstract.** We study the continuous-time counterpart of Q-learning for reinforcement learning (RL) under the entropy-regularized, exploratory diffusion process formulation introduced by Wang et al. (2020). As the conventional (big) Q-function collapses in continuous time, we consider its first-order approximation and coin the term (little) q-function. This function is related to the instantaneous advantage rate function as well as the Hamiltonian. We develop a q-learning theory around the q-function that is independent of time discretization. Given a stochastic policy, we jointly characterize the associated q-function and value function by martingale conditions of certain stochastic processes, in both on-policy and off-policy settings. We then apply the theory to devise different actor-critic algorithms for solving underlying RL problems, depending on whether or not the density function of the Gibbs measure generated from the q-function can be computed explicitly. One of our algorithms interprets the well-known Q-learning algorithm SARSA, and another recovers a policy gradient (PG) based continuous-time algorithm proposed in Jia and Zhou (2022b). Finally, we conduct simulation experiments to compare the performance of our algorithms with those of PG-based algorithms in Jia and Zhou (2022b) and time-discretized conventional Q-learning algorithms.

**Keywords:** continuous-time reinforcement learning, policy improvement, q-function, martingale, on-policy and off-policy
## 1. Introduction

A recent series of papers, Wang et al. (2020) and Jia and Zhou (2022a,b), aims at laying an overarching theoretical foundation for reinforcement learning (RL) in continuous time with continuous state space (diffusion processes) and possibly continuous action space. Specifically, Wang et al. (2020) study how to explore strategically (instead of blindly) by formulating an entropy-regularized, distribution-valued stochastic control problem for diffusion processes. Jia and Zhou (2022a) solve the policy evaluation (PE) problem, namely learning the value function of a given stochastic policy, by characterizing it as a martingale problem. Jia and Zhou (2022b) investigate the policy gradient (PG) problem, that is, determining the gradient of a learned value function with respect to the current policy, and show that PG is mathematically a simpler PE problem and thus solvable by the martingale approach developed in Jia and Zhou (2022a). Combining these theoretical results naturally leads to various online and offline actor-critic (AC) algorithms for general model-free (up to the diffusion dynamics) RL tasks, where one learns value functions and stochastic policies simultaneously and alternatingly. Many of these algorithms recover and/or interpret some well-known existing RL algorithms for Markov decision processes (MDPs) that were often put forward in a heuristic and ad hoc manner.

PG updates a policy along the gradient ascent direction to improve it; so PG is an instance of the general policy improvement (PI) approach. On the other hand, one of the earliest and most popular methods for PI is Q-learning (Watkins, 1989; Watkins and Dayan, 1992), whose key ingredient is to learn the Q-function, a function of state and action.

© 2023 Yanwei Jia and Xun Yu Zhou. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-0755.html.
The learned Q-function is then maximized over actions at each state to achieve improvement of the current policy. The resulting Q-learning algorithms, such as the Q-table, SARSA, and DQN (deep Q-network), are widely used for their simplicity and effectiveness in many applications.¹ Most importantly though, compared with PG, one of the key advantages of Q-learning is that it works both on-policy and off-policy (Sutton and Barto, 2018, Chapter 6).

The Q-function, by definition, is a function of the current state and action, assuming that the agent takes a particular action at the current time and follows through a given control policy starting from the next time step. Therefore, it is intrinsically a notion in discrete time; that is why Q-learning has been predominantly studied for discrete-time MDPs. In a continuous-time setting, the Q-function collapses to the value function, which is independent of action and hence cannot be used to rank and choose the current actions. Indeed, Tallec et al. (2019) opine that "there is no Q-function in continuous time". On the other hand, one may propose discretizing time to obtain a discretized Q-function and then apply the existing Q-learning algorithms. However, Tallec et al. (2019) show experimentally that this approach is very sensitive to the time discretization and performs poorly with small time steps. Kim et al. (2021) take a different approach: they include the action as a state variable in the continuous-time system by restricting the action process to be absolutely continuous in time with a bounded growth rate. Thus Q-learning becomes a policy evaluation problem with the state-action pair as the new state variable. However, they consider only deterministic dynamic systems and discretize the continuous-time problem upfront.
Crucially, the requirement that actions be absolutely continuous is impractically restrictive, because optimal actions are often only measurable in time and have unbounded variation (such as bang-bang controls) even for deterministic systems.

As a matter of fact, the aforementioned series of papers, Wang et al. (2020) and Jia and Zhou (2022a,b), is characterized by carrying out all the theoretical analysis within the continuous setting and taking observations at discrete times (for computing the total reward over time) only when implementing the algorithms. Some advantages of this approach, compared with discretizing time upfront and then applying existing MDP results, are discussed extensively in Doya (2000); Wang et al. (2020); Jia and Zhou (2022a,b). More importantly, the pure continuous-time approach minimizes or eliminates the impact of time discretization on learning, which, as discussed above, is especially critical for continuous-time Q-learning.

1. Q-learning is typically combined with various forms of function approximation, for example, linear function approximation (Melo et al., 2008; Zou et al., 2019), kernel-based nearest neighbor regression (Shah and Xie, 2018), or deep neural networks (Mnih et al., 2015; Fan et al., 2020). Q-learning is not necessarily restricted to finite action spaces in the literature; for example, Q-learning with continuous action spaces is studied in Gu et al. (2016). However, Duan et al. (2016) report that the standard Q-learning algorithm may become less efficient for continuous action spaces.

Now, if we are to study Q-learning strictly within the continuous-time framework, without time discretization or action restrictions, then the first question is what a proper Q-function should be. When time is discretized, Baird (1993) and Mnih et al.
(2016) define the so-called advantage function, which is the difference in value between taking a particular action versus following a given policy at any state; that is, the difference between the state-action value and the state value. Baird (1993, 1994) and Tallec et al. (2019) notice that such an advantage function can be properly scaled with respect to the size of the time discretization so that its continuous-time limit exists; this limit is called the (instantaneous) advantage rate function and can be learned. Advantage updating performs better than conventional Q-learning, as shown numerically in Tallec et al. (2019). However, these papers again consider only deterministic dynamic systems and do not fully develop a theory of this rate function.

Here in the present paper, we attack the problem directly from the continuous-time perspective. In particular, we rename the advantage rate function as the (little) q-function.² This q-function in the finite-time episodic setting is a function of the time-state-action triple under a given policy, analogous to the conventional Q-function. However, it is a continuous-time notion because it does not depend on time discretization. This feature is a vital advantage for learning algorithm design as it avoids the sensitivity with respect to the observation and intervention frequency (the step size of the time discretization). For the entropy-regularized, exploratory setting for diffusion processes (first formulated in Wang et al. 2020) studied in this paper, the q-function of a given policy turns out to be the Hamiltonian, which includes the infinitesimal generator of the dynamic system and the instantaneous reward function (see Yong and Zhou 1999 for details), plus the temporal dispersion that consists of the time derivative of the value function and the depreciation from discounting.
The paper aims to develop a comprehensive q-learning theory around this q-function and accordingly design alternative RL algorithms other than the PG-based AC algorithms in Jia and Zhou (2022b). There are three main questions. The first is to characterize the q-function. For discrete-time MDPs, the PE method is used to learn both the value function and the Q-function because both functions are characterized by similar Bellman equations. By contrast, in the continuous-time diffusion setting, a Bellman equation is only available for the value function (also known as the Feynman-Kac formula, which yields that the value function solves a second-order linear partial differential equation). The second question is to establish a policy improvement theorem based on the q-function and to design corresponding AC algorithms in a sample/data-driven, model-free fashion. The third question is whether the capability of both on- and off-policy learning, a key advantage of Q-learning, is intact in the continuous-time setting.

2. We use the lower-case letter q to denote this rate function, much in the same way as the typical use of the lower-case letter f to denote a probability density function, which is the first-order derivative of the cumulative distribution function usually denoted by the upper-case letter F.

We provide complete answers to all these questions. First, given a stochastic policy along with its (learned) value function, the corresponding q-function is characterized by the martingale condition of a certain stochastic process with respect to an enlarged filtration (information field) that includes both the original environmental noise and the action randomization. Alternatively, the value function and the q-function can be jointly determined by the martingality of the same process. These results suggest that the martingale perspective for PE and PG in Jia and Zhou (2022a,b) continues to work for q-learning.
In particular, we devise several algorithms to learn these functions in the same way as the martingale conditions are employed to generate PE and PG algorithms in Jia and Zhou (2022a,b). Interestingly, a temporal difference (TD) learning algorithm designed from this perspective recovers and, hence, interprets a version of the well-known SARSA algorithm in the discrete-time Q-learning literature. Moreover, inspired by these continuous-time martingale characterizations, in Appendix A we present martingale conditions for Q-learning of discrete-time MDPs, which are new to the best of our knowledge. This demonstrates that the continuous-time RL study may offer new perspectives even in discrete time.

Second, we prove that a Gibbs sampler with a properly scaled current q-function (independent of time discretization) as the exponent of its density function improves upon the current policy. This PI result translates into a policy-updating algorithm when the normalizing constant of the Gibbs measure can be explicitly computed. Otherwise, if the normalizing constant is unavailable, we prove a stronger PI theorem in terms of the Kullback-Leibler (KL) divergence, analogous to a result for discrete-time MDPs in Haarnoja et al. (2018a). One of the algorithms derived from this theorem happens to recover a PG-based AC algorithm in Jia and Zhou (2022b).

Finally, we prove that the aforementioned martingale conditions hold in both the on-policy and off-policy settings. More precisely, the value function and the q-function associated with a given target policy can be learned based on a dataset generated by either the target policy itself or an arbitrary behavior policy. So q-learning indeed works off-policy as well.

The rest of the paper is organized as follows. In Section 2 we review Wang et al. (2020)'s entropy-regularized, exploratory formulation for RL in continuous time and space, and present some useful preliminary results.
In Section 3, we establish the q-learning theory, including the motivation and definition of the q-function and its on-/off-policy martingale characterizations. q-Learning algorithms are developed, depending on whether or not the normalizing constant of the Gibbs measure is available, in Section 4 and Section 5, respectively. Extension to ergodic tasks is presented in Section 6. In Section 7, we illustrate the proposed algorithms and compare them with PG-based algorithms and time-discretized, conventional Q-learning algorithms on two specific examples with simulated data.³ Finally, Section 8 concludes. The Appendix contains various supplementary materials and proofs of statements in the main text.

3. The code to reproduce our simulation studies is publicly available at https://www.dropbox.com/sh/34cgnupnuaix15l/AAAj2yQYfNCOtPUc1_7VhbkIa?dl=0.

## 2. Problem Formulation and Preliminaries

Throughout this paper, by convention all vectors are column vectors unless otherwise specified, and $\mathbb{R}^k$ is the space of all $k$-dimensional vectors (hence $k\times 1$ matrices). We use $\mathbb{S}^k_{++}$ to denote the set of all $k\times k$ symmetric and positive definite matrices. Given two matrices $A$ and $B$ of the same size, we denote by $A\cdot B$ their inner product, by $|A|$ the Euclidean/Frobenius norm of $A$, by $\det A$ the determinant when $A$ is a square matrix, and by $A^\top$ the transpose of any matrix $A$. For $A\in\mathbb{S}^k_{++}$, we write $A^{1/2} = U D^{1/2} U^\top$, where $A = U D U^\top$ is its eigenvalue decomposition with $U$ an orthogonal matrix, $D$ a diagonal matrix, and $D^{1/2}$ the diagonal matrix whose entries are the square roots of those of $D$. We use $f = f(\cdot)$ to denote the function $f$, and $f(x)$ to denote the value of $f$ at $x$. We denote by $N(\mu, \Sigma)$ the probability density function of the multivariate normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$. Finally, for any stochastic process $X = \{X_s, s\ge 0\}$, we denote by $\{\mathcal{F}^X_s\}_{s\ge 0}$ the natural filtration generated by $X$.
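As a small illustration of the notation just introduced, the matrix square root $A^{1/2} = U D^{1/2} U^\top$ of an $A\in\mathbb{S}^k_{++}$ can be computed directly from its eigenvalue decomposition; a minimal sketch in Python (the particular 2-by-2 matrix is an arbitrary example of ours, not from the paper):

```python
import numpy as np

def spd_sqrt(A):
    """Square root of a symmetric positive definite matrix via the
    eigenvalue decomposition A = U D U^T described in the text:
    A^{1/2} = U D^{1/2} U^T."""
    D, U = np.linalg.eigh(A)              # eigenvalues (vector) and orthonormal eigenvectors
    return U @ np.diag(np.sqrt(D)) @ U.T

A = np.array([[2.0, 1.0], [1.0, 2.0]])    # an arbitrary SPD example
S = spd_sqrt(A)                           # S is symmetric and S @ S recovers A
```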
### 2.1 Classical model-based formulation

For the reader's convenience, we first recall the classical, model-based stochastic control formulation. Let $d, n$ be given positive integers, $T > 0$, and let $b : [0,T]\times\mathbb{R}^d\times\mathcal{A} \mapsto \mathbb{R}^d$ and $\sigma : [0,T]\times\mathbb{R}^d\times\mathcal{A} \mapsto \mathbb{R}^{d\times n}$ be given functions, where $\mathcal{A}\subset\mathbb{R}^m$ is the action/control space. The classical stochastic control problem is to control the state (or feature) dynamics governed by a stochastic differential equation (SDE), defined on a filtered probability space $(\Omega, \mathcal{F}, \mathbb{P}^W; \{\mathcal{F}^W_s\}_{s\ge 0})$ along with a standard $n$-dimensional Brownian motion $W = \{W_s, s\ge 0\}$:

$$dX^a_s = b(s, X^a_s, a_s)\,ds + \sigma(s, X^a_s, a_s)\,dW_s,\quad s\in[0,T], \tag{1}$$

where $a_s$ stands for the agent's action at time $s$. The goal of a stochastic control problem is, for each initial time-state pair $(t,x)\in[0,T)\times\mathbb{R}^d$ of (1), to find the optimal $\{\mathcal{F}^W_s\}_{s\ge 0}$-progressively measurable (continuous-time) sequence of actions $a = \{a_s, t\le s\le T\}$, also called the optimal control or optimal strategy, that maximizes the expected total discounted reward:

$$\mathbb{E}^{\mathbb{P}^W}\left[\int_t^T e^{-\beta(s-t)} r(s, X^a_s, a_s)\,ds + e^{-\beta(T-t)} h(X^a_T)\,\Big|\, X^a_t = x\right], \tag{2}$$

where $r$ is an (instantaneous) reward function (i.e., the expected rate of reward conditioned on time, state, and action), $h$ is the lump-sum reward function applied at the end of the planning period $T$, and $\beta \ge 0$ is a constant discount factor that measures the time-value of the payoff or the impatience level of the agent. Note that in the above, the state process $X^a = \{X^a_s, t\le s\le T\}$ also depends on $(t,x)$. However, to ease notation, here and henceforth we use $X^a$ instead of $X^{t,x,a} = \{X^{t,x,a}_s, t\le s\le T\}$ to denote the solution to the SDE (1) with initial condition $X^a_t = x$ whenever no ambiguity may arise.

The (generalized) Hamiltonian is a function $H : [0,T]\times\mathbb{R}^d\times\mathcal{A}\times\mathbb{R}^d\times\mathbb{R}^{d\times d} \mapsto \mathbb{R}$ associated with the problem (1)-(2), defined as (Yong and Zhou, 1999):

$$H(t, x, a, p, q) = b(t, x, a)\cdot p + \frac{1}{2}\sigma\sigma^\top(t, x, a)\cdot q + r(t, x, a). \tag{3}$$
We make the same assumptions as in Jia and Zhou (2022b) to ensure theoretically the well-posedness of the stochastic control problem (1)-(2).

**Assumption 1** The following conditions for the state dynamics and reward functions hold true:

(i) $b, \sigma, r, h$ are all continuous functions in their respective arguments;

(ii) $b, \sigma$ are uniformly Lipschitz continuous in $x$, i.e., for $\phi\in\{b,\sigma\}$, there exists a constant $C > 0$ such that $|\phi(t,x,a) - \phi(t,x',a)| \le C|x - x'|$ for all $(t,a)\in[0,T]\times\mathcal{A}$ and $x, x'\in\mathbb{R}^d$;

(iii) $b, \sigma$ have linear growth in $x$, i.e., for $\phi\in\{b,\sigma\}$, there exists a constant $C > 0$ such that $|\phi(t,x,a)| \le C(1 + |x|)$ for all $(t,x,a)\in[0,T]\times\mathbb{R}^d\times\mathcal{A}$;

(iv) $r$ and $h$ have polynomial growth in $(x,a)$ and $x$ respectively, i.e., there exist constants $C > 0$ and $\mu \ge 1$ such that $|r(t,x,a)| \le C(1 + |x|^\mu + |a|^\mu)$ and $|h(x)| \le C(1 + |x|^\mu)$ for all $(t,x,a)\in[0,T]\times\mathbb{R}^d\times\mathcal{A}$.

Classical model-based stochastic control theory has been well developed (e.g., Fleming and Soner 2006 and Yong and Zhou 1999) to solve the above problem, assuming that the functional forms of $b, \sigma, r, h$ are all given and known. A centerpiece of the theory is the Hamilton-Jacobi-Bellman (HJB) equation

$$\frac{\partial V^*}{\partial t}(t,x) + \sup_{a\in\mathcal{A}} H\left(t, x, a, \frac{\partial V^*}{\partial x}(t,x), \frac{\partial^2 V^*}{\partial x^2}(t,x)\right) - \beta V^*(t,x) = 0,\quad V^*(T,x) = h(x), \tag{4}$$

where $\frac{\partial V^*}{\partial x}\in\mathbb{R}^d$ is the gradient and $\frac{\partial^2 V^*}{\partial x^2}\in\mathbb{R}^{d\times d}$ is the Hessian matrix. Under proper conditions, the unique solution (possibly in the sense of viscosity solutions) to (4) is the optimal value function of the problem (1)-(2), i.e.,

$$V^*(t,x) = \sup_{\{a_s, t\le s\le T\}} \mathbb{E}^{\mathbb{P}^W}\left[\int_t^T e^{-\beta(s-t)} r(s, X^a_s, a_s)\,ds + e^{-\beta(T-t)} h(X^a_T)\,\Big|\, X^a_t = x\right].$$

Moreover, the following function, which maps a time-state pair to an action:

$$a^*(t,x) = \operatorname*{arg\,sup}_{a\in\mathcal{A}}\, H\left(t, x, a, \frac{\partial V^*}{\partial x}(t,x), \frac{\partial^2 V^*}{\partial x^2}(t,x)\right) \tag{5}$$

is the optimal (non-stationary, feedback) control policy of the problem.
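To make the objects above concrete, the controlled SDE (1) and the reward functional (2) can be approximated numerically by an Euler-Maruyama scheme combined with Monte Carlo averaging. The sketch below is illustrative only: the one-dimensional mean-reverting drift, constant diffusion, quadratic rewards, and the feedback rule a = -x are toy assumptions of ours, not part of the paper's model.

```python
import numpy as np

def simulate_reward(policy, b, sigma, r, h, beta, t0, x0, T,
                    dt=0.01, n_paths=2000, seed=0):
    """Monte Carlo estimate of the expected discounted reward (2) under a
    deterministic feedback rule a = policy(s, x), via an Euler-Maruyama
    discretization of the SDE (1)."""
    rng = np.random.default_rng(seed)
    steps = int(round((T - t0) / dt))
    x = np.full(n_paths, x0, dtype=float)
    total = np.zeros(n_paths)
    for k in range(steps):
        s = t0 + k * dt
        a = policy(s, x)
        total += np.exp(-beta * (s - t0)) * r(s, x, a) * dt     # running reward
        dW = rng.normal(0.0, np.sqrt(dt), n_paths)
        x = x + b(s, x, a) * dt + sigma(s, x, a) * dW           # Euler-Maruyama step
    total += np.exp(-beta * (T - t0)) * h(x)                    # lump-sum reward at T
    return total.mean()

# Toy example (our assumptions): mean-reverting dynamics, quadratic rewards.
J_hat = simulate_reward(
    policy=lambda s, x: -x,
    b=lambda s, x, a: a - x,
    sigma=lambda s, x, a: 0.2 * np.ones_like(x),
    r=lambda s, x, a: -x**2,
    h=lambda x: -x**2,
    beta=0.1, t0=0.0, x0=1.0, T=1.0,
)
```

Since the running and terminal rewards in this toy example are nonpositive, the estimate is a negative number; its magnitude shrinks as the state is driven toward zero.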
In view of the definition of the Hamiltonian (3), the maximization in (5) indicates that, at any given time and state, one ought to maximize a suitably weighted average of the (myopic) instantaneous reward and the risk-adjusted, (long-term) positive impact on the system dynamics. See Yong and Zhou (1999) for a detailed account of this theory and many discussions on its economic interpretations and implications.

### 2.2 Exploratory formulation in reinforcement learning

We now present the RL formulation of the problem to be studied in this paper. In the RL setting, the agent has partial or simply no knowledge about the environment (i.e., the functions $b, \sigma, r, h$). What she can do is trial and error: try a (continuous-time) sequence of actions $a = \{a_s, t\le s\le T\}$, observe the corresponding state process $X^a = \{X^a_s, t\le s\le T\}$ along with both a stream of discounted running rewards $\{e^{-\beta(s-t)} r(s, X^a_s, a_s), t\le s\le T\}$ and a discounted, end-of-period lump-sum reward $e^{-\beta(T-t)} h(X^a_T)$, where $\beta$ is a given, known discount factor, and continuously update and improve her actions based on these observations.⁴

A critical question is how to strategically generate these trial-and-error sequences of actions. The idea is randomization; namely, the agent devises and employs a stochastic policy, which is a probability distribution on the action space, to generate actions according to the current time-state pair. It is important to note that such randomization itself is independent of the underlying Brownian motion $W$, the random source of the original control problem that stands for the environmental noises. Specifically, we assume the probability space is rich enough to support uniformly distributed random variables on $[0,1]$ that are independent of $W$; such a uniform random variable can then be used to generate other random variables with given density functions.
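The construction of random variables from uniforms alluded to above is the standard inverse-CDF transform: if $U$ is uniform on $[0,1]$ and $F$ is a CDF, then $F^{-1}(U)$ has CDF $F$. A minimal sketch, where the exponential target distribution is purely an illustrative choice of ours:

```python
import numpy as np

def sample_via_inverse_cdf(inv_cdf, n, seed=0):
    """Turn i.i.d. uniform draws on [0,1] into samples with a prescribed
    distribution by applying the inverse CDF -- the standard construction
    for generating action randomization from uniform random variables."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=n)          # the uniform randomization source
    return inv_cdf(u)

# Exponential(1): CDF F(x) = 1 - exp(-x), so F^{-1}(u) = -log(1 - u).
samples = sample_via_inverse_cdf(lambda u: -np.log1p(-u), n=100_000)
```

The empirical mean of the samples is close to 1, the mean of the Exponential(1) distribution.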
Let $\{Z_t, 0\le t\le T\}$ be a process of mutually independent copies of a uniform random variable on $[0,1]$, the construction of which requires a suitable extension of the probability space; see Sun (2006) for details. We then further expand the filtered probability space to $(\Omega, \mathcal{F}, \mathbb{P}; \{\mathcal{F}_s\}_{s\ge 0})$, where $\mathcal{F}_s = \mathcal{F}^W_s \vee \sigma(Z_t, 0\le t\le s)$ and the probability measure $\mathbb{P}$, now defined on $\mathcal{F}_T$, is an extension of $\mathbb{P}^W$ (i.e., the two probability measures coincide when restricted to $\mathcal{F}^W_T$).

Let $\pi : (t,x)\in[0,T]\times\mathbb{R}^d \mapsto \pi(\cdot|t,x)\in\mathcal{P}(\mathcal{A})$ be a given (feedback) policy, where $\mathcal{P}(\mathcal{A})$ is a suitable collection of probability density functions.⁵ At each time $s$, an action $a_s$ is generated or sampled from the distribution $\pi(\cdot|s, X_s)$.

Fix a stochastic policy $\pi$ and an initial time-state pair $(t,x)$. Consider the following SDE

$$dX^\pi_s = b(s, X^\pi_s, a^\pi_s)\,ds + \sigma(s, X^\pi_s, a^\pi_s)\,dW_s,\quad s\in[t,T];\quad X^\pi_t = x, \tag{6}$$

defined on $(\Omega, \mathcal{F}, \mathbb{P}; \{\mathcal{F}_s\}_{s\ge 0})$, where $a^\pi = \{a^\pi_s, t\le s\le T\}$ is an $\{\mathcal{F}_s\}_{s\ge 0}$-progressively measurable action process generated from $\pi$. Equation (6) is an SDE with random coefficients, whose well-posedness (i.e., existence and uniqueness of a solution) is established in Yong and Zhou (1999, Chapter 1, Theorem 6.16) under Assumption 1. Given $a^\pi$, the unique solution to (6), $X^\pi = \{X^\pi_s, t\le s\le T\}$, is the sample state process corresponding to $a^\pi$ that solves (1).⁶

Moreover, following Wang et al. (2020), we add an entropy regularizer to the reward function to encourage exploration (represented by the stochastic policy), leading to

$$J(t,x;\pi) = \mathbb{E}^{\mathbb{P}}\left[\int_t^T e^{-\beta(s-t)}\big[r(s, X^\pi_s, a^\pi_s) - \gamma\log\pi(a^\pi_s|s, X^\pi_s)\big]\,ds + e^{-\beta(T-t)} h(X^\pi_T)\,\Big|\, X^\pi_t = x\right], \tag{7}$$

where $\mathbb{E}^{\mathbb{P}}$ is the expectation with respect to both the Brownian motion and the action randomization, and $\gamma \ge 0$ is a given weighting parameter on exploration, also known as the temperature parameter.

Wang et al. (2020) consider the following SDE

$$d\tilde{X}_s = \tilde{b}\big(s, \tilde{X}_s, \pi(\cdot|s, \tilde{X}_s)\big)\,ds + \tilde{\sigma}\big(s, \tilde{X}_s, \pi(\cdot|s, \tilde{X}_s)\big)\,dW_s,\quad s\in[t,T];\quad \tilde{X}_t = x, \tag{8}$$

where

$$\tilde{b}\big(s, x, \pi(\cdot)\big) = \int_{\mathcal{A}} b(s, x, a)\pi(a)\,da,\quad \tilde{\sigma}\big(s, x, \pi(\cdot)\big) = \left(\int_{\mathcal{A}} \sigma\sigma^\top(s, x, a)\pi(a)\,da\right)^{1/2}.$$

Intuitively, based on the law of large numbers, the solution of (8), denoted by $\{\tilde{X}^\pi_s, t\le s\le T\}$, is the limit of the average of the sample trajectories $X^\pi$ over the randomization (i.e., over copies of $\pi$). Rigorously, it follows from the property of Markovian projection due to Brunick and Shreve (2013, Theorem 3.6) that $\tilde{X}^\pi_s$ and $X^\pi_s$ have the same distribution for each $s\in[t,T]$. Hence, the value function (7) is identical to

$$J(t,x;\pi) = \mathbb{E}^{\mathbb{P}^W}\left[\int_t^T e^{-\beta(s-t)} R\big(s, \tilde{X}^\pi_s, \pi(\cdot|s, \tilde{X}^\pi_s)\big)\,ds + e^{-\beta(T-t)} h(\tilde{X}^\pi_T)\,\Big|\, \tilde{X}^\pi_t = x\right], \tag{9}$$

where $R(s, x, \pi) = \int_{\mathcal{A}} [r(s, x, a) - \gamma\log\pi(a)]\pi(a)\,da$. The function $J(\cdot,\cdot;\pi)$ is called the value function of the policy $\pi$, and the task of RL is to find

$$J^*(t,x) = \max_{\pi\in\Pi} J(t,x;\pi), \tag{10}$$

where $\Pi$ stands for the set of admissible (stochastic) policies. The following gives the precise definition of admissible policies.

4. This procedure applies to both the offline and online settings. In the former, the agent can repeatedly try different sequences of actions over the full time period $[0,T]$, record the corresponding state processes and payoffs, and then learn and update the policy based on the resulting dataset. In the latter, the agent updates her actions as she goes, based on all the up-to-date historical observations.

5. Here we assume that the action space $\mathcal{A}$ is continuous and randomization is restricted to those distributions that have density functions. The analysis and results of this paper can be easily extended to the cases of discrete action spaces and/or randomization with probability mass functions.

6. Here, $X^\pi$ also depends on the specific copy $a^\pi$ sampled from $\pi$; however, to ease notation we write $X^\pi$ instead of $X^{\pi, a^\pi}$.
**Definition 1** A policy $\pi = \pi(\cdot|\cdot,\cdot)$ is called admissible if

(i) $\pi(\cdot|t,x)\in\mathcal{P}(\mathcal{A})$, $\operatorname{supp}\pi(\cdot|t,x) = \mathcal{A}$ for every $(t,x)\in[0,T]\times\mathbb{R}^d$, and $\pi(a|t,x) : (t,x,a)\in[0,T]\times\mathbb{R}^d\times\mathcal{A} \mapsto \mathbb{R}$ is measurable;

(ii) $\pi(a|t,x)$ is continuous in $(t,x)$ and uniformly Lipschitz continuous in $x$ in the total variation distance, i.e., $\int_{\mathcal{A}} |\pi(a|t,x) - \pi(a|u,x')|\,da \to 0$ as $(u,x')\to(t,x)$, and there is a constant $C > 0$ independent of $(t,a)$ such that

$$\int_{\mathcal{A}} |\pi(a|t,x) - \pi(a|t,x')|\,da \le C|x - x'|,\quad \forall x, x'\in\mathbb{R}^d;$$

(iii) for any given $\alpha > 0$, the entropy of $\pi$ and its $\alpha$-th moment have polynomial growth in $x$, i.e., there are constants $C = C(\alpha) > 0$ and $\mu' = \mu'(\alpha) \ge 1$ such that $\big|\int_{\mathcal{A}} \log\pi(a|t,x)\,\pi(a|t,x)\,da\big| \le C(1 + |x|^{\mu'})$ and $\int_{\mathcal{A}} |a|^\alpha \pi(a|t,x)\,da \le C(1 + |x|^{\mu'})$ for all $(t,x)$.

Under Assumption 1 along with (i) and (ii) in Definition 1, Jia and Zhou (2022b, Lemma 2) show that the SDE (8) admits a unique strong solution, since its coefficients are locally Lipschitz continuous and have linear growth; see Mao (2007, Chapter 2, Theorem 3.4) for the general result. In addition, the growth condition (iii) guarantees that the new payoff function (9) is finite. More discussions on the conditions required in the above definition can be found in Jia and Zhou (2022b). Moreover, Definition 1 contains conditions only on the policy itself; hence they can be verified.

Note that the problem (8)-(9) is mathematically equivalent to the problem (6)-(7). Nevertheless, they serve different purposes in our study: the former provides a framework for theoretical analysis of the value function, while the latter directly involves observable samples for algorithm design.

### 2.3 Some useful preliminary results

Lemma 2 in Jia and Zhou (2022b) yields that the value function of a given admissible policy $\pi = \pi(\cdot|\cdot,\cdot)\in\Pi$ satisfies the following PDE:

$$\frac{\partial J}{\partial t}(t,x;\pi) + \int_{\mathcal{A}}\left[H\left(t, x, a, \frac{\partial J}{\partial x}(t,x;\pi), \frac{\partial^2 J}{\partial x^2}(t,x;\pi)\right) - \gamma\log\pi(a|t,x)\right]\pi(a|t,x)\,da - \beta J(t,x;\pi) = 0,\quad J(T,x;\pi) = h(x). \tag{11}$$

This is a version of the celebrated Feynman-Kac formula in the current RL setting.
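Condition (ii) of Definition 1 can be checked numerically for concrete policy families. The sketch below evaluates $\int_{\mathcal{A}} |\pi(a|t,x) - \pi(a|t,x')|\,da$ on a grid for a Gaussian feedback policy $\pi(\cdot|t,x) = N(x, \text{sd}^2)$, which satisfies the Lipschitz bound with $C = \sqrt{2/\pi}/\text{sd}$; the Gaussian family and the particular numbers are illustrative assumptions of ours.

```python
import numpy as np

def l1_policy_distance(x1, x2, sd):
    """Integral over actions of |pi(a|x1) - pi(a|x2)| for a Gaussian
    feedback policy pi(.|x) = N(x, sd^2) -- the quantity bounded by
    C|x1 - x2| in condition (ii) of Definition 1."""
    a = np.linspace(min(x1, x2) - 10 * sd, max(x1, x2) + 10 * sd, 40001)
    da = a[1] - a[0]
    p = lambda m: np.exp(-0.5 * ((a - m) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    return np.abs(p(x1) - p(x2)).sum() * da   # grid approximation of the integral

# The Lipschitz bound holds with C = sqrt(2/pi)/sd (tight as |x1 - x2| -> 0).
d = l1_policy_distance(0.0, 0.5, sd=1.0)
```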
Under Assumption 1 as well as the admissibility conditions in Definition 1, the PDE (11) admits a unique viscosity solution among polynomially growing functions (Beck et al., 2021, Theorem 1.1).

On the other hand, Tang et al. (2022, Section 5) derive the following exploratory HJB equation for the problem (8)-(9), satisfied by the optimal value function:

$$\frac{\partial J^*}{\partial t}(t,x) + \sup_{\pi\in\mathcal{P}(\mathcal{A})} \int_{\mathcal{A}}\left[H\left(t, x, a, \frac{\partial J^*}{\partial x}(t,x), \frac{\partial^2 J^*}{\partial x^2}(t,x)\right) - \gamma\log\pi(a)\right]\pi(a)\,da - \beta J^*(t,x) = 0,\quad J^*(T,x) = h(x). \tag{12}$$

Moreover, the optimal (stochastic) policy is a Gibbs measure or Boltzmann distribution:

$$\pi^*(\cdot|t,x) \propto \exp\left\{\frac{1}{\gamma} H\left(t, x, \cdot, \frac{\partial J^*}{\partial x}(t,x), \frac{\partial^2 J^*}{\partial x^2}(t,x)\right)\right\},$$

or, after normalization,

$$\pi^*(a|t,x) = \frac{\exp\left\{\frac{1}{\gamma} H\left(t, x, a, \frac{\partial J^*}{\partial x}(t,x), \frac{\partial^2 J^*}{\partial x^2}(t,x)\right)\right\}}{\int_{\mathcal{A}} \exp\left\{\frac{1}{\gamma} H\left(t, x, a, \frac{\partial J^*}{\partial x}(t,x), \frac{\partial^2 J^*}{\partial x^2}(t,x)\right)\right\}\,da}. \tag{13}$$

This result shows that one should in general use a Gibbs sampler to generate trial-and-error strategies to explore the environment when the regularizer is chosen to be entropy.⁷ In the special case when the system dynamics are linear in the action and the payoffs are quadratic in the action, the Hamiltonian is a quadratic function of the action, and the Gibbs measure thus specializes to a Gaussian under mild conditions.

Plugging (13) into (12) to replace the supremum operator therein leads to the following equivalent form of the exploratory HJB equation:

$$\frac{\partial J^*}{\partial t}(t,x) + \gamma\log\int_{\mathcal{A}} \exp\left\{\frac{1}{\gamma} H\left(t, x, a, \frac{\partial J^*}{\partial x}(t,x), \frac{\partial^2 J^*}{\partial x^2}(t,x)\right)\right\}\,da - \beta J^*(t,x) = 0,\quad J^*(T,x) = h(x). \tag{14}$$

More theoretical properties regarding (14), and consequently the problem (6)-(7), including its well-posedness, can be found in Tang et al. (2022).

The Gibbs measure can also be used to derive a policy improvement theorem. Wang and Zhou (2020) prove such a theorem in the context of continuous-time mean-variance portfolio selection, which is essentially a linear-quadratic (LQ) control problem. The following extends that result to the general case.

**Theorem 2** For any given $\pi\in\Pi$, define

$$\pi'(\cdot|t,x) \propto \exp\left\{\frac{1}{\gamma} H\left(t, x, \cdot, \frac{\partial J}{\partial x}(t,x;\pi), \frac{\partial^2 J}{\partial x^2}(t,x;\pi)\right)\right\}.$$

If $\pi'\in\Pi$, then $J(t,x;\pi') \ge J(t,x;\pi)$.
Moreover, if the map

$$I(\pi) = \frac{\exp\left\{\frac{1}{\gamma} H\left(t, x, \cdot, \frac{\partial J}{\partial x}(t,x;\pi), \frac{\partial^2 J}{\partial x^2}(t,x;\pi)\right)\right\}}{\int_{\mathcal{A}} \exp\left\{\frac{1}{\gamma} H\left(t, x, a, \frac{\partial J}{\partial x}(t,x;\pi), \frac{\partial^2 J}{\partial x^2}(t,x;\pi)\right)\right\}\,da},\quad \pi\in\Pi,$$

has a fixed point $\pi^*$, then $\pi^*$ is the optimal policy.

At this point, Theorem 2 remains a theoretical result that cannot be directly applied to learning procedures, because the Hamiltonian $H\big(t, x, a, \frac{\partial J}{\partial x}(t,x;\pi), \frac{\partial^2 J}{\partial x^2}(t,x;\pi)\big)$ depends on knowledge of the model parameters, which we do not have in the RL context. To develop implementable algorithms, we turn to Q-learning.

## 3. q-Function in Continuous Time: The Theory

This section is the theoretical foundation of the paper, with the analysis entirely in continuous time. We start by defining a Q-function parameterized by a time step $\Delta t > 0$, and then motivate the notion of the q-function, which is independent of $\Delta t$. We further provide martingale characterizations of the q-function.

7. We stress here that the Gibbs sampler is due to the entropy regularizer, and it is possible to have other types of distributions for exploration. For example, Han et al. (2022) propose a family of regularizers that lead to exponential, uniform, and $\varepsilon$-greedy exploratory policies. O'Donoghue (2021) and Gao et al. (2020) study how to choose time- and state-dependent temperature parameters. One may also carry out exploration via randomizing the value function instead of policies; see, e.g., Osband et al. (2019).

### 3.1 Q-function

Tallec et al. (2019) consider a continuous-time MDP and then discretize it upfront to a discrete-time MDP with time discretization $\delta t$. In that setting, the authors argue that there is no Q-function in continuous time because the Q-function collapses to the value function when $\delta t$ is infinitesimally small. Here, we take an entirely different approach that does not involve time discretization.

Incidentally, by comparing the form of the Gibbs measure (13) with that of the widely employed Boltzmann exploration for learning in discrete-time MDPs, Gao et al.
(2020) and Zhou (2021) conjecture that the continuous-time counterpart of the Q-function is the Hamiltonian. Our approach will also provide a rigorous justification of and, indeed, a refinement of this conjecture.

Given $\pi\in\Pi$ and $(t,x,a)\in[0,T)\times\mathbb{R}^d\times\mathcal{A}$, consider a perturbed policy of $\pi$, denoted by $\hat{\pi}$, as follows: it takes the action $a\in\mathcal{A}$ on $[t, t+\Delta t)$ where $\Delta t > 0$, and then follows $\pi$ on $[t+\Delta t, T]$. The corresponding state process $X^{\hat{\pi}}$, given $X^{\hat{\pi}}_t = x$, can be broken into two pieces. On $[t, t+\Delta t)$, it is $X^a$, the solution to

$$dX^a_s = b(s, X^a_s, a)\,ds + \sigma(s, X^a_s, a)\,dW_s,\quad s\in[t, t+\Delta t);\quad X^a_t = x,$$

while on $[t+\Delta t, T]$ it is $X^\pi$ following (6) but with the initial time-state pair $(t+\Delta t, X^a_{t+\Delta t})$. With $\Delta t > 0$ fixed, we introduce the ($\Delta t$-parameterized) Q-function, denoted by $Q_{\Delta t}(t,x,a;\pi)$, to be the expected reward drawn from the perturbed policy $\hat{\pi}$:⁸

$$Q_{\Delta t}(t,x,a;\pi) = \mathbb{E}^{\mathbb{P}}\bigg[\int_t^{t+\Delta t} e^{-\beta(s-t)} r(s, X^a_s, a)\,ds + \int_{t+\Delta t}^T e^{-\beta(s-t)}\big[r(s, X^\pi_s, a^\pi_s) - \gamma\log\pi(a^\pi_s|s, X^\pi_s)\big]\,ds + e^{-\beta(T-t)} h(X^\pi_T)\,\bigg|\, X^{\hat{\pi}}_t = x\bigg].$$

Recall that our formulation (7) includes an entropy regularizer term that incentivizes exploration using stochastic policies. However, in defining $Q_{\Delta t}(t,x,a;\pi)$ above we do not include such a term on $[t, t+\Delta t)$ for $\hat{\pi}$, because a deterministic constant action $a$ is applied there, whose entropy is excluded.

The following proposition provides an expansion of this Q-function in $\Delta t$.

**Proposition 3** We have

$$Q_{\Delta t}(t,x,a;\pi) = J(t,x;\pi) + \left[\frac{\partial J}{\partial t}(t,x;\pi) + H\left(t, x, a, \frac{\partial J}{\partial x}(t,x;\pi), \frac{\partial^2 J}{\partial x^2}(t,x;\pi)\right) - \beta J(t,x;\pi)\right]\Delta t + o(\Delta t).$$

So, this $\Delta t$-based Q-function consists of three terms. The leading term is the value function $J$, which is independent of the action $a$ taken. This is natural because we only apply $a$ for a time window of length $\Delta t$, and hence the impact of this action is only of the order $\Delta t$.

8. In the existing literature, Q-learning with entropy regularization is often referred to as "soft Q-learning", with the associated "soft Q-function".
A review of and more discussions on soft Q-learning in discrete time can be found in Appendix A as well as in Haarnoja et al. (2018b); Schulman et al. (2017). Jia and Zhou Next, the first-order term of t is the Hamiltonian plus the temporal change of the value function (consisting of the derivative of the value function in time and the discount term, both of which would disappear for stationary and non-discounted problems), in which only the Hamiltonian depends on the action a. It is interesting to note that these are the same terms in the Feynman Kac PDE (11), less the entropy term. Finally, the higher-order residual term, o( t), comes from the approximation error of the integral over [t, t+ t] and can be ignored when such errors are aggregated as t gets smaller. Proposition 3 yields that this Q-function also collapses to the value function when t 0, similar to what Tallec et al. (2019) claim. However, we must emphasize that our Q-function is not the same as the one used in Tallec et al. (2019), who discretize time upfront and study the resulting discrete-time MDP. The corresponding Q-function and value function therein exist based on the discrete-time RL theory, denoted respectively by Qδt, V δt where δt is the time discretization size. Then Tallec et al. (2019) argue that the advantage function Aδt = Qδt V δt can be properly scaled by δt, leading to the existence of limδt 0 Aδt δt . In short, the Q-function and the resulting algorithms in Tallec et al. (2019) are still in the realm of discrete-time MDPs. By contrast, Q t(t, x, a; π) introduced here is a continuous-time notion. It is the value function of a policy that applies a constant action in [t, t + t] and follows π thereafter. So t here is a parameter representing the length of period in which the constant action is taken, not the time discretization size as in Tallec et al. (2019). Most importantly, our Q-function is just a tool used to introduce the qfunction, the centerpiece of this paper. 
Having said all this, we recognize that the limit $\lim_{\delta t\to 0} \frac{A^{\delta t}}{\delta t}$ identified by Tallec et al. (2019) is the counterpart of our q-function in their setting, and that we arrive at the same object in two different ways (discretizing upfront and then taking $\delta t\to 0$, versus a genuinely continuous-time analysis).

3.2 q-function

Since the leading term in the Q-function $Q_{\Delta t}(t, x, a; \pi)$ coincides with the value function of $\pi$ and hence cannot be used to rank actions $a$, we focus on the first-order approximation, which gives an infinitesimal state–action value. Motivated by this, we define the q-function as follows:

Definition 4 The q-function of the problem (6)–(7) associated with a given policy $\pi\in\Pi$ is defined as

$$q(t,x,a;\pi) = \frac{\partial J}{\partial t}(t,x;\pi) + H\Big(t,x,a,\frac{\partial J}{\partial x}(t,x;\pi),\frac{\partial^2 J}{\partial x^2}(t,x;\pi)\Big) - \beta J(t,x;\pi),\quad (t,x,a)\in[0,T]\times\mathbb{R}^d\times A. \tag{16}$$

Clearly, this function is the first-order derivative of the Q-function with respect to $\Delta t$, as an immediate consequence of Proposition 3:

Corollary 5 We have

$$q(t,x,a;\pi) = \lim_{\Delta t\to 0}\frac{Q_{\Delta t}(t,x,a;\pi) - J(t,x;\pi)}{\Delta t}.$$

Some remarks are in order. First, the q-function is a function of the time–state–action triple under a given policy, analogous to the conventional Q-function for MDPs; yet it is a continuous-time notion because it does not depend on any time discretization. This feature is a vital advantage for learning-algorithm design, as Tallec et al. (2019) point out that the performance of RL algorithms is very sensitive to the time discretization. Second, the q-function is related to the advantage function in the MDP literature (e.g., Baird 1993; Mnih et al. 2016), which describes the difference between the state–action value and the state value. Here, the q-function reflects the instantaneous advantage rate of a given action at a given time–state pair under a given policy.
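The first-order relation in Corollary 5 can be illustrated numerically. The following sketch uses hypothetical stand-in functions (none of them from the paper's model): a toy value function `J`, a toy q-function `q`, and a Q-function built to satisfy the expansion of Proposition 3 with an explicit $o(\Delta t)$ term; the difference quotient then recovers `q` as the step shrinks.

```python
# Toy illustration of Corollary 5: the q-function is the first-order
# coefficient of the Delta-t-parameterized Q-function.
# All functions below are hypothetical stand-ins, not the paper's model.

def J(t, x):
    # stand-in value function
    return x**2 * (1.0 - t)

def q(t, x, a):
    # stand-in q-function (instantaneous advantage rate)
    return -(a - x)**2 + 0.5

def Q(t, x, a, dt):
    # a Q-function consistent with Proposition 3:
    # Q_dt = J + q*dt + o(dt), with o(dt) = 0.3*dt**2 here
    return J(t, x) + q(t, x, a) * dt + 0.3 * dt**2

t, x, a = 0.2, 1.5, 0.7
for dt in (1e-1, 1e-2, 1e-3):
    rate = (Q(t, x, a, dt) - J(t, x)) / dt   # difference quotient
    print(dt, rate)                          # approaches q(t, x, a) as dt -> 0
```

The printed rates differ from `q(t, x, a)` by exactly the `0.3*dt` residue, mirroring how the $o(\Delta t)$ error vanishes in the limit.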
We also notice that in (16) only the Hamiltonian depends on a; hence the improved policy presented in Theorem 2 can also be expressed in terms of the q-function: π ( |t, x) exp 1 γ H t, x, , J x(t, x; π), 2J x2 (t, x; π) exp 1 γ q(t, x, ; π) . Consequently, if we can learn the q-function q( , , ; π) under any policy π, then it follows from Theorem 2 that we can improve π by implementing a Gibbs measure over the q-values, analogous to, say, ε-greedy policy in classical Q-learning (Sutton and Barto, 2018, Chapter 6). Finally, the analysis so far justifies the aforementioned conjecture about the proper continuous-time version of the Q-function, and provides a theoretical interpretation to the widely used Boltzmann exploration in RL. The main theoretical results of this paper are martingale characterizations of the qfunction, which can in turn be employed to devise algorithms for learning crucial functions including the q-function, in the same way as in applying the martingale approach for PE (Jia and Zhou, 2022a).9 The following first result characterizes the q-function associated with a given policy π, assuming that its value function has already been learned and known. Theorem 6 Let a policy π Π, its value function J and a continuous function ˆq : [0, T] Rd A R be given. Then (i) ˆq(t, x, a) = q(t, x, a; π) for all (t, x, a) [0, T] Rd A if and only if for all (t, x) [0, T] Rd, the following process e βs J(s, Xπ s ; π) + Z s t e βu[r(u, Xπ u , aπ u ) ˆq(u, Xπ u , aπ u )]du (18) is an ({Fs}s 0, P)-martingale, where {Xπ s , t s T} is the solution to (6) under π with Xπ t = x. (ii) If ˆq(t, x, a) = q(t, x, a; π) for all (t, x, a) [0, T] Rd A, then for any π Π and for all (t, x) [0, T] Rd, the following process e βs J(s, Xπ s ; π) + Z s t e βu[r(u, Xπ u , aπ u ) ˆq(u, Xπ u , aπ u )]du (19) is an ({Fs}s 0, P)-martingale, where {Xπ s , t s T} is the solution to (6) under π with initial condition Xπ t = x. 9. 
Some discrete-time counterparts of these results are presented in Appendix A. Jia and Zhou (iii) If there exists π Π such that for all (t, x) [0, T] Rd, (19) is an ({Fs}s 0, P)- martingale where Xπ t = x, then ˆq(t, x, a) = q(t, x, a; π) for all (t, x, a) [0, T] Rd A. Moreover, in any of the three cases above, the q-function satisfies Z q(t, x, a; π) γ log π(a|t, x) π(a|t, x)da = 0, (t, x) [0, T] Rd. (20) Jia and Zhou (2022a) unify and characterize the learning of value function (i.e., PE) by martingale conditions of certain stochastic processes. Theorem 6 shows that learning the q-function again boils down to maintaining the martingality of the processes (18) or (19). However, there are subtle differences between these martingale conditions. Jia and Zhou (2022a) consider only deterministic policies so the martingales therein are with respect to ({FW s }s 0, PW ). Jia and Zhou (2022b) extends the policy evaluation to include stochastic policies but the related martingales are in terms of the averaged state Xπ and hence also with respect to ({FW s }s 0, PW ). By contrast, the martingality in Theorem 6 is with respect to ({Ft}t 0, P), where the filtration {Ft}t 0 is the enlarged one that includes the randomness for generating actions. Note that the filtration determines the class of test functions necessary to characterize a martingale through the so-called martingale orthogonality conditions (Jia and Zhou, 2022a). So the above martingale conditions suggest that one ought to choose test functions dependent of the past and current actions when designing algorithms to learn the q-function. Moreover, the q-function can be interpreted as a compensator to guarantee the martingality over this larger information field.10 Theorem 6-(i) informs on-policy learning, namely, learning the q-function of the given policy π, called the target policy, based on data {(s, Xπ s , aπ s ), t s T} generated by π itself. 
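The policy-improvement step described above, a Gibbs measure over learned q-values, can be sketched on a discretized one-dimensional action grid. The temperature, grid, and q-values below are illustrative, not taken from the paper; the point is only that normalizing $\exp\{q/\gamma\}$ yields a density concentrated on high-q actions.

```python
import numpy as np

# Sketch: forming the improved (Gibbs) policy from learned q-values on a
# discretized action grid.  Temperature and q-values are illustrative.

gamma = 0.5                        # temperature parameter
actions = np.linspace(-2.0, 2.0, 401)
da = actions[1] - actions[0]
q_vals = -(actions - 0.8)**2       # hypothetical q(t, x, .; pi) at a fixed (t, x)

logits = q_vals / gamma
logits -= logits.max()             # subtract max for numerical stability
density = np.exp(logits)
density /= density.sum() * da      # normalize into a density on the grid

# the density integrates to one and peaks at the q-maximizing action
print(density.sum() * da, actions[density.argmax()])
```

Subtracting the maximum logit before exponentiating is the standard guard against overflow when $\gamma$ is small relative to the spread of q-values.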
Nevertheless, one of the key advantages of classical Q-learning for MDPs is that it also works for off-policy learning, namely, learning the q-function of a given target policy π based on data generated by a different admissible policy π , called a behavior policy. On-policy and off-policy reflect two different learning settings depending on the availability of data and/or the choice of a learning agent. When data generation can be controlled by the agent, conducting on-policy learning is possible although she could still elect off-policy learning. By contrast, when data generation is not controlled by the agent and she has to rely on data under other policies, then it becomes off-policy learning. Theorem 6-(ii) and -(iii) stipulate that our q-learning in continuous time can also be off-policy. Finally, (20) is a consistency condition to uniquely identify the q-function. If we only consider deterministic policies of the form a( , ) in which case γ = 0 and π( |t, x) degenerates into the Dirac measure concentrating on a(t, x), then (20) reduces to q(t, x, a(t, x); a) = 0, which corresponds to equation (23) in Tallec et al. (2019). The following result strengthens Theorem 6, in the sense that it characterizes the value function and the q-function under a given policy simultaneously. 10. More precisely, R t 0 e βsq(s, Xπ s , aπ s ; π)ds is the compensator of e βt J(t, Xπ t ; π) + R t 0 e βsr(s, Xπ s , aπ s )ds. Recall that a compensator of an adapted stochastic process Yt is a predictable process At such that Yt At is a local martingale. Intuitively speaking, the compensator is the drift part of a diffusion process, extracting the trend of the process. Continuous-Time q-Learning Theorem 7 Let a policy π Π, a function ˆJ C1,2 [0, T) Rd C [0, T] Rd with polynomial growth, and a continuous function ˆq : [0, T] Rd A R be given satisfying ˆJ(T, x) = h(x), Z ˆq(t, x, a) γ log π(a|t, x) π(a|t, x)da = 0, (t, x) [0, T] Rd. 
(21)

(i) $\hat J$ and $\hat q$ are respectively the value function and the q-function associated with $\pi$ if and only if for all $(t, x)\in[0, T]\times\mathbb{R}^d$, the following process

$$e^{-\beta s}\hat J(s,X^\pi_s) + \int_t^s e^{-\beta u}\big[r(u,X^\pi_u,a^\pi_u) - \hat q(u,X^\pi_u,a^\pi_u)\big]\mathrm{d}u \tag{22}$$

is an $(\{\mathcal{F}_s\}_{s\ge 0}, \mathbb{P})$-martingale, where $\{X^\pi_s, t\le s\le T\}$ is the solution to (6) with $X^\pi_t = x$.

(ii) If $\hat J$ and $\hat q$ are respectively the value function and the q-function associated with $\pi$, then for any $\pi'\in\Pi$ and all $(t, x)\in[0, T]\times\mathbb{R}^d$, the following process

$$e^{-\beta s}\hat J(s,X^{\pi'}_s) + \int_t^s e^{-\beta u}\big[r(u,X^{\pi'}_u,a^{\pi'}_u) - \hat q(u,X^{\pi'}_u,a^{\pi'}_u)\big]\mathrm{d}u \tag{23}$$

is an $(\{\mathcal{F}_s\}_{s\ge 0}, \mathbb{P})$-martingale, where $\{X^{\pi'}_s, t\le s\le T\}$ is the solution to (6) under $\pi'$ with initial condition $X^{\pi'}_t = x$.

(iii) If there exists $\pi'\in\Pi$ such that for all $(t, x)\in[0, T]\times\mathbb{R}^d$, (23) is an $(\{\mathcal{F}_s\}_{s\ge 0}, \mathbb{P})$-martingale where $X^{\pi'}_t = x$, then $\hat J$ and $\hat q$ are respectively the value function and the q-function associated with $\pi$.

Moreover, in any of the three cases above, if it further holds that $\pi(a|t,x) = \frac{\exp\{\frac{1}{\gamma}\hat q(t,x,a)\}}{\int_A \exp\{\frac{1}{\gamma}\hat q(t,x,a)\}\mathrm{d}a}$, then $\pi$ is the optimal policy and $\hat J$ is the optimal value function.

Theorem 7 characterizes the value function and the q-function in terms of a single martingale condition, in each of the on-policy and off-policy settings, which will be the foundation for designing the learning algorithms in this paper. The conditions in (21) ensure that $\hat J$ matches the correct terminal payoff function and that $\hat q$ is the soft q-function accounting for the entropy regularizer. In particular, if the policy is the Gibbs measure generated by $\frac{1}{\gamma}\hat q$, then $\hat J$ and $\hat q$ are respectively the value function and the q-function under the optimal policy. Finally, note the (subtle) difference between Theorem 7 and Theorem 6.
There may be multiple ( ˆJ, ˆq) pairs satisfying the martingale conditions of (22) or (23) in Theorem 7; so the conditions in (21) are required for identifying the correct value function and q-function.11 By contrast, these conditions are implied if the correct value function has already been known and given, as in Theorem 6. 11. If the terminal condition ˆJ(T, x) = h(x) is changed to ˆJ(T, x) = ˆh(x), then ( ˆJ, ˆq) satisfying the martingale condition (22) or (23) would be the value function and q-function corresponding to a different learning task with the pair of running and terminal reward functions being (r, ˆh). If the normalization condition on ˆq is missing, say R A ˆq(t, x, a) γ log π(a|t, x) π(a|t, x)da = ˆr(t, x) holds instead, then ( ˆJ, ˆq) would be the value function and q-function corresponding to the learning task with the running and terminal reward functions (r ˆr, h). Jia and Zhou 3.3 Optimal q-function We now focus on the q-function associated with the optimal policy π in (13). Based on Definition 4, it is defined as q (t, x, a) = J t (t, x) + H t, x, a, J x (t, x), 2J x2 (t, x) βJ (t, x), (24) where J is the optimal value function satisfying the exploratory HJB equation (14). Proposition 8 We have Z γ q (t, x, a)}da = 1, (25) for all (t, x), and consequently the optimal policy π is π (a|t, x) = exp{1 γ q (t, x, a)}. (26) As seen from its proof (in Appendix C), Proposition 8 is an immediate consequence of the exploratory HJB equation (14). Indeed, (25) is equivalent to (14) when viewed as an equation for J due to (24). However, in terms of q , satisfying (25) is only a necessary condition or a minimum requirement for being the optimal q-function, and by no means sufficient, in the current RL setting. This is because in the absence of knowledge about the primitives b, σ, r, h, we are unable to determine J from q by their relationship (24) which can be viewed as a PDE for J . 
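The role of the normalization conditions, (20) under a given policy, and its footnote-11 variant, in pinning down the additive level of a candidate q-function can be seen numerically. In the sketch below, a one-dimensional Gaussian policy and a q-shape are illustrative choices (not the paper's model); the constant `c` is solved from condition (20), after which the condition's residual vanishes.

```python
import numpy as np

# Sketch: the consistency condition (20) determines the additive level of
# a candidate q-function under a given policy.  Policy and q-shape are
# illustrative (1-d action, Gaussian policy).

gamma = 0.4
mu, sigma = 0.3, 0.7                       # hypothetical policy N(mu, sigma^2)
a = np.linspace(mu - 8*sigma, mu + 8*sigma, 20001)
da = a[1] - a[0]
pi = np.exp(-(a - mu)**2 / (2*sigma**2)) / (sigma * np.sqrt(2*np.pi))

q_shape = -(a - mu)**2                     # candidate q up to a constant c
# condition (20):  integral of (q_shape + c - gamma*log pi) * pi da = 0
c = np.sum((gamma*np.log(pi) - q_shape) * pi) * da
q_hat = q_shape + c

residual = np.sum((q_hat - gamma*np.log(pi)) * pi) * da
print(c, residual)                         # residual is ~ 0
```

Without such a condition, any constant shift of `q_hat` would satisfy the same martingale property, which is exactly the indeterminacy the theorem's constraints rule out.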
To fully characterize $J^*$ as well as $q^*$, we have to resort to a martingale condition, as stipulated in the following result.

Theorem 9 Let a function $\hat J^* \in C^{1,2}([0, T)\times\mathbb{R}^d)\cap C([0, T]\times\mathbb{R}^d)$ with polynomial growth and a continuous function $\hat q^*: [0, T]\times\mathbb{R}^d\times A\to\mathbb{R}$ be given satisfying

$$\hat J^*(T,x) = h(x),\qquad \int_A \exp\Big\{\frac{1}{\gamma}\hat q^*(t,x,a)\Big\}\mathrm{d}a = 1,\quad (t,x)\in[0,T]\times\mathbb{R}^d. \tag{27}$$

(i) If $\hat J^*$ and $\hat q^*$ are respectively the optimal value function and the optimal q-function, then for any $\pi\in\Pi$ and all $(t, x)\in[0, T]\times\mathbb{R}^d$, the following process

$$e^{-\beta s}\hat J^*(s,X^\pi_s) + \int_t^s e^{-\beta u}\big[r(u,X^\pi_u,a^\pi_u) - \hat q^*(u,X^\pi_u,a^\pi_u)\big]\mathrm{d}u \tag{28}$$

is an $(\{\mathcal{F}_s\}_{s\ge 0}, \mathbb{P})$-martingale, where $\{X^\pi_s, t\le s\le T\}$ is the solution to (6) under $\pi$ with $X^\pi_t = x$. Moreover, in this case, $\hat\pi^*(a|t,x) = \exp\{\frac{1}{\gamma}\hat q^*(t,x,a)\}$ is the optimal policy.

(ii) If there exists one $\pi\in\Pi$ such that for all $(t, x)\in[0, T]\times\mathbb{R}^d$, (28) is an $(\{\mathcal{F}_s\}_{s\ge 0}, \mathbb{P})$-martingale, then $\hat J^*$ and $\hat q^*$ are respectively the optimal value function and the optimal q-function.

Theorem 9 lays a foundation for off-policy learning without having to go through iterative policy improvement. We emphasize that to learn the optimal policy and the optimal q-function, it is infeasible to conduct on-policy learning, because $\pi^*$ is unknown and hence the constraints (21) can be neither checked nor enforced. The new constraints in (27) no longer depend on policies; instead, they give basic requirements on the candidate value function and q-function. Maintaining the martingality of (28) under any admissible policy is equivalent to the optimality of the candidate value function and q-function. Moreover, Theorem 9-(ii) suggests that we do not need data generated by all admissible policies; those generated by any single policy suffice. For example, we could take $\pi(a|t, x) = \exp\{\frac{1}{\gamma}\hat q^*(t, x, a)\}$ as the behavior policy. Finally, let us remark that Theorems 7 and 9 are independent of each other, in the sense that neither implies the other.
One may design different algorithms from them depending on different cases and needs. 4. q-Learning Algorithms When Normalizing Constant Is Available This section and the next discuss algorithm design based on the theoretical results in the previous section. Theorems 7 and 9 provide the theoretical foundation for designing both on-policy and off-policy algorithms to simultaneously learn and update the value function (critic) and the policy (actor) with proper function approximations of J and q. In this section, we present these actor critic, q-learning algorithms when the normalizing constant involved in the Gibbs measure is available or computable.12 4.1 q-Learning algorithms Given a policy π, to learn its associated value function and q-function, we denote by Jθ and qψ the parameterized function approximators that satisfy the two constraints: Jθ(T, x) = h(x), Z qψ(t, x, a) γ log π(a|t, x) π(a|t, x)da = 0, (29) for all θ Θ RLθ, ψ Ψ RLψ. Then Theorem 2 suggests that the policy can be improved by πψ(a|t, x) = exp{ 1 γ qψ(t, x, a)} R γ qψ(t, x, a)}da, (30) while by assumption the normalizing constant R γ qψ(t, x, a)}da for any ψ Ψ can be explicitly computed. Next, we can use the data generated by πψ and apply the martingale condition in Theorem 7 to learn its value function and q-function, leading to an iterative actor critic algorithm. 12. Conventional Q-learning or SARSA is often said to be value-based in which the Q-function is a state action value function serving as a critic. In q-learning, as Theorem 6 stipulates, the q-function is uniquely determined by the value function J as its compensator. Hence, q plays a dual role here: it is both a critic (as it can be derived endogenously from the value function) and an actor (as it derives an improved policy). This is why we call the q-learning algorithms actor critic, if with a slight abuse of the term because the actor here is not purely exogenous as with, say, policy gradient. 
Alternatively, we may choose to directly learn the optimal value function and q-function based on Theorem 9. In this case, the approximators $J^\theta$ and $q^\psi$ should satisfy

$$J^\theta(T,x) = h(x),\qquad \int_A \exp\Big\{\frac{1}{\gamma} q^\psi(t,x,a)\Big\}\mathrm{d}a = 1. \tag{31}$$

Note that if the policy $\pi$ in (29) is taken in the form (30), then the second equation in (31) is automatically satisfied. Henceforth in this section, we focus on deriving algorithms based on Theorem 9 and hence always impose the constraints (31). In this case, any approximator of the q-function directly gives that of the policy via $\pi^\psi(a|t, x) = \exp\{\frac{1}{\gamma}q^\psi(t, x, a)\}$; thereby we avoid learning the q-function and the policy separately. Moreover, making use of (31) typically results in a more specific parametric form of the q-function approximator $q^\psi$, potentially facilitating more efficient learning. Here is an example.

Example 1 When the system dynamics are linear in the action $a$ and the reward is quadratic in $a$, the Hamiltonian, and hence the q-function, is quadratic in $a$. So we can parameterize

$$q^\psi(t,x,a) = -\tfrac{1}{2}\big(a - q^\psi_1(t,x)\big)^\top q^\psi_2(t,x)\big(a - q^\psi_1(t,x)\big) + q^\psi_0(t,x),\quad (t,x,a)\in[0,T]\times\mathbb{R}^d\times\mathbb{R}^m,$$

with $q^\psi_1(t, x)\in\mathbb{R}^m$, $q^\psi_2(t, x)\in\mathbb{S}^m_{++}$, and $q^\psi_0(t, x)\in\mathbb{R}$. The corresponding policy is a multivariate normal distribution for which the normalizing constant can be computed:

$$\pi^\psi(\cdot\,|t,x) = N\Big(q^\psi_1(t,x),\ \gamma\,\big(q^\psi_2(t,x)\big)^{-1}\Big),$$

with entropy $-\frac{1}{2}\log\det q^\psi_2(t,x) + \frac{m}{2}\log 2\pi e\gamma$. The second constraint on $q^\psi$ in (31) then yields

$$q^\psi_0(t,x) = \frac{\gamma}{2}\log\det q^\psi_2(t,x) - \frac{m\gamma}{2}\log 2\pi\gamma.$$

This in turn gives rise to the more specific parametric form

$$q^\psi(t,x,a) = -\tfrac{1}{2}\big(a - q^\psi_1(t,x)\big)^\top q^\psi_2(t,x)\big(a - q^\psi_1(t,x)\big) + \frac{\gamma}{2}\log\det q^\psi_2(t,x) - \frac{m\gamma}{2}\log 2\pi\gamma.$$

The next step in algorithm design is to update $(\theta, \psi)$ by enforcing the martingale condition stipulated in Theorem 9 and applying the techniques developed in Jia and Zhou (2022a). A number of algorithms can be developed based on two types of objectives: to minimize the martingale loss function or to satisfy the martingale orthogonality conditions.
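The normalization in Example 1 can be checked numerically in one dimension ($m = 1$). The coefficient values below are illustrative; with the constant term fixed by the constraint (31), $\exp\{q^\psi/\gamma\}$ is itself a Gaussian density and integrates to one.

```python
import numpy as np

# Sketch of Example 1 with m = 1: a quadratic-in-action q-function whose
# constant term is fixed by the normalization constraint (31), so that
# exp(q/gamma) is the N(q1, gamma/q2) density.  Coefficients are illustrative.

gamma = 0.3
q1, q2 = 0.5, 2.0                  # hypothetical mean and curvature coefficients

def q_psi(a):
    # constant term chosen so that the integral of exp(q/gamma) over a is 1
    q0 = 0.5*gamma*np.log(q2) - 0.5*gamma*np.log(2*np.pi*gamma)
    return -0.5*q2*(a - q1)**2 + q0

a = np.linspace(q1 - 10.0, q1 + 10.0, 200001)
da = a[1] - a[0]
Z = np.sum(np.exp(q_psi(a)/gamma)) * da
print(Z)    # ~ 1: exp(q/gamma) is a probability density in a
```

This is why, under the constraint (31), the q-function approximator doubles as the policy: sampling an action reduces to drawing from $N(q^\psi_1, \gamma (q^\psi_2)^{-1})$.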
The latter calls for solving a system of equations for which there are further two different techniques: applying stochastic approximation as in the classical temporal difference (TD) algorithms or minimizing a quadratic function in lieu of the system of equations as in the generalized methods of moment (GMM). For reader s convenience, we summarize these methods in the q-learning context below. Minimize the martingale loss function: e β(T t)h(Xπψ T ) Jθ(t, Xπψ t ) + Z T t e β(s t)[r(s, Xπψ s , aπψ s ) qψ(s, Xπψ s , aπψ s )]ds Continuous-Time q-Learning This method is intrinsically offline because the loss function involves the whole horizon [0, T]; however we are free to choose optimization algorithms to update the parameters (θ, ψ). For example, we can apply stochastic gradient decent to update θ (t, Xπψ t )Gt:T dt t e β(s t) qψ ψ (s, Xπψ s , aπψ s )ds Gt:T dt, where Gt:T = e β(T t)h(Xπψ T ) Jθ(t, Xπψ t )+ R T t e β(s t)[r(s, Xπψ s , aπψ s ) qψ(s, Xπψ s , aπψ s )]ds, and αθ and αψ are the learning rates. We present Algorithm 1 based on this updating rule. Note that this algorithm is analogous to the classical gradient Monte Carlo method or TD(1) for MDPs (Sutton and Barto, 2018) because full sample trajectories are used to compute gradients. Choose two different test functions ξt and ζt that are both Ft-adapted vector-valued stochastic processes, and consider the following system of equations in (θ, ψ): 0 ξt h d Jθ(t, Xπψ t ) + r(t, Xπψ t , aπψ t )dt qψ(t, Xπψ t , aπψ t )dt βJθ(t, Xπψ t )dt i = 0, 0 ζt h d Jθ(t, Xπψ t ) + r(t, Xπψ t , aπψ t )dt qψ(t, Xπψ t , aπψ t )dt βJθ(t, Xπψ t )dt i = 0. 
To solve these equations iteratively, we use stochastic approximation to update (θ, ψ) either offline by13 0 ξt h d Jθ(t, Xπψ t ) + r(t, Xπψ t , aπψ t )dt qψ(t, Xπψ t , aπψ t )dt βJθ(t, Xπψ t )dt i , 0 ζt h d Jθ(t, Xπψ t ) + r(t, Xπψ t , aπψ t )dt qψ(t, Xπψ t , aπψ t )dt βJθ(t, Xπψ t )dt i , or online by θ θ + αθξt h d Jθ(t, Xπψ t ) + r(t, Xπψ t , aπψ t )dt qψ(t, Xπψ t , aπψ t )dt βJθ(t, Xπψ t )dt i , ψ ψ + αψζt h d Jθ(t, Xπψ t ) + r(t, Xπψ t , aπψ t )dt qψ(t, Xπψ t , aπψ t )dt βJθ(t, Xπψ t )dt i . Typical choices of test functions are ξt = Jθ θ (t, Xπψ t ), ζt = qψ ψ (t, Xπψ t , aπψ t ) yielding algorithms that would be closest to TD-style policy evaluation and Q-learning (e.g., in Tallec et al. 2019) and at the same time belong to the more general semi-gradient methods (Sutton and Barto, 2018). We present the online and offline q-learning algorithms, Algorithms 2 and 3 respectively, based on these test functions. We must, however, 13. Here, when implementing in a computer program, d Jθ is the timestep-wise difference in Jθ when time is discretized, or indeed it is the temporal-difference (TD) of the value function approximator Jθ. Jia and Zhou stress that testing against these two specific functions is theoretically not sufficient to guarantee the martingale condition. Using them with a rich, large-dimensional parametric family for J and q makes approximation to the martingale condition finer and finer. Moreover, the corresponding stochastic approximation algorithms are not guaranteed to converge in general, and the test functions have to be carefully selected (Jia and Zhou, 2022a) depending on (J, q). Some of the different test functions leading to different types of algorithms in the context of policy evaluation are discussed in Jia and Zhou (2022a). 
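The online updating rule with the semi-gradient test functions $\xi_t = \frac{\partial J^\theta}{\partial\theta}$ and $\zeta_t = \frac{\partial q^\psi}{\partial\psi}$ can be sketched for linear function approximators. Everything below is illustrative: the features, the simulated dynamics, the reward rate, and the coefficients are hand-made stand-ins rather than the paper's setting, and no convergence is claimed (consistent with the caveat above about these specific test functions).

```python
import numpy as np

# Minimal sketch of the online (TD-style) update: one incremental step of
# theta and psi per observed transition, using xi = dJ/dtheta and
# zeta = dq/dpsi for linear approximators on hand-crafted features.
# Environment, features, and coefficients are all illustrative.

rng = np.random.default_rng(0)

def feat_J(t, x):                  # features for J^theta(t, x)
    return np.array([1.0, t, x, x*x])

def feat_q(t, x, a):               # features for q^psi(t, x, a)
    return np.array([1.0, a, a*a, x*a])

theta = np.zeros(4)
psi = np.zeros(4)
beta, dt, alpha = 0.1, 0.01, 0.05

x, t = 0.0, 0.0
for _ in range(200):
    a = rng.normal()                                   # behavior action
    r = -(x*x + a*a)                                   # illustrative reward rate
    x_next = x + a*dt + 0.1*np.sqrt(dt)*rng.normal()   # simulated transition
    # temporal difference of J, plus reward, q, and discount corrections:
    delta = (feat_J(t+dt, x_next) - feat_J(t, x)) @ theta \
            + r*dt - (feat_q(t, x, a) @ psi)*dt - beta*(feat_J(t, x) @ theta)*dt
    theta = theta + alpha * feat_J(t, x) * delta       # xi_t  = dJ/dtheta
    psi   = psi   + alpha * feat_q(t, x, a) * delta    # zeta_t = dq/dpsi
    x, t = x_next, t + dt

print(theta, psi)
```

The scalar `delta` is exactly the bracketed term in the update rules above, evaluated on one discretized transition.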
Choose the same types of test functions $\xi_t$ and $\zeta_t$ as above, but now minimize the GMM objective functions

$$\mathbb{E}^{\mathbb{P}}\Big[\int_0^T \xi_t\,\mathrm{d}M_t\Big]^\top A_\theta\,\mathbb{E}^{\mathbb{P}}\Big[\int_0^T \xi_t\,\mathrm{d}M_t\Big],\qquad \mathbb{E}^{\mathbb{P}}\Big[\int_0^T \zeta_t\,\mathrm{d}M_t\Big]^\top A_\psi\,\mathbb{E}^{\mathbb{P}}\Big[\int_0^T \zeta_t\,\mathrm{d}M_t\Big],$$

where we write $\mathrm{d}M_t := \mathrm{d}J^\theta(t, X^{\pi^\psi}_t) + \big[r(t, X^{\pi^\psi}_t, a^{\pi^\psi}_t) - q^\psi(t, X^{\pi^\psi}_t, a^{\pi^\psi}_t) - \beta J^\theta(t, X^{\pi^\psi}_t)\big]\mathrm{d}t$ for the same bracketed term as above, and $A_\theta\in\mathbb{S}^{L_\theta}_{++}$, $A_\psi\in\mathbb{S}^{L_\psi}_{++}$. Typical choices of these matrices are $A_\theta = I_{L_\theta}$ and $A_\psi = I_{L_\psi}$, or $A_\theta = (\mathbb{E}^{\mathbb{P}}[\int_0^T \xi_t\xi_t^\top \mathrm{d}t])^{-1}$ and $A_\psi = (\mathbb{E}^{\mathbb{P}}[\int_0^T \zeta_t\zeta_t^\top \mathrm{d}t])^{-1}$. Again, we refer the reader to Jia and Zhou (2022a) for discussions on these choices and the connection with the classical GTD algorithms and the GMM method.

4.2 Connections with SARSA

With a fixed time-step size $\Delta t$, we can define our Q-function $Q_{\Delta t}(t, x, a; \pi)$ by (15), parameterize it by $Q^\phi_{\Delta t}(t, x, a; \pi)$, and then apply existing (big) Q-learning algorithms such as SARSA (cf. Sutton and Barto 2018, Chapter 10) to learn the parameter $\phi$. Now, how does this $\Delta t$-based Q-learning compare with q-learning? Equation (15) suggests that $Q_{\Delta t}(t, x, a; \pi)$ can be approximated by

$$Q_{\Delta t}(t,x,a;\pi) \approx J(t,x;\pi) + q(t,x,a;\pi)\,\Delta t.$$

Our q-learning method learns separately the zeroth-order and first-order terms of $Q_{\Delta t}(t, x, a; \pi)$ in $\Delta t$, and the two terms are in themselves independent of $\Delta t$. Our approach therefore has a true continuous-time nature without having to rely on $\Delta t$ or any time discretization, which theoretically facilitates the analysis carried out in the previous section and algorithmically mitigates the high sensitivity with respect to time discretization. Our approach is similar to that introduced in Baird (1994) and Tallec et al. (2019), where the value function and the (rescaled) advantage function are approximated separately.
Continuous-Time q-Learning Algorithm 1 Offline Episodic q-Learning ML Algorithm Inputs: initial state x0, horizon T, time step t, number of episodes N, number of mesh grids K, initial learning rates αθ, αψ and a learning rate schedule function l( ) (a function of the number of episodes), functional forms of parameterized value function Jθ( , ) and q-function qψ( , , ) satisfying (31), and temperature parameter γ. Required program (on-policy): environment simulator (x , r) = Environment t(t, x, a) that takes current time state pair (t, x) and action a as inputs and generates state x at time t + t and instantaneous reward r at time t as outputs. Policy πψ(a|t, x) = exp{ 1 γ qψ(t, x, a)}. Required program (off-policy): observations {atk, rtk, xtk+1}k=0, ,K 1 {xt K, h(xt K)} = Observation( t) including the observed actions, rewards, and state trajectories under the given behavior policy at the sampling time grids with step size t. Learning procedure: Initialize θ, ψ. for episode j = 1 to N do Initialize k = 0. Observe initial state x0 and store xtk x0. {On-policy case while k < K do Generate action atk πψ( |tk, xtk). Apply atk to environment simulator (x, r) = Environment t(tk, xtk, atk), and observe new state x and reward r as output. Store xtk+1 x and rtk r. Update k k + 1. end while } {Off-policy case Obtain one observation {atk, rtk, xtk+1}k=0, ,K 1 {xt K, h(xt K)} = Observation( t). } For every k = 0, 1, , K 1, compute Gtk:T = e β(T tk)h(xt K) Jθ(tk, xtk) + i=k e β(ti tk)[rti qψ(ti, xti, ati)] t. Update θ and ψ by θ θ + l(j)αθ θ (tk, xtk)Gtk:T t. 
ψ ψ + l(j)αψ ψ (tk, xtk, atk) t Jia and Zhou Algorithm 2 Offline Episodic q-Learning Algorithm Inputs: initial state x0, horizon T, time step t, number of episodes N, number of mesh grids K, initial learning rates αθ, αψ and a learning rate schedule function l( ) (a function of the number of episodes), functional forms of parameterized value function Jθ( , ) and q-function qψ( , , ) satisfying (31), functional forms of test functions ξ(t, x t, a t) and ζ(t, x t, a t), and temperature parameter γ. Required program (on-policy): environment simulator (x , r) = Environment t(t, x, a) that takes current time state pair (t, x) and action a as inputs and generates state x at time t + t and instantaneous reward r at time t as outputs. Policy πψ(a|t, x) = exp{ 1 γ qψ(t, x, a)}. Required program (off-policy): observations {atk, rtk, xtk+1}k=0, ,K 1 {xt K, h(xt K)} = Observation( t) including the observed actions, rewards, and state trajectories under the given behavior policy at the sampling time grids with step size t. Learning procedure: Initialize θ, ψ. for episode j = 1 to N do Initialize k = 0. Observe initial state x0 and store xtk x0. {On-policy case while k < K do Generate action atk πψ( |tk, xtk). Apply atk to environment simulator (x, r) = Environment t(tk, xtk, atk), and observe new state x and reward r as outputs. Store xtk+1 x and rtk r. Update k k + 1. end while } {Off-policy case Obtain one observation {atk, rtk, xtk+1}k=0, ,K 1 {xt K, h(xt K)} = Observation( t). } For every k = 0, 1, , K 1, compute and store test functions ξtk = ξ(tk, xt0, , xtk, at0, , atk), ζtk = ζ(tk, xt0, , xtk, at0, , atk). Compute i=0 ξti Jθ(ti+1, xti+1) Jθ(ti, xti) + rti t qψ(ti, xti, ati) t βJθ(ti, xti) t , i=0 ζti Jθ(ti+1, xti+1) Jθ(ti, xti) + rti t qψ(ti, xti, ati) t βJθ(ti, xti) t . Update θ and ψ by θ θ + l(j)αθ θ. ψ ψ + l(j)αψ ψ. 
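The return computation at the heart of Algorithm 1 can be sketched on one recorded trajectory: for each grid time $t_k$, accumulate the discounted rewards net of q-values and add the discounted terminal payoff, minus the current value estimate. The trajectory arrays and parameter values below are illustrative, not outputs of the paper's experiments.

```python
import numpy as np

# Sketch of G_{t_k:T} from Algorithm 1 on one episode:
# G_k = exp(-beta*(T - t_k))*h(x_T) - J^theta(t_k, x_{t_k})
#       + sum_{i=k}^{K-1} exp(-beta*(t_i - t_k)) * (r_i - q_i) * dt
# All arrays are illustrative stand-ins for an observed trajectory.

beta, dt = 0.1, 0.05
K = 4
T = K * dt
r  = np.array([1.0, 0.8, 0.6, 0.4])      # observed rewards r_{t_i}
qv = np.array([0.2, 0.1, 0.0, -0.1])     # q^psi(t_i, x_{t_i}, a_{t_i})
Jk = np.array([1.8, 1.6, 1.5, 1.45])     # J^theta(t_k, x_{t_k})
hT = 2.0                                 # terminal payoff h(x_{t_K})

G = np.empty(K)
for k in range(K):
    tk = k * dt
    disc = np.exp(-beta * (np.arange(k, K)*dt - tk))   # e^{-beta(t_i - t_k)}
    G[k] = np.exp(-beta*(T - tk))*hT - Jk[k] + np.sum(disc * (r[k:] - qv[k:]) * dt)
print(G)
```

Each `G[k]` is the martingale-loss residual used in the gradient updates of $\theta$ and $\psi$ for grid time $t_k$.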
Continuous-Time q-Learning Algorithm 3 Online-Incremental q-Learning Algorithm Inputs: initial state x0, horizon T, time step t, number of mesh grids K, initial learning rates αθ, αψ and learning rate schedule function l( ) (a function of the number of episodes), functional forms of parameterized value function Jθ( , ) and q-function qψ( , , ) satisfying (31), functional forms of test functions ξ(t, x t, a t) and ζ(t, x t, a t), and temperature parameter γ. Required program (on-policy): environment simulator (x , r) = Environment t(t, x, a) that takes current time state pair (t, x) and action a as inputs and generates state x at time t + t and instantaneous reward r at time t as outputs. Policy πψ(a|t, x) = exp{ 1 γ qψ(t, x, a)}. Required program (off-policy): observations {a, r, x } = Observation(t, x; t) including the observed actions, rewards, and state when the current time-state pair is (t, x) under the given behavior policy at the sampling time grids with step size t. Learning procedure: Initialize θ, ψ. for episode j = 1 to do Initialize k = 0. Observe initial state x0 and store xtk x0. while k < K do {On-policy case Generate action atk πψ( |tk, xtk). Apply atk to environment simulator (x, r) = Environment t(tk, xtk, atk), and observe new state x and reward r as outputs. Store xtk+1 x and rtk r. } {Off-policy case Obtain one observation atk, rtk, xtk+1 = Observation(tk, xtk; t). } Compute test functions ξtk = ξ(tk, xt0, , xtk, at0, , atk), ζtk = ζ(tk, xt0, , xtk, at0, , atk). Compute δ = Jθ(tk+1, xtk+1) Jθ(tk, xtk) + rtk t qψ(tk, xtk, atk) t βJθ(tk, xtk) t, Update θ and ψ by θ θ + l(j)αθ θ. ψ ψ + l(j)αψ ψ. Update k k + 1 end while end for Jia and Zhou However, as pointed out already, as much as the two approaches may lead to certain similar algorithms, they are different conceptually. The ones in Baird (1994) and Tallec et al. (2019) are still based on discrete-time MDPs and hence depend on the size of time-discretization. 
By contrast, the value function and q-function in q-learning are well-defined quantities in continuous time and independent of any time-discretization. On the other hand, approximating the value function by Jθ and the q-function by qψ separately yields a specific approximator of the Q-function by Qθ,ψ t (t, x, a) Jθ(t, x) + qψ(t, x, a) t. (32) With this Q-function approximator Qθ,ψ, one of our q-learning based algorithms actually recovers (a modification of) SARSA, one of the most well-known conventional Q-learning algorithm. To see this, consider state Xt at time t. Then the conventional SARSA with the approximator Qθ,ψ updates parameters (θ, ψ) by Qθ,ψ t (t + t, Xaπψ t t+ t, aπψ t+ t) γ log πψ(aπψ t+ t|t + t, Xaπψ t t+ t) t Qθ,ψ t (t, Xt, aπψ t ) + r(t, Xt, aπψ t ) t βQθ,ψ t (t, Xt, aπψ t ) t Qθ,ψ t θ (t, Xt, aπψ t ), Qθ,ψ t (t + t, Xaπψ t t+ t, aπψ t+ t) γ log πψ(aπψ t+ t|t + t, Xaπψ t t+ t) t Qθ,ψ t (t, Xt, aπψ t ) + r(t, Xt, aπψ t ) t βQθ,ψ t (t, Xt, aπψ t ) t Qθ,ψ t ψ (t, Xt, aπψ t ), (33) where aπψ t πψ( |t, Xt) and aπψ t+ t πψ( |t + t, Xat t+ t). Note that by (32), we have Qθ,ψ t θ Jθ t θ , Qθ,ψ t ψ qψ t ψ t. This means that Qθ,ψ t ψ has an additional order of t when compared with Qθ,ψ t θ , resulting in a much slower rate of update on ψ than θ. This observation is also made by Tallec et al. (2019), who suggest as a remedy modifying the update rule for Q-function by replacing the partial derivative of the Q-function with that of the advantage function. In our setting, this means we change Qθ,ψ t ψ to qψ t ψ in the second update rule of (33). 
Continuous-Time q-Learning Now, the terms in the brackets in (33) can be further rewritten as: Qθ,ψ t (t + t, Xaπψ t t+ t, aπψ t+ t) γ log πψ(aπψ t+ t|t + t, Xaπψ t t+ t) t Qθ,ψ t (t, Xt, aπψ t ) + r(t, Xt, aπψ t ) t βQθ,ψ t (t, Xt, aπψ t ) t =Jθ(t + t, Xaπψ t t+ t) Jθ(t, Xt) + [qψ(t + t, Xaπψ t t+ t, aπψ t+ t) qψ(t, Xt, aπψ t )] t + r(t, Xt, aπψ t ) t βJθ(t, Xt) t βqψ(t, Xt, aπψ t )( t)2 γ log πψ(aπψ t+ t|t + t, Xaπψ t t+ t) t =Jθ(t + t, Xaπψ t t+ t) Jθ(t, Xt) qψ(t, Xt, aπψ t ) t + r(t, Xt, aπψ t ) t βJθ(t, Xt) t + qψ(t + t, Xaπψ t t+ t, aπψ t+ t) γ log πψ(aπψ t+ t|t + t, Xaπψ t t+ t) t βqψ(t, Xt, aπψ t )( t)2 d Jθ(t, Xt) qψ(t, Xt, aπψ t )dt + r(t, Xt, aπψ t )dt βJθ(t, Xt)dt + qψ(t + t, Xaπψ t t+ t, aπψ t+ t) γ log πψ(aπψ t+ t|t + t, Xaπψ t t+ t) | {z } mean-0 term due to (29) where in the last step we drop the higher-order small term ( t)2 while replacing the difference terms with differential ones. Comparing (33) with the updating rule in Algorithm 3 using test functions ξt = Jθ θ (t, Xπψ t ) and ζt = qψ ψ (t, Xπψ t , aπψ t ), we find that the modified SARSA in Tallec et al. (2019) and Algorithm 3 differ only by a term whose mean is 0 due to the constraint (29) on the qfunction. This term is solely driven by the action randomization at time t + t. Hence, the algorithm in Tallec et al. (2019) is noisier and may be slower in convergence speed compared with the q-learning one. In Section 7, we will numerically compare the results of the above Q-learning SARSA and q-learning. 5. q-Learning Algorithms When Normalizing Constant Is Unavailable Theorem 7 assumes that the normalizing constant in the Gibbs measure is available so that one can enforce the constraint (21). So does Theorem 9 because the second constraint in (27) is exactly to verify the normalizing constant to be 1. But it is well known that in most high-dimensional cases computing this constant is a daunting, and often impossible task. 
5.1 A stronger policy improvement theorem

To overcome the difficulty arising from an unavailable normalizing constant in soft Q-learning, Haarnoja et al. (2018b) propose a general method of using a family of stochastic policies whose densities can be easily computed to approximate the target policy. In our current setting, denote by {π^φ(·|t, x)}_{φ∈Φ} the family of density functions of some tractable distributions, e.g., the multivariate normal distribution π^φ(·|t, x) = N(µ^φ(t, x), Σ^φ(t, x)), where µ^φ(t, x) ∈ R^m and Σ^φ(t, x) ∈ S^m_{++}. The learning procedure then proceeds as follows: starting from a policy π^φ within this family, project the desired policy

    exp{(1/γ) q(t, x, ·; π^φ)} / ∫_A exp{(1/γ) q(t, x, a; π^φ)} da

onto the set {π^{φ′}(·|t, x)}_{φ′∈Φ} by solving

    min_{φ′∈Φ} D_KL( π^{φ′}(·|t, x) ‖ exp{(1/γ) q(t, x, ·; π^φ)} / ∫_A exp{(1/γ) q(t, x, a; π^φ)} da ),

which, because the normalizing constant does not depend on φ′, has the same minimizer as

    min_{φ′∈Φ} D_KL( π^{φ′}(·|t, x) ‖ exp{(1/γ) q(t, x, ·; π^φ)} ),

where D_KL(f ‖ g) := ∫_A f(a) log (f(a)/g(a)) da is the Kullback-Leibler (KL) divergence of two positive functions f, g with the same support on A, with f ∈ P(A) a probability density function on A. This procedure may not eventually lead to the desired target policy, but the newly defined policy still provably improves the previous one, as indicated in the following stronger version of the policy improvement theorem that generalizes Theorem 2.

Theorem 10 Given (t, x) ∈ [0, T] × R^d, if two policies π′ ∈ Π and π ∈ Π satisfy

    D_KL( π′(·|t, x) ‖ exp{(1/γ) H(t, x, ·, ∂J/∂x(t, x; π), ∂²J/∂x²(t, x; π))} ) ≤ D_KL( π(·|t, x) ‖ exp{(1/γ) H(t, x, ·, ∂J/∂x(t, x; π), ∂²J/∂x²(t, x; π))} ),

then J(t, x; π′) ≥ J(t, x; π).

Theorem 10 in itself is a general result comparing two given policies, not necessarily within any tractable family of densities. However, an implication of the result is that, given a current tractable policy π^φ, as long as we update it to a new tractable policy π^{φ′} with

    D_KL( π^{φ′}(·|t, x) ‖ exp{(1/γ) q(t, x, ·; π^φ)} ) ≤ D_KL( π^φ(·|t, x) ‖ exp{(1/γ) q(t, x, ·; π^φ)} ),

then π^{φ′} improves upon π^φ.
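To make the projection concrete, the following sketch minimizes the KL divergence to the Gibbs target over a one-dimensional Gaussian family by sampled gradient descent. The quadratic q-function and all numerical values here are hypothetical, chosen so that the target Gibbs policy is exactly N(a*, γ) and convergence can be checked.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.1          # temperature
alpha_phi = 0.05     # learning rate
dt = 0.01

# Hypothetical target: q(a) = -(a - a_star)^2 / 2, whose Gibbs policy
# exp{q/gamma}/Z is N(a_star, gamma). We project onto the Gaussian family
# N(mu, gamma) parameterized by phi = mu.
a_star = 0.7

def q(a):
    return -(a - a_star) ** 2 / 2.0

mu = 0.0  # phi, initialized away from the target
for _ in range(20000):
    a = mu + np.sqrt(gamma) * rng.standard_normal()
    # log pi_phi(a) up to a phi-independent constant, and its phi-gradient;
    # dropping the constant leaves the gradient estimate unbiased
    log_pi = -(a - mu) ** 2 / (2 * gamma)
    grad_log_pi = (a - mu) / gamma
    # stochastic KL-descent step:
    # phi <- phi - gamma * alpha * dt * [log pi - q/gamma] * grad log pi
    mu -= gamma * alpha_phi * dt * (log_pi - q(a) / gamma) * grad_log_pi

print(mu)  # approaches a_star = 0.7
```

Because the family contains the target here, the projection recovers it exactly; in general the projection only yields the improved-but-not-optimal policy that Theorem 10 guarantees.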
Thus, to update π^φ it suffices to solve the optimization problem

    min_{φ′∈Φ} D_KL( π^{φ′}(·|t, x) ‖ exp{(1/γ) q(t, x, ·; π^φ)} ) = min_{φ′∈Φ} ∫_A [ log π^{φ′}(a|t, x) − (1/γ) q(t, x, a; π^φ) ] π^{φ′}(a|t, x) da.

The gradient of the above objective function in φ′ is

    ∂/∂φ′ ∫_A [ log π^{φ′}(a|t, x) − (1/γ) q(t, x, a; π^φ) ] π^{φ′}(a|t, x) da
    = ∫_A [ log π^{φ′}(a|t, x) − (1/γ) q(t, x, a; π^φ) ] ∂π^{φ′}/∂φ′ (a|t, x) da + ∫_A (∂/∂φ′ log π^{φ′}(a|t, x)) π^{φ′}(a|t, x) da
    = ∫_A [ log π^{φ′}(a|t, x) − (1/γ) q(t, x, a; π^φ) ] (∂/∂φ′ log π^{φ′}(a|t, x)) π^{φ′}(a|t, x) da,

where we have noted

    ∫_A (∂/∂φ′ log π^{φ′}(a|t, x)) π^{φ′}(a|t, x) da = ∫_A ∂π^{φ′}/∂φ′ (a|t, x) da = ∂/∂φ′ ∫_A π^{φ′}(a|t, x) da = 0.

Therefore, we may update φ incrementally at each step by

    φ ← φ − γ α_φ dt [ log π^φ(a^{π^φ}_t | t, X_t) − (1/γ) q(t, X_t, a^{π^φ}_t; π^φ) ] ∂/∂φ log π^φ(a^{π^φ}_t | t, X_t),    (34)

where we choose the step size to be specifically γ α_φ dt, for a reason that will become evident momentarily. The above updating rule seems to indicate that we need to learn the q-function q(t, x, a; π^φ) associated with the policy π^φ. This is doable in view of Theorem 7, noting that {π^φ(·|t, x)}_{φ∈Φ} is a tractable family of distributions so that the constraint (21) or (29) can be computed and enforced. However, we do not need to do so, given that (34) only requires the q-value along the trajectory {(t, X_t, a^{π^φ}_t); 0 ≤ t ≤ T}, instead of the full functional form q(·, ·, ·; π^φ). The former is easier to compute by temporal-difference learning, as will be evident in the next subsection.

5.2 Connections with policy gradient

Applying Itô's lemma to J(·, ·; π) and recalling Definition 4, we have the following important relation between the q-function and the temporal difference of the value function:

    q(t, X^π_t, a^π_t; π) dt = dJ(t, X^π_t; π) + r(t, X^π_t, a^π_t) dt − β J(t, X^π_t; π) dt + {···} dW̃_t.    (35)

Plugging the above into (34) and ignoring the martingale difference term {···} dW̃_t whose mean is 0, the gradient-based updating rule for φ becomes

    φ ← φ + α_φ [ −γ log π^φ(a^{π^φ}_t | t, X^{π^φ}_t) dt + dJ(t, X^{π^φ}_t; π^φ) + r(t, X^{π^φ}_t, a^{π^φ}_t) dt − β J(t, X^{π^φ}_t; π^φ) dt ] ∂/∂φ log π^φ(a^{π^φ}_t | t, X^{π^φ}_t).
(36)

In this updating rule, we only need to approximate J(·, ·; π^φ), which is a PE problem. Specifically, denote by J^θ(·, ·), θ ∈ Θ, the function approximator of J(·, ·; π^φ); we may apply any PE method developed in Jia and Zhou (2022a) to learn J^θ. Given this value function approximation, the updating rule (36) further specializes to

    φ ← φ + α_φ [ −γ log π^φ(a^{π^φ}_t | t, X^{π^φ}_t) dt + dJ^θ(t, X^{π^φ}_t) + r(t, X^{π^φ}_t, a^{π^φ}_t) dt − β J^θ(t, X^{π^φ}_t) dt ] ∂/∂φ log π^φ(a^{π^φ}_t | t, X^{π^φ}_t).    (37)

Expression (37) coincides with the updating rule in the policy gradient method established in Jia and Zhou (2022b) when the regularizer therein is taken to be the entropy. Moreover, since we learn the value function and the policy simultaneously, the updating rule leads to actor-critic type algorithms. Finally, with this method, while learning the value function may involve martingale conditions as in Jia and Zhou (2022a), learning the policy does not. Schulman et al. (2017) note the equivalence between soft Q-learning and PG in discrete time. Here we present the continuous-time counterpart of that equivalence, now with a theoretical justification, which recovers the PG-based algorithms in Jia and Zhou (2022b) when combined with suitable PE methods.

6. Extension to Ergodic Tasks

We now extend the previous study to ergodic tasks, in which the objective is to maximize the long-run average over the infinite time horizon [0, ∞), the functions b, σ and r do not depend on time t explicitly, and h = 0. The set of admissible (stationary) policies can be similarly defined in a straightforward manner. The regularized ergodic objective function is the long-run average

    liminf_{T→∞} (1/T) E^{P^{W̃}}[ ∫_t^T ∫_A [r(X̃^π_s, a) − γ log π(a|X̃^π_s)] π(a|X̃^π_s) da ds | X̃^π_t = x ]
    = liminf_{T→∞} (1/T) E^P[ ∫_t^T [r(X^π_s, a^π_s) − γ log π(a^π_s|X^π_s)] ds | X^π_t = x ],

where γ > 0 is the temperature parameter. We first present the ergodic version of the Feynman-Kac formula and the corresponding policy improvement theorem.
Lemma 11 Let π = π(·|·) be a given admissible (stationary) policy.^14 Suppose there is a function J(·; π) ∈ C²(R^d) with polynomial growth and a scalar V(π) ∈ R satisfying

    ∫_A [ H(x, a, ∂J/∂x(x; π), ∂²J/∂x²(x; π)) − γ log π(a|x) ] π(a|x) da − V(π) = 0,  ∀x ∈ R^d.    (38)

Then for any t ≥ 0,

    V(π) = liminf_{T→∞} (1/T) E^{P^{W̃}}[ ∫_t^T ∫_A [r(X̃^π_s, a) − γ log π(a|X̃^π_s)] π(a|X̃^π_s) da ds | X̃^π_t = x ]
         = liminf_{T→∞} (1/T) E^P[ ∫_t^T [r(X^π_s, a^π_s) − γ log π(a^π_s|X^π_s)] ds | X^π_t = x ].    (39)

Moreover, if two admissible policies π′ and π are such that for all x ∈ R^d,

    D_KL( π′(·|x) ‖ exp{(1/γ) H(x, ·, ∂J/∂x(x; π), ∂²J/∂x²(x; π))} ) ≤ D_KL( π(·|x) ‖ exp{(1/γ) H(x, ·, ∂J/∂x(x; π), ∂²J/∂x²(x; π))} ),

then V(π′) ≥ V(π).

We emphasize that the solution to (38) is a pair (J, V), where J is a function of the state and V ∈ R is a scalar. The long-run average of the payoff does not depend on the initial state x or the initial time t due to ergodicity, and hence is a constant, as (39) implies. The function J, on the other hand, represents the first-order approximation of the long-run average and is not unique: indeed, for any constant c, (J + c, V) is also a solution to (38). We refer to V as the value of the underlying problem and still refer to J as the value function. Lastly, since the value does not depend on the initial time, we will fix the latter as 0 in the following discussions and applications of ergodic learning tasks.

14. For such ergodic tasks, admissibility typically also requires that the process X^π be ergodic (cf. Meyn and Tweedie 1993).
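The non-uniqueness of J can be verified in one line: (38) depends on J only through its first and second derivatives in x, so substituting J + c for any constant c leaves every term unchanged,

```latex
\int_A \Big[ H\Big(x, a, \tfrac{\partial (J+c)}{\partial x}(x;\pi),
  \tfrac{\partial^2 (J+c)}{\partial x^2}(x;\pi)\Big)
  - \gamma \log \pi(a|x) \Big] \pi(a|x)\,\mathrm{d}a - V(\pi)
= \int_A \Big[ H\Big(x, a, \tfrac{\partial J}{\partial x}(x;\pi),
  \tfrac{\partial^2 J}{\partial x^2}(x;\pi)\Big)
  - \gamma \log \pi(a|x) \Big] \pi(a|x)\,\mathrm{d}a - V(\pi) = 0,
```

since ∂(J+c)/∂x = ∂J/∂x and ∂²(J+c)/∂x² = ∂²J/∂x². Hence (J + c, V(π)) solves (38) whenever (J, V(π)) does.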
As with the episodic case, for any admissible policy π we can define the Δt-parameterized Q-function as

    Q_Δt(x, a; π)
    = E^{P^{W̃}}[ ∫_t^{t+Δt} [r(X^a_s, a) − V(π)] ds + lim_{T→∞} E^P[ ∫_{t+Δt}^T [r(X^π_s, a^π_s) − γ log π(a^π_s|X^π_s) − V(π)] ds | X^a_{t+Δt} ] | X^π_t = x ]
    = E^{P^{W̃}}[ ∫_t^{t+Δt} [r(X^a_s, a) − V(π)] ds + J(X^a_{t+Δt}; π) | X^π_t = x ] − E^{P^{W̃}}[ lim_{T→∞} E^P[J(X^π_T; π) | X^a_{t+Δt}] | X^π_t = x ]
    = E^{P^{W̃}}[ ∫_t^{t+Δt} [r(X^a_s, a) − V(π)] ds + J(X^a_{t+Δt}; π) − J(x; π) | X^π_t = x ] + J(x; π) − E^{P^{W̃}}[ lim_{T→∞} E^P[J(X^π_T; π) | X^a_{t+Δt}] | X^π_t = x ]
    = E^{P^{W̃}}[ ∫_t^{t+Δt} [ H(X^a_s, a, ∂J/∂x(X^a_s; π), ∂²J/∂x²(X^a_s; π)) − V(π) ] ds | X^π_t = x ] + J(x; π) − E^{P^{W̃}}[ lim_{T→∞} E^P[J(X^π_T; π) | X^a_{t+Δt}] | X^π_t = x ]
    = J(x; π) + [ H(x, a, ∂J/∂x(x; π), ∂²J/∂x²(x; π)) − V(π) ] Δt − J̄ + O((Δt)²),

where we have assumed that lim_{T→∞} E^P[J(X^π_T; π) | X^π_t = x] = J̄ is a constant for any (t, x) ∈ [0, +∞) × R^d. The corresponding q-function is then defined as

    q(x, a; π) = H(x, a, ∂J/∂x(x; π), ∂²J/∂x²(x; π)) − V(π).

As a counterpart to Theorem 7, we have the following theorem characterizing the value function, the value, and the q-function for both on-policy and off-policy ergodic learning problems. A counterpart to Theorem 9 can be similarly established, and we leave the details to the reader.

Theorem 12 Let an admissible policy π, a number V̂, a function Ĵ ∈ C²(R^d) with polynomial growth, and a continuous function q̂ : R^d × A → R be given, satisfying

    lim_{T→∞} (1/T) E[Ĵ(X^π_T)] = 0,    ∫_A [ q̂(x, a) − γ log π(a|x) ] π(a|x) da = 0,  ∀x ∈ R^d.    (40)

(i) V̂, Ĵ and q̂ are respectively the value, the value function and the q-function associated with π if and only if for all x ∈ R^d, the following process

    Ĵ(X^π_t) + ∫_0^t [ r(X^π_s, a^π_s) − q̂(X^π_s, a^π_s) − V̂ ] ds    (41)

is an ({F_t}_{t≥0}, P)-martingale, where {X^π_t, 0 ≤ t < ∞} is the solution to (6) with X^π_0 = x.
(ii) If V̂, Ĵ and q̂ are respectively the value, the value function and the q-function associated with π, then for any admissible π′ and any x ∈ R^d, the following process

    Ĵ(X^{π′}_t) + ∫_0^t [ r(X^{π′}_u, a^{π′}_u) − q̂(X^{π′}_u, a^{π′}_u) − V̂ ] du    (42)

is an ({F_t}_{t≥0}, P)-martingale, where {X^{π′}_t, 0 ≤ t < ∞} is the solution to (6) under π′ with initial condition X^{π′}_0 = x.

(iii) If there exists an admissible π′ such that for all x ∈ R^d, (42) is an ({F_t}_{t≥0}, P)-martingale where X^{π′}_0 = x, then V̂, Ĵ and q̂ are respectively the value, the value function and the q-function associated with π.

Moreover, in any of the three cases above, if it holds further that π(a|x) = exp{(1/γ) q̂(x, a)} / ∫_A exp{(1/γ) q̂(x, a)} da, then π is the optimal policy and V̂ is the optimal value.

Based on Theorem 12, we can design corresponding q-learning algorithms for ergodic problems, in which we learn V, J^θ and q^ψ at the same time. These algorithms have natural connections with SARSA as well as with the PG-based algorithms developed in Jia and Zhou (2022b). The discussions are similar to those in Subsections 4.2 and 5.2 and hence are omitted here. An online algorithm for ergodic tasks is presented as an example; see Algorithm 4.

7. Applications

7.1 Mean-variance portfolio selection

We first review the formulation of the exploratory mean-variance portfolio selection problem, originally proposed by Wang and Zhou (2020) and later revisited by Jia and Zhou (2022b).^15 The investment universe consists of one risky asset (e.g., a stock index) and one risk-free asset (e.g., a savings account) whose risk-free interest rate is r. The price of the risky asset {S_t, 0 ≤ t ≤ T} is governed by a geometric Brownian motion with drift µ and volatility σ > 0, defined on a filtered probability space (Ω, F, P^W; {F^W_t}_{0≤t≤T}). An agent has a fixed investment horizon 0 < T < ∞ and an initial endowment x_0. A self-financing portfolio is represented by the real-valued adapted process a = {a_t, 0 ≤ t ≤ T}, where a_t
15. There is a vast literature on classical model-based (non-exploratory) continuous-time mean-variance models formulated as stochastic control problems; see, e.g., Zhou and Li (2000); Lim and Zhou (2002); Zhou and Yin (2003) and the references therein.

Algorithm 4 q-Learning Algorithm for Ergodic Tasks
Inputs: initial state x_0, time step Δt, initial learning rates α_θ, α_ψ, α_V and learning rate schedule function l(·) (a function of time), functional forms of the parameterized value function J^θ(·) and q-function q^ψ(·, ·) satisfying (40), functional forms of test functions ξ and ζ (functions of time and the historical path of states and actions), and temperature parameter γ.
Required program (on-policy): an environment simulator (x′, r) = Environment_Δt(x, a) that takes the current state x and action a as inputs and generates a new state x′ after Δt and an instantaneous reward r as outputs. Policy π^ψ(a|x) = exp{(1/γ) q^ψ(x, a)} / ∫_A exp{(1/γ) q^ψ(x, a)} da.
Required program (off-policy): observations {a, r, x′} = Observation(x; Δt), including the observed actions, rewards, and states when the current state is x under the given behavior policy, at the sampling time grid with step size Δt.
Learning procedure:
  Initialize θ, ψ, V. Initialize k = 0. Observe the initial state x_0 and store x_{t_k} ← x_0.
  loop
    {On-policy case: Generate action a ∼ π^ψ(·|x). Apply a to the environment simulator (x′, r) = Environment_Δt(x, a), and observe the new state x′ and reward r. Store x_{t_{k+1}} ← x′.}
    {Off-policy case: Obtain one observation (a_{t_k}, r_{t_k}, x_{t_{k+1}}) = Observation(x_{t_k}; Δt).}
    Compute the test functions ξ_{t_k} = ξ(t_k, x_{t_0}, …, x_{t_k}, a_{t_0}, …, a_{t_k}) and ζ_{t_k} = ζ(t_k, x_{t_0}, …, x_{t_k}, a_{t_0}, …, a_{t_k}).
    Compute δ = J^θ(x′) − J^θ(x) + r Δt − q^ψ(x, a) Δt − V Δt and the corresponding increments Δθ, ΔV, Δψ.
    Update θ, V and ψ by θ ← θ + l(kΔt) α_θ Δθ, V ← V + l(kΔt) α_V ΔV, ψ ← ψ + l(kΔt) α_ψ Δψ.
    Update x ← x′ and k ← k + 1.
  end loop

Here a_t is the discounted dollar value invested in the risky asset at time t. Then her discounted wealth process follows

    dX̃_t = a_t[(µ − r) dt + σ dW̃_t] = a_t d(e^{−rt} S_t) / (e^{−rt} S_t),    X̃_0 = x_0.
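The market simulator required by the episodic algorithm can be realized by a single Euler-Maruyama step of the discounted wealth SDE above. The function below is a hypothetical stand-in for that simulator (the numerical values in the usage example are illustrative, not the paper's calibration):

```python
import numpy as np

def market_step(x, a, mu, r, sigma, dt, rng):
    """One Euler-Maruyama step of the discounted wealth SDE
    dX_t = a_t * [(mu - r) dt + sigma dW_t]; a hypothetical stand-in
    for the Market_dt simulator used by the episodic algorithm."""
    dW = np.sqrt(dt) * rng.standard_normal()
    return x + a * ((mu - r) * dt + sigma * dW)

# Simulate one episode of length T = 1 with dt = 1/25 under a constant action:
rng = np.random.default_rng(0)
x = 1.0
for _ in range(25):
    x = market_step(x, 0.5, 0.1, 0.02, 0.2, 1 / 25, rng)
print(x)  # terminal discounted wealth of this sample path
```

Note that only the wealth increment, not the stock price itself, needs to be simulated, since the SDE is driven directly by the excess return.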
The agent aims to minimize the variance of the discounted value of the portfolio at time T while maintaining the expected return at a given level; that is,

    min_a Var(X^a_T), subject to E[X^a_T] = z,    (43)

where z is the target expected terminal wealth, and the variance and expectation throughout this subsection are with respect to the probability measure P^W. The exploratory formulation of this problem with an entropy regularizer is equivalent to

    J(t, x; w) = max_π E^{P^{W̃}}[ −(X̃^π_T − w)² − γ ∫_t^T log π(a^π_s | s, X̃^π_s) ds | X̃^π_t = x ] + (w − z)²,    (44)
    dX̃^π_s = (µ − r) ( ∫_R a π(a|s, X̃^π_s) da ) ds + σ √( ∫_R a² π(a|s, X̃^π_s) da ) dW̃_s,  X̃^π_t = x,

where w is the Lagrange multiplier introduced to relax the expected return constraint; see Wang and Zhou (2020) for a derivation of this formulation. Note that here we artificially add the minus sign to transform the variance minimization into a maximization problem, to be consistent with our general formulation. We now follow Wang and Zhou (2020) and Jia and Zhou (2022b) to parameterize the value function by

    J^θ(t, x; w) = −(x − w)² e^{−θ₃(T−t)} + θ₂(t² − T²) + θ₁(t − T) + (w − z)².

Moreover, we parameterize the q-function by

    q^ψ(t, x, a; w) = −(e^{−ψ₁−ψ₃(T−t)}/2) (a + ψ₂(x − w))² − (γ/2)[log 2πγ + ψ₁ + ψ₃(T − t)],

which is derived to satisfy the constraint (29), as explained in Example 1. The policy associated with this parametric q-function is π^ψ(·|x; w) = N(−ψ₂(x − w), γ e^{ψ₁+ψ₃(T−t)}). In addition to θ = (θ₁, θ₂, θ₃) and ψ = (ψ₁, ψ₂, ψ₃), we also learn the Lagrange multiplier w by stochastic approximation in the same way as in Wang and Zhou (2020) and Jia and Zhou (2022b). The full algorithm is summarized in Algorithm 5.^16 We then compare by simulation the performance of three learning algorithms: the q-learning based Algorithm 5 in this paper, the PG-based Algorithm 4 in Jia and Zhou (2022b), and the Δt-parameterized Q-learning algorithm presented in Appendix B. We conduct simulations with the following configurations: µ ∈ {0, ±0.1, ±0.3, ±0.5}, σ ∈ {0.1, 0.2, 0.3, 0.4}, T = 1, x₀ = 1, z = 1.4.
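The parametric q-function above satisfies the normalization constraint (29) by construction: exp{q^ψ/γ} is exactly a Gaussian density, so q^ψ − γ log π^ψ vanishes pointwise. A quick numerical check (the values of ψ, w, x and t below are arbitrary illustrative choices; only γ = 0.1 and T = 1 come from the experiment setup):

```python
import numpy as np

gamma, T, t = 0.1, 1.0, 0.3
psi1, psi2, psi3, w, x = 0.2, 0.5, -0.1, 1.4, 1.1   # illustrative values

s = psi1 + psi3 * (T - t)
mean, var = -psi2 * (x - w), gamma * np.exp(s)       # policy N(mean, var)

def q(a):
    # parametric q-function q^psi(t, x, a; w)
    return (-0.5 * np.exp(-s) * (a + psi2 * (x - w)) ** 2
            - 0.5 * gamma * (np.log(2 * np.pi * gamma) + s))

def log_pi(a):
    # Gaussian log-density of pi^psi(.|t, x; w)
    return -0.5 * np.log(2 * np.pi * var) - (a - mean) ** 2 / (2 * var)

# q - gamma * log_pi is identically zero, so (29) holds for every (t, x):
a = np.linspace(mean - 3, mean + 3, 7)
print(np.max(np.abs(q(a) - gamma * log_pi(a))))  # ~ 0 up to rounding
```

This identity is what allows the Gibbs policy to be sampled in closed form without ever computing a normalizing constant numerically.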
Other tuning parameters in all the algorithms are chosen as γ = 0.1, m = 10, α_θ = α_ψ = 0.001, α_w = 0.005, and l(j) = 1/j^{0.51}. To have more realistic scenarios, we generate 20 years of training data and compare the three algorithms on the same data set for N = 20,000 episodes with a batch size of 32. More precisely, in each update 32 trajectories of length T are drawn from the training set to update the parameters to be learned, N times in total.^16

16. Here we present an offline algorithm as an example. Online algorithms can also be devised following the general study in the previous sections.

Algorithm 5 Offline Episodic q-Learning Mean-Variance Algorithm
Inputs: initial state x₀, horizon T, time step Δt, number of episodes N, number of time grids K, initial learning rates α_θ, α_ψ, α_w and learning rate schedule function l(·) (a function of the number of episodes), and temperature parameter γ.
Required program: a market simulator x′ = Market_Δt(t, x, a) that takes the current time-state pair (t, x) and action a as inputs and generates the state x′ at time t + Δt.
Learning procedure:
  Initialize θ, ψ, w.
  for episode j = 1 to N do
    Initialize k = 0. Observe the initial state x and store x_{t_k} ← x.
    while k < K do
      Generate action a_{t_k} ∼ π^ψ(·|t_k, x_{t_k}; w).
      Compute and store the test functions ξ_{t_k} = ∂J^θ/∂θ (t_k, x_{t_k}; w) and ζ_{t_k} = ∂q^ψ/∂ψ (t_k, x_{t_k}, a_{t_k}; w).
      Apply a_{t_k} to the market simulator x′ = Market_Δt(t_k, x_{t_k}, a_{t_k}), and observe the output new state x′. Store x_{t_{k+1}} ← x′.
      Update k ← k + 1.
    end while
    Store the terminal wealth X^{(j)}_T ← x_{t_K}.
    Compute
      Δθ = Σ_{i=0}^{K−1} ξ_{t_i} [ J^θ(t_{i+1}, x_{t_{i+1}}; w) − J^θ(t_i, x_{t_i}; w) − q^ψ(t_i, x_{t_i}, a_{t_i}; w) Δt ],
      Δψ = Σ_{i=0}^{K−1} ζ_{t_i} [ J^θ(t_{i+1}, x_{t_{i+1}}; w) − J^θ(t_i, x_{t_i}; w) − q^ψ(t_i, x_{t_i}, a_{t_i}; w) Δt ].
    Update θ and ψ by θ ← θ + l(j) α_θ Δθ and ψ ← ψ + l(j) α_ψ Δψ.
    Update w (Lagrange multiplier) every m episodes, as in Wang and Zhou (2020): if j ≡ 0 mod m then
      w ← w − α_w ( (1/m) Σ_{i=j−m+1}^{j} X^{(i)}_T − z ).
    end if
  end for
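The episode increments Δθ and Δψ computed in the inner loop of the algorithm above can be sketched as a single pass over the stored trajectory. The callables below (J, q and their test-function gradients) are user-supplied assumptions of this sketch, not the paper's implementation:

```python
def episode_increments(ts, xs, actions, J, q, grad_theta_J, grad_psi_q, dt):
    """Accumulate the episode increments used in Algorithm 5 (schematic):
    Delta_theta = sum_i xi_{t_i} * [J(t_{i+1}, x_{i+1}) - J(t_i, x_i)
                                    - q(t_i, x_i, a_i) * dt],
    and Delta_psi analogously with test function zeta."""
    d_theta, d_psi = 0.0, 0.0
    for i in range(len(actions)):
        # temporal-difference term of the martingale condition
        td = J(ts[i + 1], xs[i + 1]) - J(ts[i], xs[i]) - q(ts[i], xs[i], actions[i]) * dt
        d_theta += grad_theta_J(ts[i], xs[i]) * td          # xi test function
        d_psi += grad_psi_q(ts[i], xs[i], actions[i]) * td  # zeta test function
    return d_theta, d_psi

# Toy usage with scalar parameters and trivial test functions equal to 1:
d_theta, d_psi = episode_increments(
    [0.0, 0.5, 1.0], [1.0, 1.2, 0.9], [0.1, 0.2],
    lambda t, x: x, lambda t, x, a: 0.0,
    lambda t, x: 1.0, lambda t, x, a: 1.0, 0.5)
print(d_theta, d_psi)
```

With vector-valued parameters, the same loop applies componentwise, the test functions being the gradients ∂J^θ/∂θ and ∂q^ψ/∂ψ as in the algorithm.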
After training completes, for each market configuration we apply the learned policy out-of-sample 100 times independently and compute the average mean, variance and Sharpe ratio of the terminal wealth for each algorithm. We are particularly interested in the impact of time discretization, so we experiment with three different time discretization steps: Δt ∈ {1/25, 1/250, 1/2500}. Note that all the algorithms rely on time discretization at the implementation stage, where Δt is the sampling interval required for execution or for computing the numerical integrals in PG and q-learning. For Q-learning, Δt plays a dual role: it is both a parameter in the definition of the Δt-parameterized Q-function and the time discretization size in implementation. The numerical results are reported in Tables 1-3, each table corresponding to a different value of Δt. We observe that for any market configuration, Q-learning is almost always the worst performer in terms of all the metrics, and even diverges in certain high-volatility, low-return environments (when keeping the same learning rate as in the other cases). The other two algorithms have very close overall performance, although q-learning tends to outperform in highly volatile market environments. On the other hand, notably, the results change only slightly with different time discretizations, for all three algorithms. This observation does not align with the claim in Tallec et al. (2019) that classical Q-learning is very sensitive to time discretization, at least for this specific application problem. In addition, we notice that the q-learning and PG-based algorithms produce comparable results under most market scenarios, while the terminal variance produced by q-learning tends to be significantly higher than that of PG when |µ| is small or σ is large. In such cases, the ground-truth solution requires higher leverage in the risky asset in order to reach the high target expected return (40% annual return).
However, it appears that overall q-learning strives to keep up with the high target return and as a result ends up with high terminal variance in those cases. By contrast, PG tends to maintain a lower variance at the cost of significantly underachieving the target return.

7.2 Ergodic linear-quadratic control

Consider the ergodic LQ control problem in which the state responds to actions linearly:

    dX_t = (AX_t + Ba_t) dt + (CX_t + Da_t) dW_t,  X_0 = x_0,

and the goal is to maximize the long-run average quadratic payoff

    liminf_{T→∞} (1/T) E[ ∫_0^T r(X_t, a_t) dt | X_0 = x_0 ],  with r(x, a) = −( (M/2) x² + R x a + (N/2) a² + P x + Q a ).

The exploratory formulation of this problem with an entropy regularizer is equivalent to

    liminf_{T→∞} (1/T) E[ ∫_0^T [ r(X^π_t, a^π_t) − γ log π(a^π_t | X^π_t) ] dt | X^π_0 = x_0 ].

This problem falls into the general formulation of an ergodic task studied in Section 6, so we directly implement Algorithm 4 in our simulation. In particular, our function approximators for J and q are quadratic functions:

    J^θ(x) = θ₁x² + θ₂x,    q^ψ(x, a) = −(e^{−ψ₃}/2)(a − ψ₁x − ψ₂)² − (γ/2)(log 2πγ + ψ₃),

Table 1: Out-of-sample performance comparison in terms of mean, variance and Sharpe ratio when data are generated by a geometric Brownian motion with Δt = 1/25. We compare three algorithms: Policy Gradient, described in Algorithm 4 of Jia and Zhou (2022b); Q-Learning, described in Appendix B; and q-Learning, described in Algorithm 5. The first two columns specify market configurations (parameters of the simulated geometric Brownian motion). "Mean", "Variance", and "SR" refer respectively to the average mean, variance, and Sharpe ratio of the terminal wealth under the corresponding learned policy over 100 independent runs. We highlight the largest average Sharpe ratio in bold. "NA" refers to the case where the algorithm diverges.
The discretization size is Δt = 1/25.

µ     σ   | Policy Gradient      | Q-Learning           | q-Learning
          | Mean  Variance  SR   | Mean  Variance  SR   | Mean  Variance  SR
-0.5  0.1 | 1.409  0.003  7.134  | 1.409  0.004  6.663  | 1.408  0.004  6.805
-0.3  0.1 | 1.421  0.012  3.880  | 1.409  0.012  3.707  | 1.412  0.012  3.762
-0.1  0.1 | 1.382  0.094  1.285  | 1.327  0.070  1.263  | 1.351  0.081  1.272
 0.0  0.1 | 1.090  0.216  0.193  | 1.098  0.302  0.201  | 1.108  0.321  0.202
 0.1  0.1 | 1.312  0.147  0.835  | 1.265  0.108  0.827  | 1.292  0.133  0.831
 0.3  0.1 | 1.420  0.016  3.312  | 1.402  0.016  3.181  | 1.407  0.016  3.224
 0.5  0.1 | 1.405  0.004  6.422  | 1.403  0.005  6.026  | 1.403  0.004  6.147
-0.5  0.2 | 1.417  0.014  3.526  | 1.416  0.016  3.308  | 1.416  0.016  3.377
-0.3  0.2 | 1.445  0.060  1.915  | 1.428  0.059  1.841  | 1.434  0.059  1.867
-0.1  0.2 | 1.335  0.341  0.608  | 1.402  0.801  0.628  | 1.405  0.589  0.631
 0.0  0.2 | 1.053  0.539  0.070  | 1.168  7.995  0.100  | 1.136  3.422  0.100
 0.1  0.2 | 1.248  0.437  0.380  | 1.349  2.046  0.411  | 1.341  1.005  0.412
 0.3  0.2 | 1.445  0.083  1.634  | 1.422  0.078  1.580  | 1.431  0.081  1.600
 0.5  0.2 | 1.413  0.018  3.173  | 1.412  0.020  2.992  | 1.411  0.019  3.050
-0.5  0.3 | 1.433  0.039  2.305  | 1.432  0.043  2.180  | 1.432  0.041  2.223
-0.3  0.3 | 1.458  0.162  1.247  | 1.468  0.196  1.214  | 1.484  0.217  1.229
-0.1  0.3 | 1.226  0.569  0.335  | 1.624  8.959  0.415  | 1.556  5.558  0.415
 0.0  0.3 | 1.030  0.777  0.036  | NA     NA     NA     | 1.187  23.269  0.066
 0.1  0.3 | 1.151  0.670  0.202  | 1.539  13.741  0.271 | 1.482  13.321  0.271
 0.3  0.3 | 1.448  0.231  1.046  | 1.475  0.322  1.042  | 1.485  0.313  1.053
 0.5  0.3 | 1.431  0.048  2.073  | 1.428  0.052  1.972  | 1.429  0.050  2.008
-0.5  0.4 | 1.457  0.093  1.679  | 1.460  0.102  1.608  | 1.462  0.102  1.638
-0.3  0.4 | 1.417  0.346  0.877  | 1.670  2.363  0.898  | 1.602  1.176  0.905
-0.1  0.4 | 1.147  0.805  0.205  | NA     NA     NA     | 1.711  20.852  0.306
 0.0  0.4 | 1.017  0.893  0.016  | NA     NA     NA     | 1.211  76.040  0.048
 0.1  0.4 | 1.084  0.825  0.109  | NA     NA     NA     | 1.539  25.371  0.196
 0.3  0.4 | 1.408  0.379  0.747  | NA     NA     NA     | 1.609  1.992  0.776
 0.5  0.4 | 1.446  0.102  1.510  | 1.456  0.125  1.455  | 1.461  0.128  1.479

Table 2: Out-of-sample performance comparison in terms of mean, variance and Sharpe ratio when data are generated by a geometric Brownian motion with Δt = 1/250. We compare three algorithms: Policy Gradient, described in Algorithm 4 of Jia and Zhou (2022b); Q-Learning, described in Appendix B; and q-Learning, described in Algorithm 5. The first two columns specify market configurations (parameters of the simulated geometric Brownian motion). "Mean", "Variance", and "SR" refer respectively to the average mean, variance, and Sharpe ratio of the terminal wealth under the corresponding learned policy over 100 independent runs. We highlight the largest average Sharpe ratio in bold. "NA" refers to the case where the algorithm diverges. The discretization size is Δt = 1/250.

µ     σ   | Policy Gradient      | Q-Learning           | q-Learning
          | Mean  Variance  SR   | Mean  Variance  SR   | Mean  Variance  SR
-0.5  0.1 | 1.409  0.003  7.130  | 1.407  0.004  6.671  | 1.407  0.004  6.805
-0.3  0.1 | 1.419  0.012  3.878  | 1.407  0.012  3.710  | 1.410  0.012  3.762
-0.1  0.1 | 1.380  0.092  1.285  | 1.325  0.069  1.264  | 1.347  0.079  1.272
 0.0  0.1 | 1.092  0.218  0.201  | 1.096  0.253  0.201  | 1.106  0.306  0.202
 0.1  0.1 | 1.314  0.148  0.835  | 1.267  0.109  0.827  | 1.293  0.133  0.831
 0.3  0.1 | 1.424  0.017  3.311  | 1.406  0.017  3.183  | 1.411  0.017  3.224
 0.5  0.1 | 1.411  0.004  6.421  | 1.409  0.005  6.032  | 1.409  0.004  6.147
-0.5  0.2 | 1.415  0.014  3.524  | 1.414  0.016  3.312  | 1.413  0.015  3.377
-0.3  0.2 | 1.439  0.057  1.915  | 1.423  0.056  1.843  | 1.428  0.056  1.867
-0.1  0.2 | 1.343  0.329  0.620  | 1.369  0.460  0.628  | 1.390  0.493  0.631
 0.0  0.2 | 1.054  0.546  0.070  | 1.173  9.036  0.100  | 1.128  2.519  0.100
 0.1  0.2 | 1.241  0.413  0.372  | 1.350  1.356  0.411  | 1.340  0.925  0.412
 0.3  0.2 | 1.450  0.084  1.634  | 1.426  0.079  1.581  | 1.435  0.081  1.600
 0.5  0.2 | 1.420  0.018  3.173  | 1.418  0.020  2.995  | 1.418  0.019  3.050
-0.5  0.3 | 1.428  0.037  2.304  | 1.426  0.041  2.183  | 1.426  0.039  2.223
-0.3  0.3 | 1.467  0.171  1.248  | 1.454  0.169  1.215  | 1.466  0.178  1.229
-0.1  0.3 | 1.230  0.552  0.326  | 1.661  10.321  0.415 | 1.495  2.852  0.415
 0.0  0.3 | 1.031  0.845  0.036  | 1.236  38.868  0.066 | 1.184  21.710  0.066
 0.1  0.3 | 1.142  0.716  0.181  | 1.541  16.335  0.266 | 1.456  6.433  0.271
 0.3  0.3 | 1.466  0.230  1.064  | 1.470  0.260  1.043  | 1.486  0.281  1.053
 0.5  0.3 | 1.438  0.049  2.073  | 1.434  0.053  1.974  | 1.435  0.051  2.008
-0.5  0.4 | 1.451  0.085  1.679  | 1.448  0.090  1.611  | 1.449  0.087  1.638
-0.3  0.4 | 1.424  0.281  0.890  | 1.535  0.638  0.898  | 1.544  0.612  0.905
-0.1  0.4 | 1.153  0.777  0.192  | NA     NA     NA     | 1.747  22.299  0.306
 0.0  0.4 | 1.019  1.000  0.021  | NA     NA     NA     | 1.233  77.875  0.048
 0.1  0.4 | 1.091  1.006  0.110  | NA     NA     NA     | 1.696  48.274  0.200
 0.3  0.4 | 1.400  0.384  0.704  | 1.685  2.737  0.771  | 1.595  1.189  0.776
 0.5  0.4 | 1.463  0.114  1.510  | 1.463  0.122  1.457  | 1.466  0.121  1.479

Table 3: Out-of-sample performance comparison in terms of mean, variance and Sharpe ratio when data are generated by a geometric Brownian motion with Δt = 1/2500. We compare three algorithms: Policy Gradient, described in Algorithm 4 of Jia and Zhou (2022b); Q-Learning, described in Appendix B; and q-Learning, described in Algorithm 5. The first two columns specify market configurations (parameters of the simulated geometric Brownian motion). "Mean", "Variance", and "SR" refer respectively to the average mean, variance, and Sharpe ratio of the terminal wealth under the corresponding learned policy over 100 independent runs. We highlight the largest average Sharpe ratio in bold. "NA" refers to the case where the algorithm diverges. The discretization size is Δt = 1/2500.

µ     σ   | Policy Gradient      | Q-Learning           | q-Learning
          | Mean  Variance  SR   | Mean  Variance  SR   | Mean  Variance  SR
-0.5  0.1 | 1.408  0.003  7.130  | 1.406  0.004  6.677  | 1.406  0.004  6.805
-0.3  0.1 | 1.418  0.012  3.878  | 1.406  0.012  3.713  | 1.409  0.012  3.762
-0.1  0.1 | 1.376  0.090  1.285  | 1.324  0.069  1.264  | 1.345  0.077  1.272
 0.0  0.1 | 1.092  0.218  0.201  | 1.096  0.259  0.201  | 1.105  0.300  0.202
 0.1  0.1 | 1.316  0.149  0.835  | 1.269  0.111  0.827  | 1.294  0.133  0.831
 0.3  0.1 | 1.425  0.017  3.311  | 1.407  0.017  3.185  | 1.412  0.017  3.224
 0.5  0.1 | 1.412  0.004  6.420  | 1.410  0.005  6.037  | 1.410  0.004  6.147
-0.5  0.2 | 1.413  0.014  3.524  | 1.411  0.016  3.315  | 1.411  0.015  3.377
-0.3  0.2 | 1.435  0.056  1.915  | 1.419  0.055  1.844  | 1.424  0.056  1.867
-0.1  0.2 | 1.328  0.295  0.632  | 1.366  0.455  0.629  | 1.384  0.486  0.631
 0.0  0.2 | 1.059  0.610  0.074  | 1.174  9.894  0.100  | 1.125  2.521  0.100
 0.1  0.2 | 1.254  0.468  0.380  | 1.360  1.867  0.411  | 1.340  0.952  0.412
 0.3  0.2 | 1.450  0.084  1.634  | 1.428  0.080  1.582  | 1.436  0.081  1.600
 0.5  0.2 | 1.422  0.018  3.173  | 1.419  0.020  2.998  | 1.419  0.019  3.050
-0.5  0.3 | 1.425  0.036  2.304  | 1.422  0.040  2.185  | 1.422  0.039  2.223
-0.3  0.3 | 1.459  0.169  1.248  | 1.449  0.167  1.216  | 1.459  0.177  1.229
-0.1  0.3 | 1.225  0.597  0.327  | 1.690  11.223  0.415 | 1.487  2.943  0.415
 0.0  0.3 | 1.039  0.857  0.044  | 1.206  36.278  0.066 | 1.183  23.798  0.066
 0.1  0.3 | 1.169  0.745  0.202  | 1.457  14.390  0.266 | 1.462  8.590  0.271
 0.3  0.3 | 1.455  0.215  1.064  | 1.488  0.356  1.044  | 1.486  0.296  1.053
 0.5  0.3 | 1.438  0.049  2.073  | 1.435  0.053  1.975  | 1.436  0.051  2.008
-0.5  0.4 | 1.446  0.084  1.679  | 1.443  0.089  1.612  | 1.444  0.086  1.638
-0.3  0.4 | 1.395  0.231  0.873  | 1.529  0.619  0.899  | 1.537  0.622  0.905
-0.1  0.4 | 1.175  0.846  0.229  | NA     NA     NA     | 1.754  24.745  0.306
 0.0  0.4 | 1.026  0.967  0.025  | NA     NA     NA     | 1.170  1.66E+09  0.048
 0.1  0.4 | 1.110  0.973  0.129  | NA     NA     NA     | 1.661  59.872  0.200
 0.3  0.4 | 1.402  0.416  0.720  | 1.646  3.402  0.771  | 1.607  1.743  0.776
 0.5  0.4 | 1.456  0.108  1.510  | 1.469  0.135  1.458  | 1.467  0.128  1.479

Figure 1: Running average rewards of three RL algorithms.
A single state trajectory is generated with length T = 10⁶ and discretized at Δt = 0.1, to which three online algorithms apply: Policy Gradient, described in Algorithm 3 of Jia and Zhou (2022b); Q-Learning, described in Appendix B; and q-Learning, described in Algorithm 4. We repeat the experiment 100 times for each method and plot the average reward over time, with the shaded area indicating the standard deviation. The two dashed horizontal lines are respectively the omniscient optimal average reward without exploration when the model parameters are known, and that optimal average reward less the exploration cost.

where the form of the q-function is derived to satisfy the constraint (29), as discussed in Example 1. The policy associated with this parameterized q-function is π^ψ(·|x) = N(ψ₁x + ψ₂, γe^{ψ₃}). In addition, we have another parameter V that stands for the long-run average. We then compare our online q-learning algorithm against two theoretical benchmarks and two other online algorithms, in terms of the running average reward during the learning process. The first benchmark is the omniscient optimal level, namely the maximum long-run average reward achievable with perfect knowledge of the environment and reward (in which case exploration is unnecessary and only deterministic policies are considered). The second benchmark is the omniscient optimal level less the exploration cost, which is the maximum long-run average reward achievable by an agent who is nevertheless forced to explore under entropy regularization. The other two algorithms are the PG-based ergodic algorithm proposed in (Jia and Zhou, 2022b, Algorithm 3) and the Δt-based Q-learning algorithm presented in Appendix B. In our simulation, to ensure the stationarity of the controlled state process, we set A = −1, B = C = 0 and D = 1. Moreover, we set x₀ = 0, M = N = Q = 2, R = P = 1, and γ = 0.1.
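A minimal on-policy step of this ergodic LQ experiment can be sketched as follows. Only the dynamics, the reward coefficients and γ follow the setup above; the critic, the q-approximator values and the policy parameters in the usage example are illustrative placeholders.

```python
import numpy as np

def lq_step(x, a, dt, rng, A=-1.0, B=0.0, C=0.0, D=1.0):
    """Euler-Maruyama step of dX = (AX + Ba)dt + (CX + Da)dW, with the
    experiment's configuration A = -1, B = C = 0, D = 1 as defaults."""
    return (x + (A * x + B * a) * dt
              + (C * x + D * a) * np.sqrt(dt) * rng.standard_normal())

def reward(x, a, M=2.0, N=2.0, R=1.0, P=1.0, Q=2.0):
    """r(x, a) = -(M/2 x^2 + R x a + N/2 a^2 + P x + Q a), per Section 7.2."""
    return -(M / 2 * x**2 + R * x * a + N / 2 * a**2 + P * x + Q * a)

def td_residual(J, q, V, x, a, r, x_next, dt):
    """Temporal-difference residual delta behind the martingale condition
    of Theorem 12, as used in the ergodic q-learning algorithm (sketch)."""
    return J(x_next) - J(x) + r * dt - q(x, a) * dt - V * dt

# One on-policy step under pi_psi(.|x) = N(psi1*x + psi2, gamma*exp(psi3)):
rng = np.random.default_rng(0)
gamma, dt = 0.1, 0.1
psi1, psi2, psi3 = -0.3, -0.6, 0.0   # illustrative policy parameters
x = 0.0
a = psi1 * x + psi2 + np.sqrt(gamma * np.exp(psi3)) * rng.standard_normal()
x_next = lq_step(x, a, dt, rng)
delta = td_residual(lambda x: -0.4 * x**2, lambda x, a: 0.0, -1.0,
                    x, a, reward(x, a), x_next, dt)
print(delta)
```

In the full algorithm, δ multiplies the test functions to form the increments for θ, ψ and V; averaging δ to zero along the trajectory is precisely the sample analogue of the martingale condition.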
Learning rates for all algorithms are initialized as α_ψ = 0.001 and decay according to l(t) = 1/max{1, log t}. We also repeat the experiment 100 times under different seeds to generate samples. First, we fix Δt = 0.1 and run the three learning algorithms for a sufficiently long time T = 10⁶. All the parameters to be learned are initialized as 0, except for the initial variance of the policies, which is set to 1.

Figure 2: Paths of learned parameters of three RL algorithms. Panels: (a) the path of learned ψ₁; (b) the path of learned ψ₂; (c) the path of learned e^{ψ₃}. A single state trajectory is generated with length T = 10⁶ and discretized at Δt = 0.1, to which three online algorithms apply: Policy Gradient, described in Algorithm 3 of Jia and Zhou (2022b); Q-Learning, described in Appendix B; and q-Learning, described in Algorithm 4. All the policies are restricted to the parametric form π^ψ(·|x) = N(ψ₁x + ψ₂, e^{ψ₃}). The omniscient optimal policy is ψ₁* ≈ −0.354, ψ₂* ≈ −0.708, e^{ψ₃*} ≈ 0.035, shown by the dashed line. We repeat the experiment 100 times for each method and plot as the shaded area the standard deviation of the learned parameters; the width of each shaded area is twice the corresponding standard deviation.

Figure 3: Running average rewards of three RL algorithms with different time discretization sizes. Panels: (a) Δt = 1; (b) Δt = 0.1; (c) Δt = 0.01. A single state trajectory is generated with length T = 10⁵ and discretized at the three step sizes. For each step size, we apply three online algorithms: Policy Gradient, described in Algorithm 3 of Jia and Zhou (2022b); Q-Learning, described in Appendix B; and q-Learning, described in Algorithm 4. We repeat the experiment 100 times for each method and plot the average reward over time, with the shaded area indicating the standard deviation.
Figure 1 plots the running average reward trajectories, along with their standard deviations (which are all too small to be visible in the figure), as the learning process proceeds. Among the three methods, both the PG and q-learning algorithms perform significantly better and converge to the optimal level much faster than Q-learning. The q-learning algorithm is indeed slightly better than the PG one. Figure 2 shows the iterated values of the learned parameters for the three algorithms. Evidently, Q-learning has the slowest convergence. The other two algorithms, while eventually converging at similar speeds, exhibit quite different behaviors on their way to convergence in learning ψ₁ and ψ₂. PG seems to learn these parameters faster than q-learning at the initial stage, but then quickly overshoots the optimal level and takes a long time to correct itself, while q-learning appears much smoother and does not have this issue.

Finally, we vary Δt to examine its impact on the performance of the three algorithms. We take Δt ∈ {1, 0.1, 0.01} and set T = 10⁵ for each experiment. The learning rates and parameter initializations are fixed and the same for all three methods. We repeat each experiment 100 times and present the results in terms of the running average reward in Figure 3, using the shaded area to denote the standard deviation. It is clear that the Q-learning algorithm is very sensitive to Δt. In particular, its performance worsens significantly as Δt becomes smaller; when Δt = 0.01, the algorithm shows almost no improvement at all over time. This sensitivity to Δt is consistent with the observations made in Tallec et al. (2019). The other two algorithms, by contrast, show remarkable robustness across different discretization sizes.

7.3 Off-policy ergodic linear-quadratic control

The experiments reported in Subsection 7.2 are for on-policy learning.
In this subsection we revisit the problem, but assuming that we now have to work off-policy. Specifically, we have a sufficiently long observation of the trajectories of state, action, and reward generated under a behavior policy that is not optimal. In our experiment, we take the behavior policy to be a_t ∼ N(0, 1). We still examine the three algorithms: Policy Gradient described in Algorithm 3 in Jia and Zhou (2022b), Q-Learning described in Appendix B, and q-Learning described in Algorithm 4, with different time-discretization sizes. Because off-policy learning is often necessitated in cases where online interactions are costly (Uehara et al., 2022), we restrict ourselves to an offline dataset generated under the behavior policy. We take ∆t ∈ {1, 0.1, 0.01} and set T = 10^6 for each experiment. We repeat the experiments 100 times, with the results presented in Figure 4. Because it is generally hard or impossible to implement the learned policy and observe the resulting state and reward data in an off-policy setting, we focus on the learned parameters of the policy to see whether they converge to the desired optimal values. The horizontal axis in Figure 4 stands for the number of iterations used to update the policy parameters. In general, policy gradient methods are not applicable in the off-policy setting. This is confirmed by our experiments, which show that the corresponding algorithm diverges quickly in all cases. Q-learning still suffers from sensitivity with respect to the time discretization. The q-learning algorithm is the most stable one and converges in all scenarios. It is worth noting that in Figure 4-(a), the Q-learning and q-learning algorithms converge to the same limits, which are however different from the true optimal parameter values of the continuous-time problem.
This is because the theoretical integrals involved in the value functions are not well approximated by finite sums under coarse time discretization (∆t = 1); hence the corresponding martingale conditions under the learned parameters are not close to the theoretical continuous-time martingale conditions.

Figure 4: Paths of learned parameters of three RL algorithms with different time-discretization sizes. Panels: (a) ∆t = 1; (b) ∆t = 0.1; (c) ∆t = 0.01. A single state trajectory is generated with length T = 10^6 and discretized at each step size, under the behavior policy a_t ∼ N(0, 1). From top to bottom are the paths of learned ψ1, ψ2, and e^{ψ3}, respectively. For each step size, we apply three algorithms: Policy Gradient described in Algorithm 3 in Jia and Zhou (2022b), Q-Learning described in Appendix B, and q-Learning described in Algorithm 4. We repeat the experiments 100 times for each method and plot the learned parameters over time, with the shaded area indicating the standard deviation.

8. Conclusion

The previous trilogy on continuous-time RL under the entropy-regularized exploratory diffusion process framework (Wang et al., 2020; Jia and Zhou, 2022a,b) has respectively studied three key elements: optimal samplers for exploration, PE, and PG. PG is a special instance of policy improvement, and in the discrete-time MDP literature it can be done through the Q-function. However, there was no satisfying Q-learning theory in continuous time, so a different route was taken in Jia and Zhou (2022b) to turn the PG task into a PE problem. This paper fills this gap and adds yet another essential building block to the theoretical foundation of continuous-time RL, by developing a q-learning theory commensurate with the continuous-time setting, attacking the general policy improvement problem, and covering both on- and off-policy learning.
Although the conventional Q-function provides no information about actions in continuous time, its first-order approximation, which we call the q-function, does. Moreover, it turns out that the essential component of the q-function relevant to policy updating is the Hamiltonian associated with the problem, the latter having already played a vital role in classical model-based stochastic control theory. This fact, together with the explicit expression of the optimal stochastic policies derived in Wang et al. (2020), in turn explains and justifies the widely used Boltzmann exploration in classical RL for MDPs. We characterize the q-function as the compensator maintaining the martingality of a process consisting of the value function and the cumulative reward, in both the on-policy and off-policy settings. This characterization enables us to use the martingale-based PE techniques developed in Jia and Zhou (2022a) to simultaneously learn the q-function and the value function. Interestingly, the martingality is on the same process as in PE (Jia and Zhou, 2022a), but with respect to an enlarged filtration containing the policy randomization. The TD version of the resulting algorithm links to the well-known SARSA algorithm in classical Q-learning. A significant and outstanding research question is an analysis of the convergence rates of q-learning algorithms. The existing convergence rate results for discrete-time Q-learning cannot be extended to continuous-time q-learning, due to the continuous state space and general nonlinear function approximations, along with the fact that the behaviors of the Q-function and the q-function can be fundamentally different. A possible remedy is to carry out the convergence analysis within the general framework of stochastic approximation, which is poised to be the subject of a substantial future study. We make a final observation to conclude this paper.
The classical model-based approach typically separates estimation and optimization, à la Wonham's separation theorem (Wonham, 1968) or the plug-in method. This approach first learns a model (including formulating and estimating its coefficients/parameters) and then optimizes over the learned model. Model-free (up to some basic dynamic structure such as diffusion processes or Markov chains) RL takes a different route: it skips estimating a model and learns optimizing policies directly via PG or Q/q-learning. The q-learning theory in this paper suggests that, to learn policies, one needs to learn the q-function or the Hamiltonian. In other words, it is the Hamiltonian, which is a specific aggregation of the model coefficients, rather than each and every individual model coefficient, that needs to be learned/estimated for optimization. Clearly, from a purely computational standpoint, estimating a single function (the Hamiltonian) is much more efficient and robust than estimating multiple functions (b, σ, r, h in the present paper), in terms of avoiding or reducing over-parameterization, sensitivity to errors, and accumulation of errors. More importantly, however, (35) hints that the Hamiltonian/q-function can be learned through temporal differences of the value function, so the tasks of learning and optimizing can be accomplished in a data-driven way. This would not be the case if we chose to learn individual model coefficients separately. This observation, we believe, highlights the fundamental differences between the model-based and model-free approaches.

Acknowledgments

Zhou is supported by a start-up grant and the Nie Center for Intelligent Asset Management at Columbia University. His work is also part of a Columbia-CityU/HK collaborative project that is supported by the InnoHK Initiative, The Government of the HKSAR, and the AIFT Lab.
We are grateful for comments from seminar and conference participants at the Chinese University of Hong Kong, University of Iowa, Columbia University, Seoul National University, POSTECH, Ritsumeikan University, Shanghai University of Finance and Economics, Fudan University, The 2022 INFORMS Annual Meeting in Indianapolis, The 11th Annual Meeting of the FE Branch of the OR Society of China in Shijiazhuang, the Conference in Memory of Tomas Björk in Stockholm, and the Post-Bachelier Congress Workshop in Hong Kong. In particular, we benefited from discussions with Jiro Akahori, Xuefeng Gao, Bong-Gyu Jang, Hyeng Keun Koo, Lingfei Li, Hideo Nagai, Jun Sekine, Wenpin Tang, and David Yao. We are especially indebted to the three anonymous referees for their constructive and detailed comments that have led to an improved version of the paper.

Appendix A. Soft Q-learning in Discrete Time

We review soft Q-learning for discrete-time Markov decision processes (MDPs) here and present the analogy to some of the major results developed in the main text. It is interesting that, to the best of our knowledge, the martingale perspective for MDPs has not been explicitly presented. This in turn suggests that a continuous-time study may provide new aspects and insights even for discrete-time RL. For simplicity, we consider a time-homogeneous MDP X = {X_t, t = 0, 1, 2, ...} with a state space X, an action space A, and transition probabilities P(X_1 = x′ | X_0 = x, a_0 = a) =: p(x′|x, a). Both X and A are finite sets. The expected reward at (x, a) is r(x, a), with a discount factor β ∈ (0, 1). The agent's total expected reward is E[ Σ_{t=0}^∞ β^t r(X_t, a_t) ]. A (stochastic) policy is denoted by π(·|x) ∈ P(A), which is a probability mass function on A.

Appendix A1.
Q-function associated with an arbitrary policy

Define the value function associated with a given policy π by

J(x; π) = E[ Σ_{t=0}^∞ β^t ( r(X_t^π, a_t^π) − γ log π(a_t^π | X_t^π) ) | X_0^π = x ]
        = E[ r(x, a_0^π) − γ log π(a_0^π | x) ] + β E[ J(X_1^π; π) | X_0^π = x ],    (45)

and the Q-function associated with π by

Q(x, a; π) = r(x, a) + E[ Σ_{t=1}^∞ β^t ( r(X_t^π, a_t^π) − γ log π(a_t^π | X_t^π) ) | X_0^π = x, a_0^π = a ]
           = r(x, a) + β E[ J(X_1^a; π) | X_0^π = x, a_0^π = a ].    (46)

Adding the entropy term −γ log π(a|x) to both sides of (46), integrating over a, and noting (45), we obtain a relation between the Q-function and the value function:

E[ Q(x, a^π; π) − J(x; π) − γ log π(a^π | x) ] = 0,  where a^π ∼ π(·|x).    (47)

Moreover, substituting J(X_1^a; π) by means of (47) into the right-hand side of (46), we further obtain

Q(x, a; π) = r(x, a) + β E[ Q(X_1^a, a_1^π; π) − γ log π(a_1^π | X_1^a) | X_0^π = x, a_0^π = a ].    (48)

Applying suitable algorithms (e.g., stochastic approximation) to solve equation (48) leads to the classical SARSA algorithm. On the other hand, rewrite (46) as

J(x; π) = r(x, a) − [ Q(x, a; π) − J(x; π) ] + β E[ J(X_1^a; π) | X_0^π = x, a_0^π = a ].    (49)

Recall that A(x, a; π) := Q(x, a; π) − J(x; π) is called the advantage function for MDPs. The q-function in the main text (see (17)) is hence an advantage rate function in the continuous-time setting. In particular, (47) is analogous to the second constraint on the q-function in (29). Further, (49) implies that

M_s^{π′} := β^s J(X_s^{π′}; π) + Σ_{t=0}^{s−1} β^t [ r(X_t^{π′}, a_t^{π′}) − A(X_t^{π′}, a_t^{π′}; π) ]

is an ({F_s}_{s≥0}, P)-martingale under any policy π′, where F_s is the σ-algebra generated by X_0^{π′}, a_0^{π′}, ..., X_s^{π′}, a_s^{π′}. To see this, we compute

E[ β^s J(X_s^{π′}; π) + Σ_{t=0}^{s−1} β^t [ r(X_t^{π′}, a_t^{π′}) − A(X_t^{π′}, a_t^{π′}; π) ] | F_{s−1} ]
= β^s E[ J(X_s^{π′}; π) | X_{s−1}^{π′}, a_{s−1}^{π′} ] + Σ_{t=0}^{s−1} β^t [ r(X_t^{π′}, a_t^{π′}) − A(X_t^{π′}, a_t^{π′}; π) ]
= β^{s−1} [ J(X_{s−1}^{π′}; π) − r(X_{s−1}^{π′}, a_{s−1}^{π′}) + A(X_{s−1}^{π′}, a_{s−1}^{π′}; π) ] + Σ_{t=0}^{s−1} β^t [ r(X_t^{π′}, a_t^{π′}) − A(X_t^{π′}, a_t^{π′}; π) ]
= β^{s−1} J(X_{s−1}^{π′}; π) + Σ_{t=0}^{s−2} β^t [ r(X_t^{π′}, a_t^{π′}) − A(X_t^{π′}, a_t^{π′}; π) ],

where the second equality follows from (49).
(50)

Note that here F_s is larger than the usual historical information set up to time s, denoted by H_s, which is the σ-algebra generated by X_0^{π′}, a_0^{π′}, ..., X_s^{π′} without the last action a_s^{π′} (i.e., H_s is F_s excluding a_s^{π′}). Because M_s^{π′} is measurable with respect to the smaller σ-algebra H_s, it is automatically an ({H_s}_{s≥0}, P)-martingale. The above analysis shows that M_s^{π′} is a martingale for any π′, whether the target policy π or a behavior one, so long as the advantage function A is taken as the compensator in defining M_s^{π′}. This is the essential reason why Q-learning works for both on- and off-policy learning. However, the conclusion is not necessarily true for other types of learning methods. For example, let us take a different compensator and investigate the process β^s J(X_s^{π′}; π) + Σ_{t=0}^{s−1} β^t [ r(X_t^{π′}, a_t^{π′}) − γ log π(a_t^{π′} | X_t^{π′}) ]. The continuous-time counterpart of this process is the martingale required in the policy gradient method; see (Jia and Zhou, 2022b, Theorem 4). Now, conditioned on H_{s−1} and when π′ = π, we have

E[ β^s J(X_s^π; π) + Σ_{t=0}^{s−1} β^t ( r(X_t^π, a_t^π) − γ log π(a_t^π | X_t^π) ) | H_{s−1} ]
= β^{s−1} E[ J(X_{s−1}^π; π) − r(X_{s−1}^π, a_{s−1}^π) + A(X_{s−1}^π, a_{s−1}^π; π) | X_{s−1}^π ] + Σ_{t=0}^{s−2} β^t ( r(X_t^π, a_t^π) − γ log π(a_t^π | X_t^π) ) + β^{s−1} E[ r(X_{s−1}^π, a_{s−1}^π) − γ log π(a_{s−1}^π | X_{s−1}^π) | X_{s−1}^π ]
= β^{s−1} J(X_{s−1}^π; π) + Σ_{t=0}^{s−2} β^t ( r(X_t^π, a_t^π) − γ log π(a_t^π | X_t^π) ),

where the last equality is due to (47), which can be applied because a_{s−1}^π is generated under the policy π. This shows that the process is an ({H_s}_{s≥0}, P)-martingale when π′ = π, which in turn underpins on-policy learning.
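To illustrate, the recursion (48) and the identity (47) can be checked numerically. Below is a minimal sketch in Python on a made-up 2-state, 2-action MDP (all transition probabilities, rewards, discount, temperature, and the policy are illustrative assumptions, not taken from the paper); it iterates the soft Bellman operator of (46)/(48) to a fixed point and verifies the consistency of the resulting pair (J, Q):

```python
import numpy as np

# Hypothetical MDP data: P[x, a, x'] transition probabilities, r[x, a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])
beta, gamma = 0.9, 0.5                    # discount factor and temperature
pi = np.array([[0.6, 0.4], [0.3, 0.7]])   # a fixed stochastic policy pi(a|x)

Q = np.zeros((2, 2))
for _ in range(500):
    # identity (47): J(x) = E_{a~pi}[Q(x, a) - gamma * log pi(a|x)]
    J = (pi * (Q - gamma * np.log(pi))).sum(axis=1)
    # recursion (46): Q(x, a) = r(x, a) + beta * E_{x'~p(.|x,a)}[J(x')]
    Q = r + beta * (P @ J)

# Bellman consistency of the converged pair (J, Q)
J = (pi * (Q - gamma * np.log(pi))).sum(axis=1)
assert np.allclose(Q, r + beta * (P @ J), atol=1e-8)
```

The fixed-point iteration converges geometrically because the soft Bellman operator is a β-contraction in the sup norm.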
However, when conditioned on F_{s−1}, we have

E[ β^s J(X_s^{π′}; π) + Σ_{t=0}^{s−1} β^t ( r(X_t^{π′}, a_t^{π′}) − γ log π(a_t^{π′} | X_t^{π′}) ) | F_{s−1} ]
= β^s E[ J(X_s^{π′}; π) | X_{s−1}^{π′}, a_{s−1}^{π′} ] + Σ_{t=0}^{s−1} β^t ( r(X_t^{π′}, a_t^{π′}) − γ log π(a_t^{π′} | X_t^{π′}) )
= β^{s−1} [ J(X_{s−1}^{π′}; π) − r(X_{s−1}^{π′}, a_{s−1}^{π′}) + A(X_{s−1}^{π′}, a_{s−1}^{π′}; π) ] + Σ_{t=0}^{s−1} β^t ( r(X_t^{π′}, a_t^{π′}) − γ log π(a_t^{π′} | X_t^{π′}) )
= β^{s−1} J(X_{s−1}^{π′}; π) + Σ_{t=0}^{s−2} β^t ( r(X_t^{π′}, a_t^{π′}) − γ log π(a_t^{π′} | X_t^{π′}) ) + β^{s−1} [ A(X_{s−1}^{π′}, a_{s−1}^{π′}; π) − γ log π(a_{s−1}^{π′} | X_{s−1}^{π′}) ].

So the same process is in general not an ({F_s}_{s≥0}, P)-martingale, unless A(X_{s−1}^{π′}, a_{s−1}^{π′}; π) = γ log π(a_{s−1}^{π′} | X_{s−1}^{π′}) for any X_{s−1}^{π′}, a_{s−1}^{π′}. Two conclusions can be drawn from this example: on one hand, policy gradient works for on-policy but not off-policy learning; on the other hand, it is important to choose the correct filtration to ensure martingality, and hence the correct test functions, in designing learning algorithms. The analysis in (50) yields a joint martingale characterization of (J, A), analogous to Theorem 7. It is curious that, to the best of our knowledge, such a martingale condition, while not hard to derive in the MDP setting, has not been explicitly introduced or utilized to study Q-learning in the existing RL literature.

Appendix A2. Q-function associated with the optimal policy

Now consider the value function and Q-function associated with the optimal policy, denoted by J* and Q* respectively. Recall that the Bellman equation implies

J*(x) = sup_π E_{a∼π} { r(x, a) − γ log π(a|x) + β E[ J*(X_1^a) | X_0 = x, a_0 = a ] }.    (51)

It follows from (46) that

Q*(x, a) = r(x, a) + β E[ J*(X_1^a) | X_0 = x, a_0 = a ].    (52)

Hence (51) becomes J*(x) = sup_π E_{a∼π} [ Q*(x, a) − γ log π(a|x) ]. Solving the optimization on the right-hand side, we get the optimal policy

π*(a|x) ∝ exp{ (1/γ) Q*(x, a) },  or  π*(a|x) = exp{ (1/γ) Q*(x, a) } / Σ_{a′∈A} exp{ (1/γ) Q*(x, a′) }.

Denote

soft_γ max_a Q*(x, a) := E_{a∼π*} [ Q*(x, a) − γ log π*(a|x) ] = γ log Σ_{a∈A} exp{ (1/γ) Q*(x, a) }.
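The soft-max operator and the Gibbs policy above can be computed stably with the usual max-shift trick. The following sketch (with made-up Q-values and temperature) also verifies the normalization identity Σ_a exp{(1/γ)[Q*(x, a) − J*(x)]} = 1 and the γ → 0 limit:

```python
import numpy as np

def soft_max(q_row, gamma):
    """gamma * log(sum_a exp(q(a)/gamma)), computed with a max shift."""
    m = q_row.max()
    return m + gamma * np.log(np.exp((q_row - m) / gamma).sum())

def gibbs_policy(q_row, gamma):
    """pi(a) proportional to exp(q(a)/gamma)."""
    w = np.exp((q_row - q_row.max()) / gamma)
    return w / w.sum()

Q_row = np.array([1.0, 0.5, -0.2])   # hypothetical Q*(x, .) for a fixed state x
gamma = 0.5
J_star = soft_max(Q_row, gamma)
pi_star = gibbs_policy(Q_row, gamma)

# normalization identity: sum_a exp{(Q* - J*)/gamma} = 1
assert abs(np.exp((Q_row - J_star) / gamma).sum() - 1.0) < 1e-10
# entropy-adjusted expectation under pi* equals the soft-max value
assert abs((pi_star * (Q_row - gamma * np.log(pi_star))).sum() - J_star) < 1e-10
# gamma -> 0 recovers the ordinary max
assert abs(soft_max(Q_row, 1e-4) - Q_row.max()) < 1e-3
```

The max shift leaves the result unchanged algebraically but avoids overflow when Q-values are large relative to γ.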
Then the Bellman equation becomes J*(x) = soft_γ max_a Q*(x, a), and (52) reduces to

Q*(x, a) = r(x, a) + β E[ soft_γ max_{a′} Q*(X_1^a, a′) | X_0 = x, a_0 = a ],    (53)

which is the foundation for (off-policy) Q-learning algorithms; see, e.g., (Sutton and Barto, 2018, p. 131). On the other hand, because J*(x) = soft_γ max_a Q*(x, a) = γ log Σ_{a∈A} exp{ (1/γ) Q*(x, a) }, we have

Σ_{a∈A} exp{ (1/γ) [ Q*(x, a) − J*(x) ] } = 1.    (54)

Recall the advantage function A*(x, a) = Q*(x, a) − J*(x). Thus (54) is analogous to (25). Moreover, rearranging (52) yields

J*(x) = r(x, a) − [ Q*(x, a) − J*(x) ] + β E[ J*(X_1^a) | X_0 = x, a_0 = a ].

Based on a derivation similar to that in Appendix A1, we obtain that

β^s J*(X_s^π) + Σ_{t=0}^{s−1} β^t [ r(X_t^π, a_t^π) − A*(X_t^π, a_t^π) ]

is an ({F_s}_{s≥0}, P)-martingale for any policy π; hence it is analogous to the martingale characterization of the optimal q-function in Theorem 9.

Appendix B. ∆t-Parameterized Q-Learning

Instead of the (little) q-learning approach developed in this paper, one can also apply the conventional (big) Q-learning method to the Q-function Q_{∆t}(t, x, a; π) defined in (15). Note that ∆t > 0 becomes a parameter in the latter approach. In this Appendix, we review this ∆t-parameterized Q-learning and the associated Q-learning algorithms, which are used in the simulation experiments of the paper for comparison purposes. Given a policy π and a time step ∆t > 0, Q-learning focuses on learning Q_{∆t}(t, x, a; π). As with q-learning, there are two distinct scenarios: when the normalizing constant is easily available and when it is not. In the following, we denote by Q^ψ_{∆t} the function approximator of the Q-function, where ψ ∈ Ψ ⊆ R^{L_ψ} is the finite-dimensional parameter vector to be learned, and by π^ψ the associated policy, where π^ψ(a|t, x) ∝ exp{ (1/(γ∆t)) Q^ψ_{∆t}(t, x, a) }.

Appendix B1.
When the normalizing constant is available

When the normalizing constant ∫_A exp{ (1/(γ∆t)) Q^ψ_{∆t}(t, x, a) } da can be explicitly calculated, we can obtain the exact expression of π^ψ as

π^ψ(a|t, x) = exp{ (1/(γ∆t)) Q^ψ_{∆t}(t, x, a) } / ∫_A exp{ (1/(γ∆t)) Q^ψ_{∆t}(t, x, a′) } da′.

Then the conventional Q-learning method (e.g., SARSA; Sutton and Barto 2018, Chapter 10) leads to the following updating rule for the Q-function parameters ψ:

ψ ← ψ + α_ψ [ Q^ψ_{∆t}(t+∆t, X_{t+∆t}^{a_t}, a_{t+∆t}) − γ log π^ψ(a_{t+∆t} | t+∆t, X_{t+∆t}^{a_t}) ∆t − Q^ψ_{∆t}(t, X_t, a_t) + r(t, X_t, a_t) ∆t − β Q^ψ_{∆t}(t, X_t, a_t) ∆t ] ∂Q^ψ_{∆t}/∂ψ (t, X_t, a_t),    (55)

where a_t ∼ π^ψ(·|t, X_t) and a_{t+∆t} ∼ π^ψ(·|t+∆t, X_{t+∆t}).

Appendix B2. When the normalizing constant is unavailable

When the normalizing constant is not available, we cannot compute the exact expression of π^ψ. In this case we follow an analysis analogous to that in Subsection 5.1, using a family of policies {π^φ(a|t, x)}_{φ∈Φ} whose densities can be easily calculated to approximate π^ψ. Theorem 10 implies that if we start from a policy π^φ, φ ∈ Φ, and evaluate its corresponding Q-function

(1/(γ∆t)) Q^ψ_{∆t}(t, x, a) ≈ (1/γ) H( t, x, a, ∂J/∂x (t, x; π^φ), ∂²J/∂x² (t, x; π^φ) ) + (1/(γ∆t)) J(t, x; π^φ),

then a new policy π^{φ′}, φ′ ∈ Φ, improves π^φ so long as

D_KL( π^{φ′}(·|t, x) ∥ exp{ (1/(γ∆t)) Q^ψ_{∆t}(t, x, ·) } ) ≤ D_KL( π^φ(·|t, x) ∥ exp{ (1/(γ∆t)) Q^ψ_{∆t}(t, x, ·) } ).

Thus, following the same line of argument as in Subsection 5.1, we derive the updating rule of φ at every step as

φ ← φ − α_φ [ log π^φ(a_t|t, X_t) − (1/(γ∆t)) Q^ψ_{∆t}(t, X_t, a_t) ] ∂/∂φ log π^φ(a_t|t, X_t),

where a_t ∼ π^φ(·|t, X_t).

Appendix B3. Ergodic tasks

For the general regularized ergodic problem formulated in Section 6, we define the Q-function:

Q_{∆t}(x, a; π) = J(x; π) + [ H( x, a, ∂J/∂x (x; π), ∂²J/∂x² (x; π) ) − V(π) ] ∆t.    (56)

Then we have all the parallel results regarding the relations among the Q-function, the value function, and the intertemporal increment of the Q-function, and can devise algorithms accordingly. For illustration, we only present the results when

π^ψ(a|x) = exp{ (1/(γ∆t)) Q^ψ_{∆t}(x, a) } / ∫_A exp{ (1/(γ∆t)) Q^ψ_{∆t}(x, a′) } da′

is explicitly available.
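The ∆t-scaling in the updating rule (55) of Appendix B1 can be sketched in code. Below is a minimal Python illustration with a hypothetical linear function approximator Q^ψ(t, x, a) = ψ·φ(t, x, a); the feature map, step sizes, and the sampled transition are all illustrative assumptions, not part of the paper's algorithms:

```python
import numpy as np

def phi(t, x, a):
    # hypothetical feature map for a linear approximator Q_psi = psi . phi
    return np.array([1.0, x, a, x * a])

def sarsa_update(psi, t, x, a, r, x_next, a_next, dt, beta, gamma, lr, logp_next):
    """One step of the dt-parameterized SARSA-style update in the spirit of (55).

    logp_next stands for log pi^psi(a_next | t+dt, x_next), assumed given.
    """
    q_now = psi @ phi(t, x, a)
    q_next = psi @ phi(t + dt, x_next, a_next)
    # TD error: every "rate" term (reward, entropy, discounting) is scaled
    # by dt, while the difference of Q-values is not.
    delta = q_next - gamma * logp_next * dt - q_now + r * dt - beta * q_now * dt
    return psi + lr * delta * phi(t, x, a)

psi = np.zeros(4)
psi = sarsa_update(psi, t=0.0, x=1.0, a=0.5, r=1.0, x_next=1.1, a_next=0.4,
                   dt=0.1, beta=0.1, gamma=0.5, lr=0.01, logp_next=-0.9)
```

Because the Q-value difference is not scaled by ∆t while the rate terms are, the relative weight of the two parts of the TD error changes with ∆t, which is one way to see the ∆t-sensitivity of conventional Q-learning observed in the experiments.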
The updating rules of the parameters ψ and V are

ψ ← ψ + α_ψ [ Q^ψ_{∆t}(X_{t+∆t}^{a_t}, a_{t+∆t}) − γ log π^ψ(a_{t+∆t} | X_{t+∆t}^{a_t}) ∆t − Q^ψ_{∆t}(X_t, a_t) + r(X_t, a_t) ∆t − V ∆t ] ∂Q^ψ_{∆t}/∂ψ (X_t, a_t),
V ← V + α_V [ Q^ψ_{∆t}(X_{t+∆t}^{a_t}, a_{t+∆t}) − γ log π^ψ(a_{t+∆t} | X_{t+∆t}^{a_t}) ∆t − Q^ψ_{∆t}(X_t, a_t) + r(X_t, a_t) ∆t − V ∆t ],    (57)

where a_t ∼ π^ψ(·|X_t) and a_{t+∆t} ∼ π^ψ(·|X_{t+∆t}).

Appendix B4. Q-learning for mean–variance portfolio selection

Motivated by the form of the solution in Wang and Zhou (2020), we parameterize the Q-function by

Q^ψ_{∆t}(t, x, a; w) = −e^{−ψ3(T−t)} [ (x−w)² + ψ2 a(x−w) + (1/2) e^{−ψ1} a² ] + ψ4(t² − T²) + ψ5(t − T) + (w − z)².

Here the coefficient of a² is −(1/2) e^{−ψ3(T−t)−ψ1} < 0, which guarantees the concavity of the Q-function in a so that the function can be maximized. This Q-function in turn gives rise to an explicit form of policy, which is Gaussian:

π^ψ(·|t, x; w) = N( −ψ2 e^{ψ1}(x−w), γ∆t e^{ψ3(T−t)+ψ1} ) ∝ exp{ (1/(γ∆t)) Q^ψ_{∆t}(t, x, ·) }.

We need the following derivative in the corresponding Q-learning algorithms:

∂Q^ψ_{∆t}/∂ψ (t, x, a; w) = ( (1/2) e^{−ψ3(T−t)−ψ1} a², −e^{−ψ3(T−t)} a(x−w), (T−t) e^{−ψ3(T−t)} [ (x−w)² + ψ2 a(x−w) + (1/2) e^{−ψ1} a² ], t² − T², t − T )⊤.

The parameters in the Q-function can then be updated based on (55), and the Lagrange multiplier can be updated similarly as in Algorithm 5.

Appendix B5. Q-learning for ergodic LQ control

For the ergodic LQ control problem formulated in Subsection 7.2, we can prove that the value function and the Hamiltonian are both quadratic in (x, a); see, e.g., Wang et al. (2020). Therefore we parameterize the Q-function by a general quadratic function:

Q^ψ_{∆t}(x, a) = −(e^{−ψ3}/2) (a − ψ1 x − ψ2)² + ψ4 x² + ψ5 x.

Here we omit the constant term, since the value function is unique only up to a constant. The coefficient −e^{−ψ3}/2 < 0 ensures that the Q-function is concave in a. The optimal value V is an extra parameter. The Q-function leads to

π^ψ(·|x) = N( ψ1 x + ψ2, γ∆t e^{ψ3} ) ∝ exp{ (1/(γ∆t)) Q^ψ_{∆t}(x, ·) }.

Finally, we compute the following derivative for use in Q-learning algorithms:

∂Q^ψ_{∆t}/∂ψ (x, a) = ( e^{−ψ3}(a − ψ1 x − ψ2)x, e^{−ψ3}(a − ψ1 x − ψ2), (e^{−ψ3}/2)(a − ψ1 x − ψ2)², x², x )⊤.

Then the parameters can be updated based on (57).

Appendix C.
Proofs of Statements

Proof of Theorem 2

We first prove a preliminary result about the entropy-maximizing density.

Lemma 13 Let γ > 0 and a measurable function q : A → R with ∫_A exp{ (1/γ) q(a) } da < ∞ be given. Then

π*(a) = exp{ (1/γ) q(a) } / ∫_A exp{ (1/γ) q(a′) } da′ ∈ P(A)

is the unique maximizer of the following problem:

max_{π(·)∈P(A)} ∫_A [ q(a) − γ log π(a) ] π(a) da.    (58)

Proof Set ν = 1 − log ∫_A exp{ (1/γ) q(a) } da and consider a new problem:

max_{π(·)∈P(A)} ∫_A { [ q(a) − γ log π(a) ] π(a) + γν π(a) } da − γν.    (59)

Since ∫_A π(a) da = 1 due to π(·) ∈ P(A), problems (58) and (59) are equivalent; hence we focus on problem (59). Relaxing the constraint of being a density function, we deduce

max_{π(·)∈P(A)} ∫_A { [ q(a) − γ log π(a) ] π(a) + γν π(a) } da − γν ≤ ∫_A max_{π>0} { [ q(a) − γ log π ] π + γν π } da − γν.    (60)

The unique maximizer of the inner optimization on the right-hand side of (60) is

π(a) = exp{ (1/γ) q(a) + ν − 1 } = exp{ (1/γ) q(a) } / ∫_A exp{ (1/γ) q(a′) } da′ =: π*(a).

As π* ∈ P(A), it is the optimal solution to (59) or, equivalently, (58). The uniqueness follows from that of the inner optimization in (60).

Now we are ready to prove Theorem 2. To ease notation we use the q-function q, even though it had not been introduced prior to Theorem 2 in the main text. We also employ the results of Theorem 6, whose proof relies on neither Theorem 2 nor its proof here. Recall from (16) that q(t, x, a; π) is defined to be

q(t, x, a; π) = ∂J/∂t (t, x; π) + H( t, x, a, ∂J/∂x (t, x; π), ∂²J/∂x² (t, x; π) ) − βJ(t, x; π),

with the Hamiltonian H defined in (3).

Proof [Proof of Theorem 2] We work on the equivalent formulation (8)–(9).
For two given admissible policies π and π′, and any 0 ≤ t ≤ T, apply Itô's lemma to the process e^{−βs} J(s, X̃_s^{π′}; π), s ∈ [t, T], which involves the value function under π but the state process under π′:

e^{−βT} J(T, X̃_T^{π′}; π) − e^{−βt} J(t, X̃_t^{π′}; π) + ∫_t^T e^{−βs} ∫_A [ r(s, X̃_s^{π′}, a) − γ log π′(a|s, X̃_s^{π′}) ] π′(a|s, X̃_s^{π′}) da ds
= ∫_t^T e^{−βs} ∫_A [ ∂J/∂t (s, X̃_s^{π′}; π) + H( s, X̃_s^{π′}, a, ∂J/∂x (s, X̃_s^{π′}; π), ∂²J/∂x² (s, X̃_s^{π′}; π) ) − βJ(s, X̃_s^{π′}; π) − γ log π′(a|s, X̃_s^{π′}) ] π′(a|s, X̃_s^{π′}) da ds + ∫_t^T e^{−βs} ∂J/∂x (s, X̃_s^{π′}; π) · σ̃( s, X̃_s^{π′}, π′(·|s, X̃_s^{π′}) ) dW_s
= ∫_t^T e^{−βs} ∫_A [ q(s, X̃_s^{π′}, a; π) − γ log π′(a|s, X̃_s^{π′}) ] π′(a|s, X̃_s^{π′}) da ds + ∫_t^T e^{−βs} ∂J/∂x (s, X̃_s^{π′}; π) · σ̃( s, X̃_s^{π′}, π′(·|s, X̃_s^{π′}) ) dW_s.

Because π′ = Iπ, it follows from Lemma 13 that for any (s, y) ∈ [0, T] × R^d, we have

∫_A [ q(s, y, a; π) − γ log π′(a|s, y) ] π′(a|s, y) da ≥ ∫_A [ q(s, y, a; π) − γ log π(a|s, y) ] π(a|s, y) da = 0,

where the equality is due to (20) in Theorem 6. Thus,

e^{−βT} J(T, X̃_T^{π′}; π) − e^{−βt} J(t, X̃_t^{π′}; π) + ∫_t^T e^{−βs} ∫_A [ r(s, X̃_s^{π′}, a) − γ log π′(a|s, X̃_s^{π′}) ] π′(a|s, X̃_s^{π′}) da ds ≥ ∫_t^T e^{−βs} ∂J/∂x (s, X̃_s^{π′}; π) · σ̃( s, X̃_s^{π′}, π′(·|s, X̃_s^{π′}) ) dW_s.

The above argument and the resulting inequalities remain valid when T is replaced by T ∧ u_n, where u_n = inf{ s ≥ t : |X̃_s^{π′}| ≥ n } is a sequence of stopping times. Therefore,

J(t, x; π) ≤ E^{P^W}[ e^{−β(T∧u_n−t)} h(X̃_{T∧u_n}^{π′}) + ∫_t^{T∧u_n} e^{−β(s−t)} ∫_A [ r(s, X̃_s^{π′}, a) − γ log π′(a|s, X̃_s^{π′}) ] π′(a|s, X̃_s^{π′}) da ds | X̃_t^{π′} = x ].

It follows from Assumption 1-(iv), Definition 1-(iii), and the moment estimate in Jia and Zhou (2022b, Lemma 2) that there exist constants µ, C > 0 such that

E^{P^W}[ ∫_t^{T∧u_n} e^{−β(s−t)} ∫_A | r(s, X̃_s^{π′}, a) − γ log π′(a|s, X̃_s^{π′}) | π′(a|s, X̃_s^{π′}) da ds | X̃_t^{π′} = x ] ≤ C ∫_t^T E^{P^W}[ 1 + |X̃_s^{π′}|^µ | X̃_t^{π′} = x ] ds ≤ C(1 + |x|^µ).

By the dominated convergence theorem, we have as n → ∞,

E^{P^W}[ ∫_t^{T∧u_n} e^{−β(s−t)} ∫_A [ r(s, X̃_s^{π′}, a) − γ log π′(a|s, X̃_s^{π′}) ] π′(a|s, X̃_s^{π′}) da ds | X̃_t^{π′} = x ] → E^{P^W}[ ∫_t^T e^{−β(s−t)} ∫_A [ r(s, X̃_s^{π′}, a) − γ log π′(a|s, X̃_s^{π′}) ] π′(a|s, X̃_s^{π′}) da ds | X̃_t^{π′} = x ].

In addition, E^{P^W}[ e^{−β(T∧u_n−t)} h(X̃_{T∧u_n}^{π′}) | X̃_t^{π′} = x ] ... E^{P^W}[ h(X̃_T^{π′}) 1{ max_{t≤s≤T} |X̃_s^{π′}| ... } ] ... Suppose there exist (t*, x*, a*) and ϵ > 0 such that f(t*, x*, a*) > ϵ.
Because f is continuous, there exists δ > 0 such that f(u, x′, a′) > ϵ/2 for all (u, x′, a′) with |u − t*| ∨ |x′ − x*| ∨ |a′ − a*| < δ. Here ∨ means taking the larger one, i.e., u ∨ v = max{u, v}. Now consider the state process, still denoted by X^π, starting from (t*, x*), namely, {X_s^π, t* ≤ s ≤ T} follows (6) with X_{t*}^π = x* and a_{t*}^π = a*. Define

τ = inf{ u ≥ t* : |u − t*| > δ or |X_u^π − x*| > δ } = inf{ u ≥ t* : |X_u^π − x*| > δ } ∧ (t* + δ).

The continuity of X^π implies that τ > t*, P-almost surely. Here ∧ means taking the smaller one, i.e., u ∧ v = min{u, v}. We have already proved that there exists Ω_0 ∈ F with P(Ω_0) = 0 such that for all ω ∈ Ω \ Ω_0, ∫_{t*}^s e^{−βu} f(u, X_u^π(ω), a_u^π(ω)) du = 0 for all s ∈ [t*, T]. It follows from Lebesgue's differentiation theorem that for any ω ∈ Ω \ Ω_0, f(s, X_s^π(ω), a_s^π(ω)) = 0 for a.e. s ∈ [t*, τ(ω)]. Consider the set

Z(ω) = { s ∈ [t*, τ(ω)] : a_s^π(ω) ∈ B_δ(a*) } ⊆ [t*, τ(ω)],

where B_δ(a*) = { a ∈ A : |a − a*| < δ } is the δ-neighborhood of a*. Because f(s, X_s^π(ω), a_s^π(ω)) > ϵ/2 when s ∈ Z(ω), we conclude that Z(ω) has Lebesgue measure zero for any ω ∈ Ω \ Ω_0. That is,

∫_{[t*,T]} 1{ s ≤ τ(ω) } 1{ a_s^π(ω) ∈ B_δ(a*) } ds = 0.

Integrating over ω with respect to P and applying Fubini's theorem, we obtain

0 = ∫_Ω ∫_{[t*,T]} 1{ s ≤ τ(ω) } 1{ a_s^π(ω) ∈ B_δ(a*) } ds P(dω) = ∫_{[t*,T]} E[ 1{ s ≤ τ } 1{ a_s^π ∈ B_δ(a*) } ] ds
= ∫_{t*}^T E[ 1{ s ≤ τ } P( a_s^π ∈ B_δ(a*) | F_s ) ] ds = ∫_{t*}^T E[ 1{ s ≤ τ } ∫_{B_δ(a*)} π(a|s, X_s^π) da ] ds
≥ min_{|x′−x*|≤δ, |u−t*|≤δ} { ∫_{B_δ(a*)} π(a|u, x′) da } ∫_{t*}^T E[ 1{ s ≤ τ } ] ds = min_{|x′−x*|≤δ, |u−t*|≤δ} { ∫_{B_δ(a*)} π(a|u, x′) da } E[ (τ ∧ T) − t* ] ≥ 0.

Since τ > t* P-almost surely, the above implies min_{|x′−x*|≤δ, |u−t*|≤δ} { ∫_{B_δ(a*)} π(a|u, x′) da } = 0. However, this contradicts Definition 1 on admissible policies. Indeed, Definition 1-(i) stipulates supp π(·|t, x) = A for any (t, x); hence ∫_{B_δ(a*)} π(a|t, x) da > 0. Then the continuity in Definition 1-(ii) yields min_{|x′−x*|≤δ, |u−t*|≤δ} { ∫_{B_δ(a*)} π(a|u, x′) da } > 0, a contradiction. Hence we conclude q(t, x, a; π) = q̂(t, x, a) for every (t, x, a).
(ii) Applying Itô's lemma to e^{−βs} J(s, X_s^π; π), we get

e^{−βs} J(s, X_s^π; π) − e^{−βt} J(t, x; π) + ∫_t^s e^{−βu} [ r(u, X_u^π, a_u^π) − q̂(u, X_u^π, a_u^π) ] du
= ∫_t^s e^{−βu} [ q(u, X_u^π, a_u^π; π) − q̂(u, X_u^π, a_u^π) ] du + ∫_t^s e^{−βu} ∂J/∂x (u, X_u^π; π) · σ(u, X_u^π, a_u^π) dW_u.

So, when q̂ ≡ q(·, ·, ·; π), the above process is an ({F_s}_{s≥0}, P)-martingale on [t, T].

(iii) Let π′ ∈ Π be given satisfying the assumption in this part. It then follows from (ii) that ∫_t^s e^{−βu} [ q̂(u, X_u^{π′}, a_u^{π′}) − q(u, X_u^{π′}, a_u^{π′}; π) ] du is an ({F_s}_{s≥0}, P)-martingale. If the desired conclusion were not true, then the same argument as in (i) would apply to conclude that min_{|x′−x*|≤δ, |u−t*|≤δ} { ∫_{B_δ(a*)} π′(a|u, x′) da } = 0, which is a contradiction because π′ is admissible.

Proof of Theorem 7 (i) We only prove the "only if" part, because the "if" part is straightforward following the same argument as in the proof of Theorem 6. So, we assume e^{−βt} Ĵ(t, X_t^π) + ∫_0^t e^{−βs} [ r(s, X_s^π, a_s^π) − q̂(s, X_s^π, a_s^π) ] ds is an ({F_s}_{s≥0}, P)-martingale. Hence for any initial state (t, x), we have

E^P[ e^{−βT} Ĵ(T, X_T^π) + ∫_t^T e^{−βs} [ r(s, X_s^π, a_s^π) − q̂(s, X_s^π, a_s^π) ] ds | F_t ] = e^{−βt} Ĵ(t, x).

We integrate over the action randomization with respect to the policy π, and then obtain

E^{P^W}[ e^{−βT} Ĵ(T, X̃_T^π) + ∫_t^T e^{−βs} ∫_A [ r(s, X̃_s^π, a) − q̂(s, X̃_s^π, a) ] π(a|s, X̃_s^π) da ds | F_t^W ] = e^{−βt} Ĵ(t, x).

This, together with the terminal condition Ĵ(T, x) = h(x) and the constraint (21), yields

Ĵ(t, x) = E^{P^W}[ e^{−β(T−t)} h(X̃_T^π) + ∫_t^T e^{−β(s−t)} ∫_A [ r(s, X̃_s^π, a) − γ log π(a|s, X̃_s^π) ] π(a|s, X̃_s^π) da ds | F_t^W ].

Hence Ĵ(t, x) = J(t, x; π) for all (t, x) ∈ [0, T] × R^d. Furthermore, based on Theorem 6, the martingale condition implies that q̂(t, x, a) = q(t, x, a; π) for all (t, x, a) ∈ [0, T] × R^d × A.

(ii) This follows immediately from Theorem 6-(ii).

(iii) Let π′ ∈ Π be given satisfying the assumption in this part. Define

r̂(t, x, a) := ∂Ĵ/∂t (t, x) + b(t, x, a) · ∂Ĵ/∂x (t, x) + (1/2) σσ⊤(t, x, a) ∘ ∂²Ĵ/∂x² (t, x) − βĴ(t, x).
Then e^{−βs} Ĵ(s, X_s^{π′}) − ∫_t^s e^{−βu} r̂(u, X_u^{π′}, a_u^{π′}) du is an ({F_s}_{s≥0}, P)-local martingale, which follows from applying Itô's lemma to this process and arguing similarly to the proof of Theorem 6-(ii). As a result, ∫_t^s e^{−βu} [ r(u, X_u^{π′}, a_u^{π′}) − q̂(u, X_u^{π′}, a_u^{π′}) + r̂(u, X_u^{π′}, a_u^{π′}) ] du is an ({F_s}_{s≥0}, P)-martingale. The same argument as in the proof of Theorem 6 applies, and we conclude

q̂(t, x, a) = r(t, x, a) + r̂(t, x, a)
= ∂Ĵ/∂t (t, x) + b(t, x, a) · ∂Ĵ/∂x (t, x) + (1/2) σσ⊤(t, x, a) ∘ ∂²Ĵ/∂x² (t, x) − βĴ(t, x) + r(t, x, a)
= ∂Ĵ/∂t (t, x) + H( t, x, a, ∂Ĵ/∂x (t, x), ∂²Ĵ/∂x² (t, x) ) − βĴ(t, x)

for every (t, x, a). Now the constraint (21) reads

∫_A [ ∂Ĵ/∂t (t, x) + H( t, x, a, ∂Ĵ/∂x (t, x), ∂²Ĵ/∂x² (t, x) ) − βĴ(t, x) − γ log π(a|t, x) ] π(a|t, x) da = 0, for all (t, x),

which, together with the terminal condition Ĵ(T, x) = h(x), is the Feynman–Kac PDE (11) for Ĵ. Therefore, it follows from the uniqueness of the solution to (11) that Ĵ ≡ J(·, ·; π). Moreover, it follows from Theorem 6-(iii) that q̂ ≡ q(·, ·, ·; π). Finally, if π(a|t, x) = exp{ (1/γ) q̂(t, x, a) } / ∫_A exp{ (1/γ) q̂(t, x, a′) } da′, then π = Iπ, where I is the map defined in Theorem 2. This in turn implies that π is the optimal policy and, hence, Ĵ is the optimal value function.

Proof of Proposition 8 Proof Divide both sides of (24) by γ, take exponentials, and then integrate over A. Comparing the resulting equation with the exploratory HJB equation (14), we get

γ log ∫_A exp{ (1/γ) q*(t, x, a) } da = 0.

This yields (25). The expression (26) then follows from (13).

Proof of Theorem 9 (i) Let Ĵ = J* and q̂ = q* be the optimal value function and optimal q-function, respectively.
For any π ∈ Π, applying Itô's lemma to the process e^{−βs} J*(s, X_s^π), we obtain for 0 ≤ t < s ≤ T:

e^{−βs} J*(s, X_s^π) − e^{−βt} J*(t, x) + ∫_t^s e^{−βu} [ r(u, X_u^π, a_u^π) − q*(u, X_u^π, a_u^π) ] du
= ∫_t^s e^{−βu} [ ∂J*/∂t (u, X_u^π) + H( u, X_u^π, a_u^π, ∂J*/∂x (u, X_u^π), ∂²J*/∂x² (u, X_u^π) ) − βJ*(u, X_u^π) − q*(u, X_u^π, a_u^π) ] du + ∫_t^s e^{−βu} ∂J*/∂x (u, X_u^π) · σ(u, X_u^π, a_u^π) dW_u
= ∫_t^s e^{−βu} ∂J*/∂x (u, X_u^π) · σ(u, X_u^π, a_u^π) dW_u,

where the du term vanishes due to the definition of q* in (24). Hence (28) is a martingale.

(ii) The second constraint in (27) implies that π̂(a|t, x) := exp{ (1/γ) q̂(t, x, a) } is a probability density function, and q̂(t, x, a) = γ log π̂(a|t, x). So q̂ satisfies the second constraint in (21) with respect to the policy π̂. When (28) is an ({F_s}_{s≥0}, P)-martingale under the given admissible policy π, it follows from Theorem 7-(iii) that Ĵ and q̂ are respectively the value function and the q-function associated with π̂. Then the improved policy is

Iπ̂(a|t, x) := exp{ (1/γ) q̂(t, x, a) } / ∫_A exp{ (1/γ) q̂(t, x, a′) } da′ = exp{ (1/γ) q̂(t, x, a) } = π̂(a|t, x).

Theorem 2 then yields that π̂ is optimal, completing the proof.

Proof of Theorem 10 Proof Let us compute the KL-divergence:

∫_A [ log π′(a|t, x) − (1/γ) H( t, x, a, ∂J/∂x (t, x; π), ∂²J/∂x² (t, x; π) ) ] π′(a|t, x) da
= D_KL( π′(·|t, x) ∥ exp{ (1/γ) H( t, x, ·, ∂J/∂x (t, x; π), ∂²J/∂x² (t, x; π) ) } )
≤ D_KL( π(·|t, x) ∥ exp{ (1/γ) H( t, x, ·, ∂J/∂x (t, x; π), ∂²J/∂x² (t, x; π) ) } )
= ∫_A [ log π(a|t, x) − (1/γ) H( t, x, a, ∂J/∂x (t, x; π), ∂²J/∂x² (t, x; π) ) ] π(a|t, x) da.

Multiplying by −γ and adding the a-independent terms ∂J/∂t (t, x; π) − βJ(t, x; π) to both sides (which is possible because π′ and π are both probability densities), this is equivalent to

∫_A [ q(t, x, a; π) − γ log π′(a|t, x) ] π′(a|t, x) da ≥ ∫_A [ q(t, x, a; π) − γ log π(a|t, x) ] π(a|t, x) da.

Following the same argument as in the proof of Theorem 2, we conclude J(t, x; π′) ≥ J(t, x; π).

Proof of Lemma 11 Proof The first statement has been proved in Jia and Zhou (2022b, Lemma 7). We focus on the proof of the second statement, which is similar to that of Theorem 2.
For the two admissible policies π and π′, we apply Itô's lemma to the value function under π along the process X̃^{π′} to obtain

J(X̃_u^{π′}; π) − J(X̃_t^{π′}; π) + ∫_t^u ∫_A [ r(X̃_s^{π′}, a) − γ log π′(a|X̃_s^{π′}) ] π′(a|X̃_s^{π′}) da ds
= ∫_t^u ∫_A [ H( X̃_s^{π′}, a, ∂J/∂x (X̃_s^{π′}; π), ∂²J/∂x² (X̃_s^{π′}; π) ) − γ log π′(a|X̃_s^{π′}) ] π′(a|X̃_s^{π′}) da ds + ∫_t^u ∂J/∂x (X̃_s^{π′}; π) · σ̃( X̃_s^{π′}, π′(·|X̃_s^{π′}) ) dW_s
≥ ∫_t^u ∫_A [ H( X̃_s^{π′}, a, ∂J/∂x (X̃_s^{π′}; π), ∂²J/∂x² (X̃_s^{π′}; π) ) − γ log π(a|X̃_s^{π′}) ] π(a|X̃_s^{π′}) da ds + ∫_t^u ∂J/∂x (X̃_s^{π′}; π) · σ̃( X̃_s^{π′}, π′(·|X̃_s^{π′}) ) dW_s
= V(π)(u − t) + ∫_t^u ∂J/∂x (X̃_s^{π′}; π) · σ̃( X̃_s^{π′}, π′(·|X̃_s^{π′}) ) dW_s,

where the inequality is due to the same argument as in the proof of Theorem 10, and the last equality is due to the Feynman–Kac formula (38). Therefore, after a localization argument, we have

(1/T) E[ J(X̃_T^{π′}; π) − J(x; π) + ∫_0^T ∫_A [ r(X̃_s^{π′}, a) − γ log π′(a|X̃_s^{π′}) ] π′(a|X̃_s^{π′}) da ds | X̃_0^{π′} = x ] ≥ V(π).

Taking the limit T → ∞, we obtain V(π′) ≥ V(π).

Proof of Theorem 12 Proof The proof is parallel to that of Theorems 6 and 7, and is omitted.

References

L. C. Baird. Advantage updating. Technical report, Wright Laboratory, Wright-Patterson Air Force Base, OH 45433-7301, USA, 1993.

L. C. Baird. Reinforcement learning in continuous time: Advantage updating. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN'94), volume 4, pages 2448–2453. IEEE, 1994.

C. Beck, M. Hutzenthaler, and A. Jentzen. On nonlinear Feynman–Kac formulas for viscosity solutions of semilinear parabolic partial differential equations. Stochastics and Dynamics, page 2150048, 2021.

G. Brunick and S. Shreve. Mimicking an Itô process by a solution of a stochastic differential equation. The Annals of Applied Probability, 23(4):1584–1628, 2013.

K. Doya. Reinforcement learning in continuous time and space. Neural Computation, 12(1):219–245, 2000.

Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338. PMLR, 2016.

J. Fan, Z. Wang, Y. Xie, and Z.
Yang. A theoretical analysis of deep Q-learning. In Learning for Dynamics and Control, pages 486 489. PMLR, 2020. W. H. Fleming and H. M. Soner. Controlled Markov Processes and Viscosity Solutions, volume 25. Springer Science & Business Media, 2006. X. Gao, Z. Q. Xu, and X. Y. Zhou. State-dependent temperature control for Langevin diffusions. SIAM Journal on Control and Optimization, pages 1 26, 2020. S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep Q-learning with modelbased acceleration. In International Conference on Machine Learning, pages 2829 2838. PMLR, 2016. T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861 1870. PMLR, 2018a. T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, et al. Soft actor-critic algorithms and applications. ar Xiv preprint ar Xiv:1812.05905, 2018b. Jia and Zhou X. Han, R. Wang, and X. Y. Zhou. Choquet regularization for reinforcement learning. ar Xiv preprint ar Xiv:2208.08497, 2022. Y. Jia and X. Y. Zhou. Policy evaluation and temporal-difference learning in continuous time and space: A martingale approach. Journal of Machine Learning Research, 23(198): 1 55, 2022a. Y. Jia and X. Y. Zhou. Policy gradient and actor-critic learning in continuous time and space: Theory and algorithms. Journal of Machine Learning Research, 23(275):1 50, 2022b. I. Karatzas and S. Shreve. Brownian Motion and Stochastic Calculus, volume 113. Springer, 2014. J. Kim, J. Shin, and I. Yang. Hamilton-Jacobi deep Q-learning for deterministic continuoustime systems with Lipschitz continuous controls. Journal of Machine Learning Research, 22:206 1, 2021. A. E. Lim and X. Y. Zhou. Mean-variance portfolio selection with random parameters in a complete market. Mathematics of Operations Research, 27(1):101 120, 2002. X. Mao. 
Stochastic Differential Equations and Applications. Woodhead, 2 edition, 2007. ISBN 978-1-904275-34-3. F. S. Melo, S. P. Meyn, and M. I. Ribeiro. An analysis of reinforcement learning with function approximation. In Proceedings of the 25th International Conference on Machine Learning, pages 664 671, 2008. S. P. Meyn and R. L. Tweedie. Stability of Markovian processes III: Foster Lyapunov criteria for continuous-time processes. Advances in Applied Probability, 25(3):518 548, 1993. V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529 533, 2015. V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928 1937. PMLR, 2016. B. O Donoghue. Variational Bayesian reinforcement learning with regret bounds. Advances in Neural Information Processing Systems, 34:28208 28221, 2021. I. Osband, B. Van Roy, D. J. Russo, and Z. Wen. Deep exploration via randomized value functions. Journal of Machine Learning Research, 20(124):1 62, 2019. J. Schulman, X. Chen, and P. Abbeel. Equivalence between policy gradients and soft Qlearning. ar Xiv preprint ar Xiv:1704.06440, 2017. D. Shah and Q. Xie. Q-learning with nearest neighbors. Advances in Neural Information Processing Systems, 31, 2018. Continuous-Time q-Learning Y. Sun. The exact law of large numbers via Fubini extension and characterization of insurable risks. Journal of Economic Theory, 126(1):31 69, 2006. R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 2018. C. Tallec, L. Blier, and Y. Ollivier. Making deep Q-learning methods robust to time discretization. In International Conference on Machine Learning, pages 6096 6104. PMLR, 2019. W. Tang, Y. P. Zhang, and X. Y. Zhou. 
Exploratory HJB equations and their convergence. SIAM Journal on Control and Optimization, 60(6):3191 3216, 2022. M. Uehara, C. Shi, and N. Kallus. A review of off-policy evaluation in reinforcement learning. ar Xiv preprint ar Xiv:2212.06355, 2022. H. Wang and X. Y. Zhou. Continuous-time mean variance portfolio selection: A reinforcement learning framework. Mathematical Finance, 30(4):1273 1308, 2020. H. Wang, T. Zariphopoulou, and X. Y. Zhou. Reinforcement learning in continuous time and space: A stochastic control approach. Journal of Machine Learning Research, 21 (198):1 34, 2020. C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3):279 292, 1992. C. J. C. H. Watkins. Learning from delayed rewards. Ph D thesis, Cambridge University, 1989. W. M. Wonham. On the separation theorem of stochastic control. SIAM Journal on Control, 6(2):312 326, 1968. J. Yong and X. Y. Zhou. Stochastic Controls: Hamiltonian Systems and HJB Equations. New York, NY: Spinger, 1999. X. Y. Zhou. Curse of optimality, and how do we break it. SSRN preprint SSRN 3845462, 2021. X. Y. Zhou and D. Li. Continuous-time mean-variance portfolio selection: A stochastic LQ framework. Applied Mathematics and Optimization, 42(1):19 33, 2000. X. Y. Zhou and G. Yin. Markowitz s mean-variance portfolio selection with regime switching: A continuous-time model. SIAM Journal on Control and Optimization, 42(4):1466 1482, 2003. S. Zou, T. Xu, and Y. Liang. Finite-sample analysis for SARSA with linear function approximation. Advances in Neural Information Processing Systems, 32, 2019.