# q-Learning in Continuous Time

Journal of Machine Learning Research 24 (2023) 1-61. Submitted 7/22; Revised 4/23; Published 5/23.

**Yanwei Jia** (jiayw1993@gmail.com), Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, NT, Hong Kong

**Xun Yu Zhou** (xz2574@columbia.edu), Department of Industrial Engineering and Operations Research & The Data Science Institute, Columbia University, New York, NY 10027, USA

**Editor:** Marc Bellemare

**Abstract.** We study the continuous-time counterpart of Q-learning for reinforcement learning (RL) under the entropy-regularized, exploratory diffusion process formulation introduced by Wang et al. (2020). As the conventional (big) Q-function collapses in continuous time, we consider its first-order approximation and coin the term (little) q-function. This function is related to the instantaneous advantage rate function as well as the Hamiltonian. We develop a q-learning theory around the q-function that is independent of time discretization. Given a stochastic policy, we jointly characterize the associated q-function and value function by martingale conditions of certain stochastic processes, in both on-policy and off-policy settings. We then apply the theory to devise different actor-critic algorithms for solving underlying RL problems, depending on whether or not the density function of the Gibbs measure generated from the q-function can be computed explicitly. One of our algorithms interprets the well-known Q-learning algorithm SARSA, and another recovers a policy gradient (PG) based continuous-time algorithm proposed in Jia and Zhou (2022b). Finally, we conduct simulation experiments to compare the performance of our algorithms with those of PG-based algorithms in Jia and Zhou (2022b) and time-discretized conventional Q-learning algorithms.

**Keywords:** continuous-time reinforcement learning, policy improvement, q-function, martingale, on-policy and off-policy
## 1. Introduction

A recent series of papers, Wang et al. (2020) and Jia and Zhou (2022a,b), aims at laying an overarching theoretical foundation for reinforcement learning (RL) in continuous time with continuous state space (diffusion processes) and possibly continuous action space. Specifically, Wang et al. (2020) study how to explore strategically (instead of blindly) by formulating an entropy-regularized, distribution-valued stochastic control problem for diffusion processes. Jia and Zhou (2022a) solve the policy evaluation (PE) problem, namely learning the value function of a given stochastic policy, by characterizing it as a martingale problem. Jia and Zhou (2022b) investigate the policy gradient (PG) problem, that is, determining the gradient of a learned value function with respect to the current policy, and show that PG is mathematically a simpler PE problem and thus solvable by the martingale approach developed in Jia and Zhou (2022a). Combining these theoretical results naturally leads to various online and offline actor-critic (AC) algorithms for general model-free (up to the diffusion dynamics) RL tasks, where one learns value functions and stochastic policies simultaneously and alternatingly. Many of these algorithms recover and/or interpret some well-known existing RL algorithms for Markov decision processes (MDPs) that were often put forward in a heuristic and ad hoc manner.

PG updates a policy along the gradient ascent direction to improve it; so PG is an instance of the general policy improvement (PI) approach. On the other hand, one of the earliest and most popular methods for PI is Q-learning (Watkins, 1989; Watkins and Dayan, 1992), whose key ingredient is to learn the Q-function, a function of state and action.

© 2023 Yanwei Jia and Xun Yu Zhou. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-0755.html.
The learned Q-function is then maximized over actions at each state to achieve improvement of the current policy. The resulting Q-learning algorithms, such as the Q-table, SARSA, and DQN (deep Q-network), are widely used for their simplicity and effectiveness in many applications.¹ Most importantly though, compared with PG, one of the key advantages of Q-learning is that it works both on-policy and off-policy (Sutton and Barto, 2018, Chapter 6).

The Q-function, by definition, is a function of the current state and action, assuming that the agent takes a particular action at the current time and follows through a given control policy starting from the next time step. Therefore, it is intrinsically a notion in discrete time; that is why Q-learning has been predominantly studied for discrete-time MDPs. In a continuous-time setting, the Q-function collapses to the value function, which is independent of action and hence cannot be used to rank and choose the current actions. Indeed, Tallec et al. (2019) opine that "there is no Q-function in continuous time". On the other hand, one may propose discretizing time to obtain a discretized Q-function and then apply the existing Q-learning algorithms. However, Tallec et al. (2019) show experimentally that this approach is very sensitive to the time discretization and performs poorly with small time steps. Kim et al. (2021) take a different approach: they include the action as a state variable in the continuous-time system by restricting the action process to be absolutely continuous in time with a bounded growth rate. Thus Q-learning becomes a policy evaluation problem with the state-action pair as the new state variable. However, they consider only deterministic dynamic systems and discretize the continuous-time problem upfront.
Crucially, the requirement that actions be absolutely continuous is impractically restrictive, because optimal actions are often only measurable in time and have unbounded variation (such as bang-bang controls) even for deterministic systems.

As a matter of fact, the aforementioned series of papers, Wang et al. (2020) and Jia and Zhou (2022a,b), is characterized by carrying out all the theoretical analysis within the continuous setting and taking observations at discrete times (for computing the total reward over time) only when implementing the algorithms. Some advantages of this approach, compared with discretizing time upfront and then applying existing MDP results, are discussed extensively in Doya (2000); Wang et al. (2020); Jia and Zhou (2022a,b). More importantly, the pure continuous-time approach minimizes or eliminates the impact of time discretization on learning, which, as discussed above, is especially critical for continuous-time Q-learning.

1. Q-learning is typically combined with various forms of function approximation, for example, linear function approximation (Melo et al., 2008; Zou et al., 2019), kernel-based nearest neighbor regression (Shah and Xie, 2018), or deep neural networks (Mnih et al., 2015; Fan et al., 2020). Q-learning is not necessarily restricted to finite action spaces in the literature; for example, Q-learning with continuous action spaces is studied in Gu et al. (2016). However, Duan et al. (2016) report that the standard Q-learning algorithm may become less efficient for continuous action spaces.

Now, if we are to study Q-learning strictly within the continuous-time framework, without time discretization or action restrictions, then the first question is what a proper Q-function should be. When time is discretized, Baird (1993) and Mnih et al.
(2016) define the so-called advantage function, which is the difference in value between taking a particular action versus following a given policy at any state; that is, the difference between the state-action value and the state value. Baird (1993, 1994) and Tallec et al. (2019) notice that such an advantage function can be properly scaled with respect to the size of the time discretization so that its continuous-time limit exists; this limit is called the (instantaneous) advantage rate function and can be learned. Advantage updating performs better than conventional Q-learning, as shown numerically in Tallec et al. (2019). However, these papers again consider only deterministic dynamic systems and do not fully develop a theory of this rate function.

Here in the present paper, we attack the problem directly from the continuous-time perspective. In particular, we rename the advantage rate function as the (little) q-function.² This q-function in the finite-time episodic setting is a function of the time-state-action triple under a given policy, analogous to the conventional Q-function. However, it is a continuous-time notion because it does not depend on time discretization. This feature is a vital advantage for learning algorithm design as it avoids the sensitivity with respect to the observation and intervention frequency (the step size of the time discretization). For the entropy-regularized, exploratory setting for diffusion processes (first formulated in Wang et al. 2020) studied in this paper, the q-function of a given policy turns out to be the Hamiltonian, which includes the infinitesimal generator of the dynamic system and the instantaneous reward function (see Yong and Zhou 1999 for details), plus the temporal dispersion that consists of the time derivative of the value function and the depreciation from discounting.
The paper aims to develop a comprehensive q-learning theory around this q-function and accordingly design alternative RL algorithms other than the PG-based AC algorithms in Jia and Zhou (2022b). There are three main questions. The first is to characterize the q-function. For discrete-time MDPs, the PE method is used to learn both the value function and the Q-function because both functions are characterized by similar Bellman equations. By contrast, in the continuous-time diffusion setting, a Bellman equation is only available for the value function (also known as the Feynman-Kac formula, which yields that the value function solves a second-order linear partial differential equation). The second question is to establish a policy improvement theorem based on the q-function and to design corresponding AC algorithms in a sample/data-driven, model-free fashion. The third question is whether the capability of both on- and off-policy learning, a key advantage of Q-learning, is intact in the continuous-time setting.

2. We use the lower-case letter q to denote this rate function, much in the same way as the typical use of the lower-case letter f to denote a probability density function, which is the first-order derivative of the cumulative distribution function usually denoted by the upper-case letter F.

We provide complete answers to all these questions. First, given a stochastic policy along with its (learned) value function, the corresponding q-function is characterized by the martingale condition of a certain stochastic process with respect to an enlarged filtration (information field) that includes both the original environmental noise and the action randomization. Alternatively, the value function and the q-function can be jointly determined by the martingality of the same process. These results suggest that the martingale perspective for PE and PG in Jia and Zhou (2022a,b) continues to work for q-learning.
In particular, we devise several algorithms to learn these functions in the same way as the martingale conditions are employed to generate PE and PG algorithms in Jia and Zhou (2022a,b). Interestingly, a temporal difference (TD) learning algorithm designed from this perspective recovers and, hence, interprets a version of the well-known SARSA algorithm in the discrete-time Q-learning literature. Moreover, inspired by these continuous-time martingale characterizations, in Appendix A we present martingale conditions for Q-learning of discrete-time MDPs, which are new to the best of our knowledge. This demonstrates that the continuous-time RL study may offer new perspectives even in discrete time.

Second, we prove that a Gibbs sampler with a properly scaled current q-function (independent of time discretization) as the exponent of its density function improves upon the current policy. This PI result translates into a policy-updating algorithm when the normalizing constant of the Gibbs measure can be explicitly computed. Otherwise, if the normalizing constant is unavailable, we prove a stronger PI theorem in terms of the Kullback-Leibler (KL) divergence, analogous to a result for discrete-time MDPs in Haarnoja et al. (2018a). One of the algorithms derived from this theorem happens to recover a PG-based AC algorithm in Jia and Zhou (2022b).

Finally, we prove that the aforementioned martingale conditions hold in both the on-policy and off-policy settings. More precisely, the value function and the q-function associated with a given target policy can be learned based on a dataset generated by either the target policy itself or an arbitrary behavior policy. So q-learning indeed works off-policy as well.

The rest of the paper is organized as follows. In Section 2 we review Wang et al. (2020)'s entropy-regularized, exploratory formulation for RL in continuous time and space, and present some useful preliminary results.
In Section 3, we establish the q-learning theory, including the motivation and definition of the q-function and its on-/off-policy martingale characterizations. q-Learning algorithms are developed, depending on whether or not the normalizing constant of the Gibbs measure is available, in Section 4 and Section 5, respectively. Extension to ergodic tasks is presented in Section 6. In Section 7, we illustrate the proposed algorithms and compare them with PG-based algorithms and time-discretized, conventional Q-learning algorithms on two specific examples with simulated data.³ Finally, Section 8 concludes. The Appendix contains various supplementary materials and proofs of statements in the main text.

3. The code to reproduce our simulation studies is publicly available at https://www.dropbox.com/sh/34cgnupnuaix15l/AAAj2yQYfNCOtPUc1_7VhbkIa?dl=0.

## 2. Problem Formulation and Preliminaries

Throughout this paper, by convention all vectors are column vectors unless otherwise specified, and $\mathbb{R}^k$ is the space of all $k$-dimensional vectors (hence $k\times 1$ matrices). We use $\mathbb{S}^k_{++}$ to denote the set of all $k\times k$ symmetric and positive definite matrices. Given two matrices $A$ and $B$ of the same size, we denote by $A\cdot B$ their inner product, by $|A|$ the Euclidean/Frobenius norm of $A$, by $\det A$ the determinant when $A$ is a square matrix, and by $A^\top$ the transpose of any matrix $A$. For $A\in\mathbb{S}^k_{++}$, we write $A^{1/2} = U D^{1/2} U^\top$, where $A = U D U^\top$ is its eigenvalue decomposition with $U$ an orthogonal matrix, $D$ a diagonal matrix, and $D^{1/2}$ the diagonal matrix whose entries are the square roots of those of $D$. We use $f = f(\cdot)$ to denote the function $f$, and $f(x)$ to denote the value of $f$ at $x$. We denote by $N(\mu, \Sigma)$ the probability density function of the multivariate normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$. Finally, for any stochastic process $X = \{X_s, s\ge 0\}$, we denote by $\{\mathcal{F}^X_s\}_{s\ge 0}$ the natural filtration generated by $X$.
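As a small illustration of the notation just introduced, the matrix square root $A^{1/2} = U D^{1/2} U^\top$ of an $A\in\mathbb{S}^k_{++}$ can be computed directly from its eigenvalue decomposition; a minimal sketch in Python (the particular 2-by-2 matrix is an arbitrary example of ours, not from the paper):

```python
import numpy as np

def spd_sqrt(A):
    """Square root of a symmetric positive definite matrix via the
    eigenvalue decomposition A = U D U^T described in the text:
    A^{1/2} = U D^{1/2} U^T."""
    D, U = np.linalg.eigh(A)              # eigenvalues (vector) and orthonormal eigenvectors
    return U @ np.diag(np.sqrt(D)) @ U.T

A = np.array([[2.0, 1.0], [1.0, 2.0]])    # an arbitrary SPD example
S = spd_sqrt(A)                           # S is symmetric and S @ S recovers A
```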
### 2.1 Classical model-based formulation

For the reader's convenience, we first recall the classical, model-based stochastic control formulation. Let $d, n$ be given positive integers, $T > 0$, and let $b : [0,T]\times\mathbb{R}^d\times\mathcal{A} \mapsto \mathbb{R}^d$ and $\sigma : [0,T]\times\mathbb{R}^d\times\mathcal{A} \mapsto \mathbb{R}^{d\times n}$ be given functions, where $\mathcal{A}\subset\mathbb{R}^m$ is the action/control space. The classical stochastic control problem is to control the state (or feature) dynamics governed by a stochastic differential equation (SDE), defined on a filtered probability space $(\Omega, \mathcal{F}, \mathbb{P}^W; \{\mathcal{F}^W_s\}_{s\ge 0})$ along with a standard $n$-dimensional Brownian motion $W = \{W_s, s\ge 0\}$:

$$dX^a_s = b(s, X^a_s, a_s)\,ds + \sigma(s, X^a_s, a_s)\,dW_s,\quad s\in[0,T], \tag{1}$$

where $a_s$ stands for the agent's action at time $s$. The goal of a stochastic control problem is, for each initial time-state pair $(t,x)\in[0,T)\times\mathbb{R}^d$ of (1), to find the optimal $\{\mathcal{F}^W_s\}_{s\ge 0}$-progressively measurable (continuous-time) sequence of actions $a = \{a_s, t\le s\le T\}$, also called the optimal control or optimal strategy, that maximizes the expected total discounted reward:

$$\mathbb{E}^{\mathbb{P}^W}\left[\int_t^T e^{-\beta(s-t)} r(s, X^a_s, a_s)\,ds + e^{-\beta(T-t)} h(X^a_T)\,\Big|\, X^a_t = x\right], \tag{2}$$

where $r$ is an (instantaneous) reward function (i.e., the expected rate of reward conditioned on time, state, and action), $h$ is the lump-sum reward function applied at the end of the planning period $T$, and $\beta \ge 0$ is a constant discount factor that measures the time-value of the payoff or the impatience level of the agent. Note that in the above, the state process $X^a = \{X^a_s, t\le s\le T\}$ also depends on $(t,x)$. However, to ease notation, here and henceforth we use $X^a$ instead of $X^{t,x,a} = \{X^{t,x,a}_s, t\le s\le T\}$ to denote the solution to the SDE (1) with initial condition $X^a_t = x$ whenever no ambiguity may arise.

The (generalized) Hamiltonian is a function $H : [0,T]\times\mathbb{R}^d\times\mathcal{A}\times\mathbb{R}^d\times\mathbb{R}^{d\times d} \mapsto \mathbb{R}$ associated with the problem (1)-(2), defined as (Yong and Zhou, 1999):

$$H(t, x, a, p, q) = b(t, x, a)\cdot p + \frac{1}{2}\sigma\sigma^\top(t, x, a)\cdot q + r(t, x, a). \tag{3}$$
We make the same assumptions as in Jia and Zhou (2022b) to ensure theoretically the well-posedness of the stochastic control problem (1)-(2).

**Assumption 1** The following conditions for the state dynamics and reward functions hold true:

(i) $b, \sigma, r, h$ are all continuous functions in their respective arguments;

(ii) $b, \sigma$ are uniformly Lipschitz continuous in $x$, i.e., for $\phi\in\{b,\sigma\}$, there exists a constant $C > 0$ such that $|\phi(t,x,a) - \phi(t,x',a)| \le C|x - x'|$ for all $(t,a)\in[0,T]\times\mathcal{A}$ and $x, x'\in\mathbb{R}^d$;

(iii) $b, \sigma$ have linear growth in $x$, i.e., for $\phi\in\{b,\sigma\}$, there exists a constant $C > 0$ such that $|\phi(t,x,a)| \le C(1 + |x|)$ for all $(t,x,a)\in[0,T]\times\mathbb{R}^d\times\mathcal{A}$;

(iv) $r$ and $h$ have polynomial growth in $(x,a)$ and $x$ respectively, i.e., there exist constants $C > 0$ and $\mu \ge 1$ such that $|r(t,x,a)| \le C(1 + |x|^\mu + |a|^\mu)$ and $|h(x)| \le C(1 + |x|^\mu)$ for all $(t,x,a)\in[0,T]\times\mathbb{R}^d\times\mathcal{A}$.

Classical model-based stochastic control theory has been well developed (e.g., Fleming and Soner 2006 and Yong and Zhou 1999) to solve the above problem, assuming that the functional forms of $b, \sigma, r, h$ are all given and known. A centerpiece of the theory is the Hamilton-Jacobi-Bellman (HJB) equation

$$\frac{\partial V^*}{\partial t}(t,x) + \sup_{a\in\mathcal{A}} H\left(t, x, a, \frac{\partial V^*}{\partial x}(t,x), \frac{\partial^2 V^*}{\partial x^2}(t,x)\right) - \beta V^*(t,x) = 0,\quad V^*(T,x) = h(x), \tag{4}$$

where $\frac{\partial V^*}{\partial x}\in\mathbb{R}^d$ is the gradient and $\frac{\partial^2 V^*}{\partial x^2}\in\mathbb{R}^{d\times d}$ is the Hessian matrix. Under proper conditions, the unique solution (possibly in the sense of viscosity solutions) to (4) is the optimal value function of the problem (1)-(2), i.e.,

$$V^*(t,x) = \sup_{\{a_s, t\le s\le T\}} \mathbb{E}^{\mathbb{P}^W}\left[\int_t^T e^{-\beta(s-t)} r(s, X^a_s, a_s)\,ds + e^{-\beta(T-t)} h(X^a_T)\,\Big|\, X^a_t = x\right].$$

Moreover, the following function, which maps a time-state pair to an action:

$$a^*(t,x) = \operatorname*{arg\,sup}_{a\in\mathcal{A}}\, H\left(t, x, a, \frac{\partial V^*}{\partial x}(t,x), \frac{\partial^2 V^*}{\partial x^2}(t,x)\right) \tag{5}$$

is the optimal (non-stationary, feedback) control policy of the problem.
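To make the objects above concrete, the controlled SDE (1) and the reward functional (2) can be approximated numerically by an Euler-Maruyama scheme combined with Monte Carlo averaging. The sketch below is illustrative only: the one-dimensional mean-reverting drift, constant diffusion, quadratic rewards, and the feedback rule a = -x are toy assumptions of ours, not part of the paper's model.

```python
import numpy as np

def simulate_reward(policy, b, sigma, r, h, beta, t0, x0, T,
                    dt=0.01, n_paths=2000, seed=0):
    """Monte Carlo estimate of the expected discounted reward (2) under a
    deterministic feedback rule a = policy(s, x), via an Euler-Maruyama
    discretization of the SDE (1)."""
    rng = np.random.default_rng(seed)
    steps = int(round((T - t0) / dt))
    x = np.full(n_paths, x0, dtype=float)
    total = np.zeros(n_paths)
    for k in range(steps):
        s = t0 + k * dt
        a = policy(s, x)
        total += np.exp(-beta * (s - t0)) * r(s, x, a) * dt     # running reward
        dW = rng.normal(0.0, np.sqrt(dt), n_paths)
        x = x + b(s, x, a) * dt + sigma(s, x, a) * dW           # Euler-Maruyama step
    total += np.exp(-beta * (T - t0)) * h(x)                    # lump-sum reward at T
    return total.mean()

# Toy example (our assumptions): mean-reverting dynamics, quadratic rewards.
J_hat = simulate_reward(
    policy=lambda s, x: -x,
    b=lambda s, x, a: a - x,
    sigma=lambda s, x, a: 0.2 * np.ones_like(x),
    r=lambda s, x, a: -x**2,
    h=lambda x: -x**2,
    beta=0.1, t0=0.0, x0=1.0, T=1.0,
)
```

Since the running and terminal rewards in this toy example are nonpositive, the estimate is a negative number; its magnitude shrinks as the state is driven toward zero.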
In view of the definition of the Hamiltonian (3), the maximization in (5) indicates that, at any given time and state, one ought to maximize a suitably weighted average of the (myopic) instantaneous reward and the risk-adjusted, (long-term) positive impact on the system dynamics. See Yong and Zhou (1999) for a detailed account of this theory and many discussions on its economic interpretations and implications.

### 2.2 Exploratory formulation in reinforcement learning

We now present the RL formulation of the problem to be studied in this paper. In the RL setting, the agent has partial or simply no knowledge about the environment (i.e., the functions $b, \sigma, r, h$). What she can do is trial and error: try a (continuous-time) sequence of actions $a = \{a_s, t\le s\le T\}$, observe the corresponding state process $X^a = \{X^a_s, t\le s\le T\}$ along with both a stream of discounted running rewards $\{e^{-\beta(s-t)} r(s, X^a_s, a_s), t\le s\le T\}$ and a discounted, end-of-period lump-sum reward $e^{-\beta(T-t)} h(X^a_T)$, where $\beta$ is a given, known discount factor, and continuously update and improve her actions based on these observations.⁴

A critical question is how to strategically generate these trial-and-error sequences of actions. The idea is randomization; namely, the agent devises and employs a stochastic policy, which is a probability distribution on the action space, to generate actions according to the current time-state pair. It is important to note that such randomization itself is independent of the underlying Brownian motion $W$, the random source of the original control problem that stands for the environmental noises. Specifically, we assume the probability space is rich enough to support uniformly distributed random variables on $[0,1]$ that are independent of $W$; such a uniform random variable can then be used to generate other random variables with given density functions.
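The construction of random variables from uniforms alluded to above is the standard inverse-CDF transform: if $U$ is uniform on $[0,1]$ and $F$ is a CDF, then $F^{-1}(U)$ has CDF $F$. A minimal sketch, where the exponential target distribution is purely an illustrative choice of ours:

```python
import numpy as np

def sample_via_inverse_cdf(inv_cdf, n, seed=0):
    """Turn i.i.d. uniform draws on [0,1] into samples with a prescribed
    distribution by applying the inverse CDF -- the standard construction
    for generating action randomization from uniform random variables."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=n)          # the uniform randomization source
    return inv_cdf(u)

# Exponential(1): CDF F(x) = 1 - exp(-x), so F^{-1}(u) = -log(1 - u).
samples = sample_via_inverse_cdf(lambda u: -np.log1p(-u), n=100_000)
```

The empirical mean of the samples is close to 1, the mean of the Exponential(1) distribution.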
Let $\{Z_t, 0\le t\le T\}$ be a process of mutually independent copies of a uniform random variable on $[0,1]$, the construction of which requires a suitable extension of the probability space; see Sun (2006) for details. We then further expand the filtered probability space to $(\Omega, \mathcal{F}, \mathbb{P}; \{\mathcal{F}_s\}_{s\ge 0})$, where $\mathcal{F}_s = \mathcal{F}^W_s \vee \sigma(Z_t, 0\le t\le s)$ and the probability measure $\mathbb{P}$, now defined on $\mathcal{F}_T$, is an extension of $\mathbb{P}^W$ (i.e., the two probability measures coincide when restricted to $\mathcal{F}^W_T$).

Let $\pi : (t,x)\in[0,T]\times\mathbb{R}^d \mapsto \pi(\cdot|t,x)\in\mathcal{P}(\mathcal{A})$ be a given (feedback) policy, where $\mathcal{P}(\mathcal{A})$ is a suitable collection of probability density functions.⁵ At each time $s$, an action $a_s$ is generated or sampled from the distribution $\pi(\cdot|s, X_s)$.

Fix a stochastic policy $\pi$ and an initial time-state pair $(t,x)$. Consider the following SDE

$$dX^\pi_s = b(s, X^\pi_s, a^\pi_s)\,ds + \sigma(s, X^\pi_s, a^\pi_s)\,dW_s,\quad s\in[t,T];\quad X^\pi_t = x, \tag{6}$$

defined on $(\Omega, \mathcal{F}, \mathbb{P}; \{\mathcal{F}_s\}_{s\ge 0})$, where $a^\pi = \{a^\pi_s, t\le s\le T\}$ is an $\{\mathcal{F}_s\}_{s\ge 0}$-progressively measurable action process generated from $\pi$. Equation (6) is an SDE with random coefficients, whose well-posedness (i.e., existence and uniqueness of a solution) is established in Yong and Zhou (1999, Chapter 1, Theorem 6.16) under Assumption 1. Given $a^\pi$, the unique solution to (6), $X^\pi = \{X^\pi_s, t\le s\le T\}$, is the sample state process corresponding to $a^\pi$ that solves (1).⁶

Moreover, following Wang et al. (2020), we add an entropy regularizer to the reward function to encourage exploration (represented by the stochastic policy), leading to

$$J(t,x;\pi) = \mathbb{E}^{\mathbb{P}}\left[\int_t^T e^{-\beta(s-t)}\big[r(s, X^\pi_s, a^\pi_s) - \gamma\log\pi(a^\pi_s|s, X^\pi_s)\big]\,ds + e^{-\beta(T-t)} h(X^\pi_T)\,\Big|\, X^\pi_t = x\right], \tag{7}$$

where $\mathbb{E}^{\mathbb{P}}$ is the expectation with respect to both the Brownian motion and the action randomization, and $\gamma \ge 0$ is a given weighting parameter on exploration, also known as the temperature parameter.

Wang et al. (2020) consider the following SDE

$$d\tilde{X}_s = \tilde{b}\big(s, \tilde{X}_s, \pi(\cdot|s, \tilde{X}_s)\big)\,ds + \tilde{\sigma}\big(s, \tilde{X}_s, \pi(\cdot|s, \tilde{X}_s)\big)\,dW_s,\quad s\in[t,T];\quad \tilde{X}_t = x, \tag{8}$$

where

$$\tilde{b}\big(s, x, \pi(\cdot)\big) = \int_{\mathcal{A}} b(s, x, a)\pi(a)\,da,\quad \tilde{\sigma}\big(s, x, \pi(\cdot)\big) = \left(\int_{\mathcal{A}} \sigma\sigma^\top(s, x, a)\pi(a)\,da\right)^{1/2}.$$

Intuitively, based on the law of large numbers, the solution of (8), denoted by $\{\tilde{X}^\pi_s, t\le s\le T\}$, is the limit of the average of the sample trajectories $X^\pi$ over the randomization (i.e., over copies of $\pi$). Rigorously, it follows from the property of Markovian projection due to Brunick and Shreve (2013, Theorem 3.6) that $\tilde{X}^\pi_s$ and $X^\pi_s$ have the same distribution for each $s\in[t,T]$. Hence, the value function (7) is identical to

$$J(t,x;\pi) = \mathbb{E}^{\mathbb{P}^W}\left[\int_t^T e^{-\beta(s-t)} R\big(s, \tilde{X}^\pi_s, \pi(\cdot|s, \tilde{X}^\pi_s)\big)\,ds + e^{-\beta(T-t)} h(\tilde{X}^\pi_T)\,\Big|\, \tilde{X}^\pi_t = x\right], \tag{9}$$

where $R(s, x, \pi) = \int_{\mathcal{A}} [r(s, x, a) - \gamma\log\pi(a)]\pi(a)\,da$. The function $J(\cdot,\cdot;\pi)$ is called the value function of the policy $\pi$, and the task of RL is to find

$$J^*(t,x) = \max_{\pi\in\Pi} J(t,x;\pi), \tag{10}$$

where $\Pi$ stands for the set of admissible (stochastic) policies. The following gives the precise definition of admissible policies.

4. This procedure applies to both the offline and online settings. In the former, the agent can repeatedly try different sequences of actions over the full time period $[0,T]$, record the corresponding state processes and payoffs, and then learn and update the policy based on the resulting dataset. In the latter, the agent updates her actions as she goes, based on all the up-to-date historical observations.

5. Here we assume that the action space $\mathcal{A}$ is continuous and randomization is restricted to those distributions that have density functions. The analysis and results of this paper can be easily extended to the cases of discrete action spaces and/or randomization with probability mass functions.

6. Here, $X^\pi$ also depends on the specific copy $a^\pi$ sampled from $\pi$; however, to ease notation we write $X^\pi$ instead of $X^{\pi, a^\pi}$.
**Definition 1** A policy $\pi = \pi(\cdot|\cdot,\cdot)$ is called admissible if

(i) $\pi(\cdot|t,x)\in\mathcal{P}(\mathcal{A})$, $\operatorname{supp}\pi(\cdot|t,x) = \mathcal{A}$ for every $(t,x)\in[0,T]\times\mathbb{R}^d$, and $\pi(a|t,x) : (t,x,a)\in[0,T]\times\mathbb{R}^d\times\mathcal{A} \mapsto \mathbb{R}$ is measurable;

(ii) $\pi(a|t,x)$ is continuous in $(t,x)$ and uniformly Lipschitz continuous in $x$ in the total variation distance, i.e., $\int_{\mathcal{A}} |\pi(a|t,x) - \pi(a|u,x')|\,da \to 0$ as $(u,x')\to(t,x)$, and there is a constant $C > 0$ independent of $(t,a)$ such that

$$\int_{\mathcal{A}} |\pi(a|t,x) - \pi(a|t,x')|\,da \le C|x - x'|,\quad \forall x, x'\in\mathbb{R}^d;$$

(iii) for any given $\alpha > 0$, the entropy of $\pi$ and its $\alpha$-th moment have polynomial growth in $x$, i.e., there are constants $C = C(\alpha) > 0$ and $\mu' = \mu'(\alpha) \ge 1$ such that $\big|\int_{\mathcal{A}} \log\pi(a|t,x)\,\pi(a|t,x)\,da\big| \le C(1 + |x|^{\mu'})$ and $\int_{\mathcal{A}} |a|^\alpha \pi(a|t,x)\,da \le C(1 + |x|^{\mu'})$ for all $(t,x)$.

Under Assumption 1 along with (i) and (ii) in Definition 1, Jia and Zhou (2022b, Lemma 2) show that the SDE (8) admits a unique strong solution, since its coefficients are locally Lipschitz continuous and have linear growth; see Mao (2007, Chapter 2, Theorem 3.4) for the general result. In addition, the growth condition (iii) guarantees that the new payoff function (9) is finite. More discussions on the conditions required in the above definition can be found in Jia and Zhou (2022b). Moreover, Definition 1 contains conditions only on the policy itself; hence they can be verified.

Note that the problem (8)-(9) is mathematically equivalent to the problem (6)-(7). Nevertheless, they serve different purposes in our study: the former provides a framework for theoretical analysis of the value function, while the latter directly involves observable samples for algorithm design.

### 2.3 Some useful preliminary results

Lemma 2 in Jia and Zhou (2022b) yields that the value function of a given admissible policy $\pi = \pi(\cdot|\cdot,\cdot)\in\Pi$ satisfies the following PDE:

$$\frac{\partial J}{\partial t}(t,x;\pi) + \int_{\mathcal{A}}\left[H\left(t, x, a, \frac{\partial J}{\partial x}(t,x;\pi), \frac{\partial^2 J}{\partial x^2}(t,x;\pi)\right) - \gamma\log\pi(a|t,x)\right]\pi(a|t,x)\,da - \beta J(t,x;\pi) = 0,\quad J(T,x;\pi) = h(x). \tag{11}$$

This is a version of the celebrated Feynman-Kac formula in the current RL setting.
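Condition (ii) of Definition 1 can be checked numerically for concrete policy families. The sketch below evaluates $\int_{\mathcal{A}} |\pi(a|t,x) - \pi(a|t,x')|\,da$ on a grid for a Gaussian feedback policy $\pi(\cdot|t,x) = N(x, \text{sd}^2)$, which satisfies the Lipschitz bound with $C = \sqrt{2/\pi}/\text{sd}$; the Gaussian family and the particular numbers are illustrative assumptions of ours.

```python
import numpy as np

def l1_policy_distance(x1, x2, sd):
    """Integral over actions of |pi(a|x1) - pi(a|x2)| for a Gaussian
    feedback policy pi(.|x) = N(x, sd^2) -- the quantity bounded by
    C|x1 - x2| in condition (ii) of Definition 1."""
    a = np.linspace(min(x1, x2) - 10 * sd, max(x1, x2) + 10 * sd, 40001)
    da = a[1] - a[0]
    p = lambda m: np.exp(-0.5 * ((a - m) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    return np.abs(p(x1) - p(x2)).sum() * da   # grid approximation of the integral

# The Lipschitz bound holds with C = sqrt(2/pi)/sd (tight as |x1 - x2| -> 0).
d = l1_policy_distance(0.0, 0.5, sd=1.0)
```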
Under Assumption 1 as well as the admissibility conditions in Definition 1, the PDE (11) admits a unique viscosity solution among polynomially growing functions (Beck et al., 2021, Theorem 1.1).

On the other hand, Tang et al. (2022, Section 5) derive the following exploratory HJB equation for the problem (8)-(9), satisfied by the optimal value function:

$$\frac{\partial J^*}{\partial t}(t,x) + \sup_{\pi\in\mathcal{P}(\mathcal{A})} \int_{\mathcal{A}}\left[H\left(t, x, a, \frac{\partial J^*}{\partial x}(t,x), \frac{\partial^2 J^*}{\partial x^2}(t,x)\right) - \gamma\log\pi(a)\right]\pi(a)\,da - \beta J^*(t,x) = 0,\quad J^*(T,x) = h(x). \tag{12}$$

Moreover, the optimal (stochastic) policy is a Gibbs measure or Boltzmann distribution:

$$\pi^*(\cdot|t,x) \propto \exp\left\{\frac{1}{\gamma} H\left(t, x, \cdot, \frac{\partial J^*}{\partial x}(t,x), \frac{\partial^2 J^*}{\partial x^2}(t,x)\right)\right\},$$

or, after normalization,

$$\pi^*(a|t,x) = \frac{\exp\left\{\frac{1}{\gamma} H\left(t, x, a, \frac{\partial J^*}{\partial x}(t,x), \frac{\partial^2 J^*}{\partial x^2}(t,x)\right)\right\}}{\int_{\mathcal{A}} \exp\left\{\frac{1}{\gamma} H\left(t, x, a, \frac{\partial J^*}{\partial x}(t,x), \frac{\partial^2 J^*}{\partial x^2}(t,x)\right)\right\}\,da}. \tag{13}$$

This result shows that one should in general use a Gibbs sampler to generate trial-and-error strategies to explore the environment when the regularizer is chosen to be entropy.⁷ In the special case when the system dynamics are linear in the action and the payoffs are quadratic in the action, the Hamiltonian is a quadratic function of the action, and the Gibbs measure thus specializes to a Gaussian under mild conditions.

Plugging (13) into (12) to replace the supremum operator therein leads to the following equivalent form of the exploratory HJB equation:

$$\frac{\partial J^*}{\partial t}(t,x) + \gamma\log\int_{\mathcal{A}} \exp\left\{\frac{1}{\gamma} H\left(t, x, a, \frac{\partial J^*}{\partial x}(t,x), \frac{\partial^2 J^*}{\partial x^2}(t,x)\right)\right\}\,da - \beta J^*(t,x) = 0,\quad J^*(T,x) = h(x). \tag{14}$$

More theoretical properties regarding (14), and consequently the problem (6)-(7), including its well-posedness, can be found in Tang et al. (2022).

The Gibbs measure can also be used to derive a policy improvement theorem. Wang and Zhou (2020) prove such a theorem in the context of continuous-time mean-variance portfolio selection, which is essentially a linear-quadratic (LQ) control problem. The following extends that result to the general case.

**Theorem 2** For any given $\pi\in\Pi$, define

$$\pi'(\cdot|t,x) \propto \exp\left\{\frac{1}{\gamma} H\left(t, x, \cdot, \frac{\partial J}{\partial x}(t,x;\pi), \frac{\partial^2 J}{\partial x^2}(t,x;\pi)\right)\right\}.$$

If $\pi'\in\Pi$, then $J(t,x;\pi') \ge J(t,x;\pi)$.
Moreover, if the map

$$I(\pi) = \frac{\exp\left\{\frac{1}{\gamma} H\left(t, x, \cdot, \frac{\partial J}{\partial x}(t,x;\pi), \frac{\partial^2 J}{\partial x^2}(t,x;\pi)\right)\right\}}{\int_{\mathcal{A}} \exp\left\{\frac{1}{\gamma} H\left(t, x, a, \frac{\partial J}{\partial x}(t,x;\pi), \frac{\partial^2 J}{\partial x^2}(t,x;\pi)\right)\right\}\,da},\quad \pi\in\Pi,$$

has a fixed point $\pi^*$, then $\pi^*$ is the optimal policy.

At this point, Theorem 2 remains a theoretical result that cannot be directly applied to learning procedures, because the Hamiltonian $H\big(t, x, a, \frac{\partial J}{\partial x}(t,x;\pi), \frac{\partial^2 J}{\partial x^2}(t,x;\pi)\big)$ depends on knowledge of the model parameters, which we do not have in the RL context. To develop implementable algorithms, we turn to Q-learning.

## 3. q-Function in Continuous Time: The Theory

This section is the theoretical foundation of the paper, with the analysis entirely in continuous time. We start by defining a Q-function parameterized by a time step $\Delta t > 0$, and then motivate the notion of the q-function, which is independent of $\Delta t$. We further provide martingale characterizations of the q-function.

7. We stress here that the Gibbs sampler is due to the entropy regularizer, and it is possible to have other types of distributions for exploration. For example, Han et al. (2022) propose a family of regularizers that lead to exponential, uniform, and $\varepsilon$-greedy exploratory policies. O'Donoghue (2021) and Gao et al. (2020) study how to choose time- and state-dependent temperature parameters. One may also carry out exploration via randomizing the value function instead of policies; see, e.g., Osband et al. (2019).

### 3.1 Q-function

Tallec et al. (2019) consider a continuous-time MDP and then discretize it upfront to a discrete-time MDP with time discretization $\delta t$. In that setting, the authors argue that there is no Q-function in continuous time because the Q-function collapses to the value function when $\delta t$ is infinitesimally small. Here, we take an entirely different approach that does not involve time discretization.

Incidentally, by comparing the form of the Gibbs measure (13) with that of the widely employed Boltzmann exploration for learning in discrete-time MDPs, Gao et al.
(2020) and Zhou (2021) conjecture that the continuous-time counterpart of the Q-function is the Hamiltonian. Our approach will also provide a rigorous justification of and, indeed, a refinement of this conjecture.

Given $\pi\in\Pi$ and $(t,x,a)\in[0,T)\times\mathbb{R}^d\times\mathcal{A}$, consider a perturbed policy of $\pi$, denoted by $\hat{\pi}$, as follows: it takes the action $a\in\mathcal{A}$ on $[t, t+\Delta t)$ where $\Delta t > 0$, and then follows $\pi$ on $[t+\Delta t, T]$. The corresponding state process $X^{\hat{\pi}}$, given $X^{\hat{\pi}}_t = x$, can be broken into two pieces. On $[t, t+\Delta t)$, it is $X^a$, the solution to

$$dX^a_s = b(s, X^a_s, a)\,ds + \sigma(s, X^a_s, a)\,dW_s,\quad s\in[t, t+\Delta t);\quad X^a_t = x,$$

while on $[t+\Delta t, T]$ it is $X^\pi$ following (6) but with the initial time-state pair $(t+\Delta t, X^a_{t+\Delta t})$. With $\Delta t > 0$ fixed, we introduce the ($\Delta t$-parameterized) Q-function, denoted by $Q_{\Delta t}(t,x,a;\pi)$, to be the expected reward drawn from the perturbed policy $\hat{\pi}$:⁸

$$Q_{\Delta t}(t,x,a;\pi) = \mathbb{E}^{\mathbb{P}}\bigg[\int_t^{t+\Delta t} e^{-\beta(s-t)} r(s, X^a_s, a)\,ds + \int_{t+\Delta t}^T e^{-\beta(s-t)}\big[r(s, X^\pi_s, a^\pi_s) - \gamma\log\pi(a^\pi_s|s, X^\pi_s)\big]\,ds + e^{-\beta(T-t)} h(X^\pi_T)\,\bigg|\, X^{\hat{\pi}}_t = x\bigg].$$

Recall that our formulation (7) includes an entropy regularizer term that incentivizes exploration using stochastic policies. However, in defining $Q_{\Delta t}(t,x,a;\pi)$ above we do not include such a term on $[t, t+\Delta t)$ for $\hat{\pi}$, because a deterministic constant action $a$ is applied there, whose entropy is excluded.

The following proposition provides an expansion of this Q-function in $\Delta t$.

**Proposition 3** We have

$$Q_{\Delta t}(t,x,a;\pi) = J(t,x;\pi) + \left[\frac{\partial J}{\partial t}(t,x;\pi) + H\left(t, x, a, \frac{\partial J}{\partial x}(t,x;\pi), \frac{\partial^2 J}{\partial x^2}(t,x;\pi)\right) - \beta J(t,x;\pi)\right]\Delta t + o(\Delta t).$$

So, this $\Delta t$-based Q-function consists of three terms. The leading term is the value function $J$, which is independent of the action $a$ taken. This is natural because we only apply $a$ for a time window of length $\Delta t$, and hence the impact of this action is only of the order $\Delta t$.

8. In the existing literature, Q-learning with entropy regularization is often referred to as "soft Q-learning", with the associated "soft Q-function".
A review of and more discussions on soft Q-learning in discrete time can be found in Appendix A as well as in Haarnoja et al. (2018b); Schulman et al. (2017). Jia and Zhou Next, the first-order term of t is the Hamiltonian plus the temporal change of the value function (consisting of the derivative of the value function in time and the discount term, both of which would disappear for stationary and non-discounted problems), in which only the Hamiltonian depends on the action a. It is interesting to note that these are the same terms in the Feynman Kac PDE (11), less the entropy term. Finally, the higher-order residual term, o( t), comes from the approximation error of the integral over [t, t+ t] and can be ignored when such errors are aggregated as t gets smaller. Proposition 3 yields that this Q-function also collapses to the value function when t 0, similar to what Tallec et al. (2019) claim. However, we must emphasize that our Q-function is not the same as the one used in Tallec et al. (2019), who discretize time upfront and study the resulting discrete-time MDP. The corresponding Q-function and value function therein exist based on the discrete-time RL theory, denoted respectively by Qδt, V δt where δt is the time discretization size. Then Tallec et al. (2019) argue that the advantage function Aδt = Qδt V δt can be properly scaled by δt, leading to the existence of limδt 0 Aδt δt . In short, the Q-function and the resulting algorithms in Tallec et al. (2019) are still in the realm of discrete-time MDPs. By contrast, Q t(t, x, a; π) introduced here is a continuous-time notion. It is the value function of a policy that applies a constant action in [t, t + t] and follows π thereafter. So t here is a parameter representing the length of period in which the constant action is taken, not the time discretization size as in Tallec et al. (2019). Most importantly, our Q-function is just a tool used to introduce the qfunction, the centerpiece of this paper. 
Having said all this, we recognize that the limit $\lim_{\delta t\to 0} \frac{A^{\delta t}}{\delta t}$ identified by Tallec et al. (2019) is the counterpart of our q-function in their setting, and that we arrive at the same object in two different ways (discretizing upfront and then taking $\delta t\to 0$, versus a genuinely continuous-time analysis).

3.2 q-function

Since the leading term in the Q-function $Q_{\Delta t}(t, x, a; \pi)$ coincides with the value function of $\pi$ and hence cannot be used to rank actions $a$, we focus on the first-order approximation, which gives an infinitesimal state–action value. Motivated by this, we define the q-function as follows:

Definition 4 The q-function of the problem (6)–(7) associated with a given policy $\pi\in\Pi$ is defined as

$$q(t,x,a;\pi) = \frac{\partial J}{\partial t}(t,x;\pi) + H\Big(t,x,a,\frac{\partial J}{\partial x}(t,x;\pi),\frac{\partial^2 J}{\partial x^2}(t,x;\pi)\Big) - \beta J(t,x;\pi),\quad (t,x,a)\in[0,T]\times\mathbb{R}^d\times A. \tag{16}$$

Clearly, this function is the first-order derivative of the Q-function with respect to $\Delta t$, as an immediate consequence of Proposition 3:

Corollary 5 We have

$$q(t,x,a;\pi) = \lim_{\Delta t\to 0}\frac{Q_{\Delta t}(t,x,a;\pi) - J(t,x;\pi)}{\Delta t}.$$

Some remarks are in order. First, the q-function is a function of the time–state–action triple under a given policy, analogous to the conventional Q-function for MDPs; yet it is a continuous-time notion because it does not depend on any time discretization. This feature is a vital advantage for learning-algorithm design, as Tallec et al. (2019) point out that the performance of RL algorithms is very sensitive to the time discretization. Second, the q-function is related to the advantage function in the MDP literature (e.g., Baird 1993; Mnih et al. 2016), which describes the difference between the state–action value and the state value. Here, the q-function reflects the instantaneous advantage rate of a given action at a given time–state pair under a given policy.
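The first-order relation in Corollary 5 can be illustrated numerically. The following sketch uses hypothetical stand-in functions (none of them from the paper's model): a toy value function `J`, a toy q-function `q`, and a Q-function built to satisfy the expansion of Proposition 3 with an explicit $o(\Delta t)$ term; the difference quotient then recovers `q` as the step shrinks.

```python
# Toy illustration of Corollary 5: the q-function is the first-order
# coefficient of the Delta-t-parameterized Q-function.
# All functions below are hypothetical stand-ins, not the paper's model.

def J(t, x):
    # stand-in value function
    return x**2 * (1.0 - t)

def q(t, x, a):
    # stand-in q-function (instantaneous advantage rate)
    return -(a - x)**2 + 0.5

def Q(t, x, a, dt):
    # a Q-function consistent with Proposition 3:
    # Q_dt = J + q*dt + o(dt), with o(dt) = 0.3*dt**2 here
    return J(t, x) + q(t, x, a) * dt + 0.3 * dt**2

t, x, a = 0.2, 1.5, 0.7
for dt in (1e-1, 1e-2, 1e-3):
    rate = (Q(t, x, a, dt) - J(t, x)) / dt   # difference quotient
    print(dt, rate)                          # approaches q(t, x, a) as dt -> 0
```

The printed rates differ from `q(t, x, a)` by exactly the `0.3*dt` residue, mirroring how the $o(\Delta t)$ error vanishes in the limit.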
We also notice that in (16) only the Hamiltonian depends on a; hence the improved policy presented in Theorem 2 can also be expressed in terms of the q-function: π ( |t, x) exp 1 γ H t, x, , J x(t, x; π), 2J x2 (t, x; π) exp 1 γ q(t, x, ; π) . Consequently, if we can learn the q-function q( , , ; π) under any policy π, then it follows from Theorem 2 that we can improve π by implementing a Gibbs measure over the q-values, analogous to, say, ε-greedy policy in classical Q-learning (Sutton and Barto, 2018, Chapter 6). Finally, the analysis so far justifies the aforementioned conjecture about the proper continuous-time version of the Q-function, and provides a theoretical interpretation to the widely used Boltzmann exploration in RL. The main theoretical results of this paper are martingale characterizations of the qfunction, which can in turn be employed to devise algorithms for learning crucial functions including the q-function, in the same way as in applying the martingale approach for PE (Jia and Zhou, 2022a).9 The following first result characterizes the q-function associated with a given policy π, assuming that its value function has already been learned and known. Theorem 6 Let a policy π Π, its value function J and a continuous function ˆq : [0, T] Rd A R be given. Then (i) ˆq(t, x, a) = q(t, x, a; π) for all (t, x, a) [0, T] Rd A if and only if for all (t, x) [0, T] Rd, the following process e βs J(s, Xπ s ; π) + Z s t e βu[r(u, Xπ u , aπ u ) ˆq(u, Xπ u , aπ u )]du (18) is an ({Fs}s 0, P)-martingale, where {Xπ s , t s T} is the solution to (6) under π with Xπ t = x. (ii) If ˆq(t, x, a) = q(t, x, a; π) for all (t, x, a) [0, T] Rd A, then for any π Π and for all (t, x) [0, T] Rd, the following process e βs J(s, Xπ s ; π) + Z s t e βu[r(u, Xπ u , aπ u ) ˆq(u, Xπ u , aπ u )]du (19) is an ({Fs}s 0, P)-martingale, where {Xπ s , t s T} is the solution to (6) under π with initial condition Xπ t = x. 9. 
Some discrete-time counterparts of these results are presented in Appendix A. Jia and Zhou (iii) If there exists π Π such that for all (t, x) [0, T] Rd, (19) is an ({Fs}s 0, P)- martingale where Xπ t = x, then ˆq(t, x, a) = q(t, x, a; π) for all (t, x, a) [0, T] Rd A. Moreover, in any of the three cases above, the q-function satisfies Z q(t, x, a; π) γ log π(a|t, x) π(a|t, x)da = 0, (t, x) [0, T] Rd. (20) Jia and Zhou (2022a) unify and characterize the learning of value function (i.e., PE) by martingale conditions of certain stochastic processes. Theorem 6 shows that learning the q-function again boils down to maintaining the martingality of the processes (18) or (19). However, there are subtle differences between these martingale conditions. Jia and Zhou (2022a) consider only deterministic policies so the martingales therein are with respect to ({FW s }s 0, PW ). Jia and Zhou (2022b) extends the policy evaluation to include stochastic policies but the related martingales are in terms of the averaged state Xπ and hence also with respect to ({FW s }s 0, PW ). By contrast, the martingality in Theorem 6 is with respect to ({Ft}t 0, P), where the filtration {Ft}t 0 is the enlarged one that includes the randomness for generating actions. Note that the filtration determines the class of test functions necessary to characterize a martingale through the so-called martingale orthogonality conditions (Jia and Zhou, 2022a). So the above martingale conditions suggest that one ought to choose test functions dependent of the past and current actions when designing algorithms to learn the q-function. Moreover, the q-function can be interpreted as a compensator to guarantee the martingality over this larger information field.10 Theorem 6-(i) informs on-policy learning, namely, learning the q-function of the given policy π, called the target policy, based on data {(s, Xπ s , aπ s ), t s T} generated by π itself. 
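The policy-improvement step described above, a Gibbs measure over learned q-values, can be sketched on a discretized one-dimensional action grid. The temperature, grid, and q-values below are illustrative, not taken from the paper; the point is only that normalizing $\exp\{q/\gamma\}$ yields a density concentrated on high-q actions.

```python
import numpy as np

# Sketch: forming the improved (Gibbs) policy from learned q-values on a
# discretized action grid.  Temperature and q-values are illustrative.

gamma = 0.5                        # temperature parameter
actions = np.linspace(-2.0, 2.0, 401)
da = actions[1] - actions[0]
q_vals = -(actions - 0.8)**2       # hypothetical q(t, x, .; pi) at a fixed (t, x)

logits = q_vals / gamma
logits -= logits.max()             # subtract max for numerical stability
density = np.exp(logits)
density /= density.sum() * da      # normalize into a density on the grid

# the density integrates to one and peaks at the q-maximizing action
print(density.sum() * da, actions[density.argmax()])
```

Subtracting the maximum logit before exponentiating is the standard guard against overflow when $\gamma$ is small relative to the spread of q-values.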
Nevertheless, one of the key advantages of classical Q-learning for MDPs is that it also works for off-policy learning, namely, learning the q-function of a given target policy π based on data generated by a different admissible policy π , called a behavior policy. On-policy and off-policy reflect two different learning settings depending on the availability of data and/or the choice of a learning agent. When data generation can be controlled by the agent, conducting on-policy learning is possible although she could still elect off-policy learning. By contrast, when data generation is not controlled by the agent and she has to rely on data under other policies, then it becomes off-policy learning. Theorem 6-(ii) and -(iii) stipulate that our q-learning in continuous time can also be off-policy. Finally, (20) is a consistency condition to uniquely identify the q-function. If we only consider deterministic policies of the form a( , ) in which case γ = 0 and π( |t, x) degenerates into the Dirac measure concentrating on a(t, x), then (20) reduces to q(t, x, a(t, x); a) = 0, which corresponds to equation (23) in Tallec et al. (2019). The following result strengthens Theorem 6, in the sense that it characterizes the value function and the q-function under a given policy simultaneously. 10. More precisely, R t 0 e βsq(s, Xπ s , aπ s ; π)ds is the compensator of e βt J(t, Xπ t ; π) + R t 0 e βsr(s, Xπ s , aπ s )ds. Recall that a compensator of an adapted stochastic process Yt is a predictable process At such that Yt At is a local martingale. Intuitively speaking, the compensator is the drift part of a diffusion process, extracting the trend of the process. Continuous-Time q-Learning Theorem 7 Let a policy π Π, a function ˆJ C1,2 [0, T) Rd C [0, T] Rd with polynomial growth, and a continuous function ˆq : [0, T] Rd A R be given satisfying ˆJ(T, x) = h(x), Z ˆq(t, x, a) γ log π(a|t, x) π(a|t, x)da = 0, (t, x) [0, T] Rd. 
(21)

(i) $\hat J$ and $\hat q$ are respectively the value function and the q-function associated with $\pi$ if and only if for all $(t, x)\in[0, T]\times\mathbb{R}^d$, the following process

$$e^{-\beta s}\hat J(s,X^\pi_s) + \int_t^s e^{-\beta u}\big[r(u,X^\pi_u,a^\pi_u) - \hat q(u,X^\pi_u,a^\pi_u)\big]\mathrm{d}u \tag{22}$$

is an $(\{\mathcal{F}_s\}_{s\ge 0}, \mathbb{P})$-martingale, where $\{X^\pi_s, t\le s\le T\}$ is the solution to (6) with $X^\pi_t = x$.

(ii) If $\hat J$ and $\hat q$ are respectively the value function and the q-function associated with $\pi$, then for any $\pi'\in\Pi$ and all $(t, x)\in[0, T]\times\mathbb{R}^d$, the following process

$$e^{-\beta s}\hat J(s,X^{\pi'}_s) + \int_t^s e^{-\beta u}\big[r(u,X^{\pi'}_u,a^{\pi'}_u) - \hat q(u,X^{\pi'}_u,a^{\pi'}_u)\big]\mathrm{d}u \tag{23}$$

is an $(\{\mathcal{F}_s\}_{s\ge 0}, \mathbb{P})$-martingale, where $\{X^{\pi'}_s, t\le s\le T\}$ is the solution to (6) under $\pi'$ with initial condition $X^{\pi'}_t = x$.

(iii) If there exists $\pi'\in\Pi$ such that for all $(t, x)\in[0, T]\times\mathbb{R}^d$, (23) is an $(\{\mathcal{F}_s\}_{s\ge 0}, \mathbb{P})$-martingale where $X^{\pi'}_t = x$, then $\hat J$ and $\hat q$ are respectively the value function and the q-function associated with $\pi$.

Moreover, in any of the three cases above, if it further holds that $\pi(a|t,x) = \frac{\exp\{\frac{1}{\gamma}\hat q(t,x,a)\}}{\int_A \exp\{\frac{1}{\gamma}\hat q(t,x,a)\}\mathrm{d}a}$, then $\pi$ is the optimal policy and $\hat J$ is the optimal value function.

Theorem 7 characterizes the value function and the q-function in terms of a single martingale condition, in each of the on-policy and off-policy settings, which will be the foundation for designing the learning algorithms in this paper. The conditions in (21) ensure that $\hat J$ matches the correct terminal payoff function and that $\hat q$ is the soft q-function accounting for the entropy regularizer. In particular, if the policy is the Gibbs measure generated by $\frac{1}{\gamma}\hat q$, then $\hat J$ and $\hat q$ are respectively the value function and the q-function under the optimal policy. Finally, note the (subtle) difference between Theorem 7 and Theorem 6.
There may be multiple ( ˆJ, ˆq) pairs satisfying the martingale conditions of (22) or (23) in Theorem 7; so the conditions in (21) are required for identifying the correct value function and q-function.11 By contrast, these conditions are implied if the correct value function has already been known and given, as in Theorem 6. 11. If the terminal condition ˆJ(T, x) = h(x) is changed to ˆJ(T, x) = ˆh(x), then ( ˆJ, ˆq) satisfying the martingale condition (22) or (23) would be the value function and q-function corresponding to a different learning task with the pair of running and terminal reward functions being (r, ˆh). If the normalization condition on ˆq is missing, say R A ˆq(t, x, a) γ log π(a|t, x) π(a|t, x)da = ˆr(t, x) holds instead, then ( ˆJ, ˆq) would be the value function and q-function corresponding to the learning task with the running and terminal reward functions (r ˆr, h). Jia and Zhou 3.3 Optimal q-function We now focus on the q-function associated with the optimal policy π in (13). Based on Definition 4, it is defined as q (t, x, a) = J t (t, x) + H t, x, a, J x (t, x), 2J x2 (t, x) βJ (t, x), (24) where J is the optimal value function satisfying the exploratory HJB equation (14). Proposition 8 We have Z γ q (t, x, a)}da = 1, (25) for all (t, x), and consequently the optimal policy π is π (a|t, x) = exp{1 γ q (t, x, a)}. (26) As seen from its proof (in Appendix C), Proposition 8 is an immediate consequence of the exploratory HJB equation (14). Indeed, (25) is equivalent to (14) when viewed as an equation for J due to (24). However, in terms of q , satisfying (25) is only a necessary condition or a minimum requirement for being the optimal q-function, and by no means sufficient, in the current RL setting. This is because in the absence of knowledge about the primitives b, σ, r, h, we are unable to determine J from q by their relationship (24) which can be viewed as a PDE for J . 
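The role of the normalization conditions, (20) under a given policy, and its footnote-11 variant, in pinning down the additive level of a candidate q-function can be seen numerically. In the sketch below, a one-dimensional Gaussian policy and a q-shape are illustrative choices (not the paper's model); the constant `c` is solved from condition (20), after which the condition's residual vanishes.

```python
import numpy as np

# Sketch: the consistency condition (20) determines the additive level of
# a candidate q-function under a given policy.  Policy and q-shape are
# illustrative (1-d action, Gaussian policy).

gamma = 0.4
mu, sigma = 0.3, 0.7                       # hypothetical policy N(mu, sigma^2)
a = np.linspace(mu - 8*sigma, mu + 8*sigma, 20001)
da = a[1] - a[0]
pi = np.exp(-(a - mu)**2 / (2*sigma**2)) / (sigma * np.sqrt(2*np.pi))

q_shape = -(a - mu)**2                     # candidate q up to a constant c
# condition (20):  integral of (q_shape + c - gamma*log pi) * pi da = 0
c = np.sum((gamma*np.log(pi) - q_shape) * pi) * da
q_hat = q_shape + c

residual = np.sum((q_hat - gamma*np.log(pi)) * pi) * da
print(c, residual)                         # residual is ~ 0
```

Without such a condition, any constant shift of `q_hat` would satisfy the same martingale property, which is exactly the indeterminacy the theorem's constraints rule out.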
To fully characterize $J^*$ as well as $q^*$, we have to resort to a martingale condition, as stipulated in the following result.

Theorem 9 Let a function $\hat J^* \in C^{1,2}([0, T)\times\mathbb{R}^d)\cap C([0, T]\times\mathbb{R}^d)$ with polynomial growth and a continuous function $\hat q^*: [0, T]\times\mathbb{R}^d\times A\to\mathbb{R}$ be given satisfying

$$\hat J^*(T,x) = h(x),\qquad \int_A \exp\Big\{\frac{1}{\gamma}\hat q^*(t,x,a)\Big\}\mathrm{d}a = 1,\quad (t,x)\in[0,T]\times\mathbb{R}^d. \tag{27}$$

(i) If $\hat J^*$ and $\hat q^*$ are respectively the optimal value function and the optimal q-function, then for any $\pi\in\Pi$ and all $(t, x)\in[0, T]\times\mathbb{R}^d$, the following process

$$e^{-\beta s}\hat J^*(s,X^\pi_s) + \int_t^s e^{-\beta u}\big[r(u,X^\pi_u,a^\pi_u) - \hat q^*(u,X^\pi_u,a^\pi_u)\big]\mathrm{d}u \tag{28}$$

is an $(\{\mathcal{F}_s\}_{s\ge 0}, \mathbb{P})$-martingale, where $\{X^\pi_s, t\le s\le T\}$ is the solution to (6) under $\pi$ with $X^\pi_t = x$. Moreover, in this case, $\hat\pi^*(a|t,x) = \exp\{\frac{1}{\gamma}\hat q^*(t,x,a)\}$ is the optimal policy.

(ii) If there exists one $\pi\in\Pi$ such that for all $(t, x)\in[0, T]\times\mathbb{R}^d$, (28) is an $(\{\mathcal{F}_s\}_{s\ge 0}, \mathbb{P})$-martingale, then $\hat J^*$ and $\hat q^*$ are respectively the optimal value function and the optimal q-function.

Theorem 9 lays a foundation for off-policy learning without having to go through iterative policy improvement. We emphasize that to learn the optimal policy and the optimal q-function, it is infeasible to conduct on-policy learning, because $\pi^*$ is unknown and hence the constraints (21) can be neither checked nor enforced. The new constraints in (27) no longer depend on policies; instead, they give basic requirements on the candidate value function and q-function. Maintaining the martingality of (28) under any admissible policy is equivalent to the optimality of the candidate value function and q-function. Moreover, Theorem 9-(ii) suggests that we do not need data generated by all admissible policies; those generated by any single policy suffice. For example, we could take $\pi(a|t, x) = \exp\{\frac{1}{\gamma}\hat q^*(t, x, a)\}$ as the behavior policy. Finally, let us remark that Theorems 7 and 9 are independent of each other, in the sense that neither implies the other.
One may design different algorithms from them depending on different cases and needs. 4. q-Learning Algorithms When Normalizing Constant Is Available This section and the next discuss algorithm design based on the theoretical results in the previous section. Theorems 7 and 9 provide the theoretical foundation for designing both on-policy and off-policy algorithms to simultaneously learn and update the value function (critic) and the policy (actor) with proper function approximations of J and q. In this section, we present these actor critic, q-learning algorithms when the normalizing constant involved in the Gibbs measure is available or computable.12 4.1 q-Learning algorithms Given a policy π, to learn its associated value function and q-function, we denote by Jθ and qψ the parameterized function approximators that satisfy the two constraints: Jθ(T, x) = h(x), Z qψ(t, x, a) γ log π(a|t, x) π(a|t, x)da = 0, (29) for all θ Θ RLθ, ψ Ψ RLψ. Then Theorem 2 suggests that the policy can be improved by πψ(a|t, x) = exp{ 1 γ qψ(t, x, a)} R γ qψ(t, x, a)}da, (30) while by assumption the normalizing constant R γ qψ(t, x, a)}da for any ψ Ψ can be explicitly computed. Next, we can use the data generated by πψ and apply the martingale condition in Theorem 7 to learn its value function and q-function, leading to an iterative actor critic algorithm. 12. Conventional Q-learning or SARSA is often said to be value-based in which the Q-function is a state action value function serving as a critic. In q-learning, as Theorem 6 stipulates, the q-function is uniquely determined by the value function J as its compensator. Hence, q plays a dual role here: it is both a critic (as it can be derived endogenously from the value function) and an actor (as it derives an improved policy). This is why we call the q-learning algorithms actor critic, if with a slight abuse of the term because the actor here is not purely exogenous as with, say, policy gradient. 
Alternatively, we may choose to directly learn the optimal value function and q-function based on Theorem 9. In this case, the approximators $J^\theta$ and $q^\psi$ should satisfy

$$J^\theta(T,x) = h(x),\qquad \int_A \exp\Big\{\frac{1}{\gamma} q^\psi(t,x,a)\Big\}\mathrm{d}a = 1. \tag{31}$$

Note that if the policy $\pi$ in (29) is taken in the form (30), then the second equation in (31) is automatically satisfied. Henceforth in this section, we focus on deriving algorithms based on Theorem 9 and hence always impose the constraints (31). In this case, any approximator of the q-function directly gives that of the policy via $\pi^\psi(a|t, x) = \exp\{\frac{1}{\gamma}q^\psi(t, x, a)\}$; thereby we avoid learning the q-function and the policy separately. Moreover, making use of (31) typically results in a more specific parametric form of the q-function approximator $q^\psi$, potentially facilitating more efficient learning. Here is an example.

Example 1 When the system dynamics are linear in the action $a$ and the reward is quadratic in $a$, the Hamiltonian, and hence the q-function, is quadratic in $a$. So we can parameterize

$$q^\psi(t,x,a) = -\tfrac{1}{2}\big(a - q^\psi_1(t,x)\big)^\top q^\psi_2(t,x)\big(a - q^\psi_1(t,x)\big) + q^\psi_0(t,x),\quad (t,x,a)\in[0,T]\times\mathbb{R}^d\times\mathbb{R}^m,$$

with $q^\psi_1(t, x)\in\mathbb{R}^m$, $q^\psi_2(t, x)\in\mathbb{S}^m_{++}$, and $q^\psi_0(t, x)\in\mathbb{R}$. The corresponding policy is a multivariate normal distribution for which the normalizing constant can be computed:

$$\pi^\psi(\cdot\,|t,x) = N\Big(q^\psi_1(t,x),\ \gamma\,\big(q^\psi_2(t,x)\big)^{-1}\Big),$$

with entropy $-\frac{1}{2}\log\det q^\psi_2(t,x) + \frac{m}{2}\log 2\pi e\gamma$. The second constraint on $q^\psi$ in (31) then yields

$$q^\psi_0(t,x) = \frac{\gamma}{2}\log\det q^\psi_2(t,x) - \frac{m\gamma}{2}\log 2\pi\gamma.$$

This in turn gives rise to the more specific parametric form

$$q^\psi(t,x,a) = -\tfrac{1}{2}\big(a - q^\psi_1(t,x)\big)^\top q^\psi_2(t,x)\big(a - q^\psi_1(t,x)\big) + \frac{\gamma}{2}\log\det q^\psi_2(t,x) - \frac{m\gamma}{2}\log 2\pi\gamma.$$

The next step in algorithm design is to update $(\theta, \psi)$ by enforcing the martingale condition stipulated in Theorem 9 and applying the techniques developed in Jia and Zhou (2022a). A number of algorithms can be developed based on two types of objectives: to minimize the martingale loss function or to satisfy the martingale orthogonality conditions.
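The normalization in Example 1 can be checked numerically in one dimension ($m = 1$). The coefficient values below are illustrative; with the constant term fixed by the constraint (31), $\exp\{q^\psi/\gamma\}$ is itself a Gaussian density and integrates to one.

```python
import numpy as np

# Sketch of Example 1 with m = 1: a quadratic-in-action q-function whose
# constant term is fixed by the normalization constraint (31), so that
# exp(q/gamma) is the N(q1, gamma/q2) density.  Coefficients are illustrative.

gamma = 0.3
q1, q2 = 0.5, 2.0                  # hypothetical mean and curvature coefficients

def q_psi(a):
    # constant term chosen so that the integral of exp(q/gamma) over a is 1
    q0 = 0.5*gamma*np.log(q2) - 0.5*gamma*np.log(2*np.pi*gamma)
    return -0.5*q2*(a - q1)**2 + q0

a = np.linspace(q1 - 10.0, q1 + 10.0, 200001)
da = a[1] - a[0]
Z = np.sum(np.exp(q_psi(a)/gamma)) * da
print(Z)    # ~ 1: exp(q/gamma) is a probability density in a
```

This is why, under the constraint (31), the q-function approximator doubles as the policy: sampling an action reduces to drawing from $N(q^\psi_1, \gamma (q^\psi_2)^{-1})$.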
The latter calls for solving a system of equations for which there are further two different techniques: applying stochastic approximation as in the classical temporal difference (TD) algorithms or minimizing a quadratic function in lieu of the system of equations as in the generalized methods of moment (GMM). For reader s convenience, we summarize these methods in the q-learning context below. Minimize the martingale loss function: e β(T t)h(Xπψ T ) Jθ(t, Xπψ t ) + Z T t e β(s t)[r(s, Xπψ s , aπψ s ) qψ(s, Xπψ s , aπψ s )]ds Continuous-Time q-Learning This method is intrinsically offline because the loss function involves the whole horizon [0, T]; however we are free to choose optimization algorithms to update the parameters (θ, ψ). For example, we can apply stochastic gradient decent to update θ (t, Xπψ t )Gt:T dt t e β(s t) qψ ψ (s, Xπψ s , aπψ s )ds Gt:T dt, where Gt:T = e β(T t)h(Xπψ T ) Jθ(t, Xπψ t )+ R T t e β(s t)[r(s, Xπψ s , aπψ s ) qψ(s, Xπψ s , aπψ s )]ds, and αθ and αψ are the learning rates. We present Algorithm 1 based on this updating rule. Note that this algorithm is analogous to the classical gradient Monte Carlo method or TD(1) for MDPs (Sutton and Barto, 2018) because full sample trajectories are used to compute gradients. Choose two different test functions ξt and ζt that are both Ft-adapted vector-valued stochastic processes, and consider the following system of equations in (θ, ψ): 0 ξt h d Jθ(t, Xπψ t ) + r(t, Xπψ t , aπψ t )dt qψ(t, Xπψ t , aπψ t )dt βJθ(t, Xπψ t )dt i = 0, 0 ζt h d Jθ(t, Xπψ t ) + r(t, Xπψ t , aπψ t )dt qψ(t, Xπψ t , aπψ t )dt βJθ(t, Xπψ t )dt i = 0. 
To solve these equations iteratively, we use stochastic approximation to update (θ, ψ) either offline by13 0 ξt h d Jθ(t, Xπψ t ) + r(t, Xπψ t , aπψ t )dt qψ(t, Xπψ t , aπψ t )dt βJθ(t, Xπψ t )dt i , 0 ζt h d Jθ(t, Xπψ t ) + r(t, Xπψ t , aπψ t )dt qψ(t, Xπψ t , aπψ t )dt βJθ(t, Xπψ t )dt i , or online by θ θ + αθξt h d Jθ(t, Xπψ t ) + r(t, Xπψ t , aπψ t )dt qψ(t, Xπψ t , aπψ t )dt βJθ(t, Xπψ t )dt i , ψ ψ + αψζt h d Jθ(t, Xπψ t ) + r(t, Xπψ t , aπψ t )dt qψ(t, Xπψ t , aπψ t )dt βJθ(t, Xπψ t )dt i . Typical choices of test functions are ξt = Jθ θ (t, Xπψ t ), ζt = qψ ψ (t, Xπψ t , aπψ t ) yielding algorithms that would be closest to TD-style policy evaluation and Q-learning (e.g., in Tallec et al. 2019) and at the same time belong to the more general semi-gradient methods (Sutton and Barto, 2018). We present the online and offline q-learning algorithms, Algorithms 2 and 3 respectively, based on these test functions. We must, however, 13. Here, when implementing in a computer program, d Jθ is the timestep-wise difference in Jθ when time is discretized, or indeed it is the temporal-difference (TD) of the value function approximator Jθ. Jia and Zhou stress that testing against these two specific functions is theoretically not sufficient to guarantee the martingale condition. Using them with a rich, large-dimensional parametric family for J and q makes approximation to the martingale condition finer and finer. Moreover, the corresponding stochastic approximation algorithms are not guaranteed to converge in general, and the test functions have to be carefully selected (Jia and Zhou, 2022a) depending on (J, q). Some of the different test functions leading to different types of algorithms in the context of policy evaluation are discussed in Jia and Zhou (2022a). 
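The online updating rule with the semi-gradient test functions $\xi_t = \frac{\partial J^\theta}{\partial\theta}$ and $\zeta_t = \frac{\partial q^\psi}{\partial\psi}$ can be sketched for linear function approximators. Everything below is illustrative: the features, the simulated dynamics, the reward rate, and the coefficients are hand-made stand-ins rather than the paper's setting, and no convergence is claimed (consistent with the caveat above about these specific test functions).

```python
import numpy as np

# Minimal sketch of the online (TD-style) update: one incremental step of
# theta and psi per observed transition, using xi = dJ/dtheta and
# zeta = dq/dpsi for linear approximators on hand-crafted features.
# Environment, features, and coefficients are all illustrative.

rng = np.random.default_rng(0)

def feat_J(t, x):                  # features for J^theta(t, x)
    return np.array([1.0, t, x, x*x])

def feat_q(t, x, a):               # features for q^psi(t, x, a)
    return np.array([1.0, a, a*a, x*a])

theta = np.zeros(4)
psi = np.zeros(4)
beta, dt, alpha = 0.1, 0.01, 0.05

x, t = 0.0, 0.0
for _ in range(200):
    a = rng.normal()                                   # behavior action
    r = -(x*x + a*a)                                   # illustrative reward rate
    x_next = x + a*dt + 0.1*np.sqrt(dt)*rng.normal()   # simulated transition
    # temporal difference of J, plus reward, q, and discount corrections:
    delta = (feat_J(t+dt, x_next) - feat_J(t, x)) @ theta \
            + r*dt - (feat_q(t, x, a) @ psi)*dt - beta*(feat_J(t, x) @ theta)*dt
    theta = theta + alpha * feat_J(t, x) * delta       # xi_t  = dJ/dtheta
    psi   = psi   + alpha * feat_q(t, x, a) * delta    # zeta_t = dq/dpsi
    x, t = x_next, t + dt

print(theta, psi)
```

The scalar `delta` is exactly the bracketed term in the update rules above, evaluated on one discretized transition.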
Choose the same types of test functions $\xi_t$ and $\zeta_t$ as above, but now minimize the GMM objective functions

$$\mathbb{E}^{\mathbb{P}}\Big[\int_0^T \xi_t\,\mathrm{d}M_t\Big]^\top A_\theta\,\mathbb{E}^{\mathbb{P}}\Big[\int_0^T \xi_t\,\mathrm{d}M_t\Big],\qquad \mathbb{E}^{\mathbb{P}}\Big[\int_0^T \zeta_t\,\mathrm{d}M_t\Big]^\top A_\psi\,\mathbb{E}^{\mathbb{P}}\Big[\int_0^T \zeta_t\,\mathrm{d}M_t\Big],$$

where we write $\mathrm{d}M_t := \mathrm{d}J^\theta(t, X^{\pi^\psi}_t) + \big[r(t, X^{\pi^\psi}_t, a^{\pi^\psi}_t) - q^\psi(t, X^{\pi^\psi}_t, a^{\pi^\psi}_t) - \beta J^\theta(t, X^{\pi^\psi}_t)\big]\mathrm{d}t$ for the same bracketed term as above, and $A_\theta\in\mathbb{S}^{L_\theta}_{++}$, $A_\psi\in\mathbb{S}^{L_\psi}_{++}$. Typical choices of these matrices are $A_\theta = I_{L_\theta}$ and $A_\psi = I_{L_\psi}$, or $A_\theta = (\mathbb{E}^{\mathbb{P}}[\int_0^T \xi_t\xi_t^\top \mathrm{d}t])^{-1}$ and $A_\psi = (\mathbb{E}^{\mathbb{P}}[\int_0^T \zeta_t\zeta_t^\top \mathrm{d}t])^{-1}$. Again, we refer the reader to Jia and Zhou (2022a) for discussions on these choices and the connection with the classical GTD algorithms and the GMM method.

4.2 Connections with SARSA

With a fixed time-step size $\Delta t$, we can define our Q-function $Q_{\Delta t}(t, x, a; \pi)$ by (15), parameterize it by $Q^\phi_{\Delta t}(t, x, a; \pi)$, and then apply existing (big) Q-learning algorithms such as SARSA (cf. Sutton and Barto 2018, Chapter 10) to learn the parameter $\phi$. Now, how does this $\Delta t$-based Q-learning compare with q-learning? Equation (15) suggests that $Q_{\Delta t}(t, x, a; \pi)$ can be approximated by

$$Q_{\Delta t}(t,x,a;\pi) \approx J(t,x;\pi) + q(t,x,a;\pi)\,\Delta t.$$

Our q-learning method learns separately the zeroth-order and first-order terms of $Q_{\Delta t}(t, x, a; \pi)$ in $\Delta t$, and the two terms are in themselves independent of $\Delta t$. Our approach therefore has a true continuous-time nature without having to rely on $\Delta t$ or any time discretization, which theoretically facilitates the analysis carried out in the previous section and algorithmically mitigates the high sensitivity with respect to time discretization. Our approach is similar to that introduced in Baird (1994) and Tallec et al. (2019), where the value function and the (rescaled) advantage function are approximated separately.
Continuous-Time q-Learning Algorithm 1 Offline Episodic q-Learning ML Algorithm Inputs: initial state x0, horizon T, time step t, number of episodes N, number of mesh grids K, initial learning rates αθ, αψ and a learning rate schedule function l( ) (a function of the number of episodes), functional forms of parameterized value function Jθ( , ) and q-function qψ( , , ) satisfying (31), and temperature parameter γ. Required program (on-policy): environment simulator (x , r) = Environment t(t, x, a) that takes current time state pair (t, x) and action a as inputs and generates state x at time t + t and instantaneous reward r at time t as outputs. Policy πψ(a|t, x) = exp{ 1 γ qψ(t, x, a)}. Required program (off-policy): observations {atk, rtk, xtk+1}k=0, ,K 1 {xt K, h(xt K)} = Observation( t) including the observed actions, rewards, and state trajectories under the given behavior policy at the sampling time grids with step size t. Learning procedure: Initialize θ, ψ. for episode j = 1 to N do Initialize k = 0. Observe initial state x0 and store xtk x0. {On-policy case while k < K do Generate action atk πψ( |tk, xtk). Apply atk to environment simulator (x, r) = Environment t(tk, xtk, atk), and observe new state x and reward r as output. Store xtk+1 x and rtk r. Update k k + 1. end while } {Off-policy case Obtain one observation {atk, rtk, xtk+1}k=0, ,K 1 {xt K, h(xt K)} = Observation( t). } For every k = 0, 1, , K 1, compute Gtk:T = e β(T tk)h(xt K) Jθ(tk, xtk) + i=k e β(ti tk)[rti qψ(ti, xti, ati)] t. Update θ and ψ by θ θ + l(j)αθ θ (tk, xtk)Gtk:T t. 
ψ ψ + l(j)αψ ψ (tk, xtk, atk) t Jia and Zhou Algorithm 2 Offline Episodic q-Learning Algorithm Inputs: initial state x0, horizon T, time step t, number of episodes N, number of mesh grids K, initial learning rates αθ, αψ and a learning rate schedule function l( ) (a function of the number of episodes), functional forms of parameterized value function Jθ( , ) and q-function qψ( , , ) satisfying (31), functional forms of test functions ξ(t, x t, a t) and ζ(t, x t, a t), and temperature parameter γ. Required program (on-policy): environment simulator (x , r) = Environment t(t, x, a) that takes current time state pair (t, x) and action a as inputs and generates state x at time t + t and instantaneous reward r at time t as outputs. Policy πψ(a|t, x) = exp{ 1 γ qψ(t, x, a)}. Required program (off-policy): observations {atk, rtk, xtk+1}k=0, ,K 1 {xt K, h(xt K)} = Observation( t) including the observed actions, rewards, and state trajectories under the given behavior policy at the sampling time grids with step size t. Learning procedure: Initialize θ, ψ. for episode j = 1 to N do Initialize k = 0. Observe initial state x0 and store xtk x0. {On-policy case while k < K do Generate action atk πψ( |tk, xtk). Apply atk to environment simulator (x, r) = Environment t(tk, xtk, atk), and observe new state x and reward r as outputs. Store xtk+1 x and rtk r. Update k k + 1. end while } {Off-policy case Obtain one observation {atk, rtk, xtk+1}k=0, ,K 1 {xt K, h(xt K)} = Observation( t). } For every k = 0, 1, , K 1, compute and store test functions ξtk = ξ(tk, xt0, , xtk, at0, , atk), ζtk = ζ(tk, xt0, , xtk, at0, , atk). Compute i=0 ξti Jθ(ti+1, xti+1) Jθ(ti, xti) + rti t qψ(ti, xti, ati) t βJθ(ti, xti) t , i=0 ζti Jθ(ti+1, xti+1) Jθ(ti, xti) + rti t qψ(ti, xti, ati) t βJθ(ti, xti) t . Update θ and ψ by θ θ + l(j)αθ θ. ψ ψ + l(j)αψ ψ. 
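The return computation at the heart of Algorithm 1 can be sketched on one recorded trajectory: for each grid time $t_k$, accumulate the discounted rewards net of q-values and add the discounted terminal payoff, minus the current value estimate. The trajectory arrays and parameter values below are illustrative, not outputs of the paper's experiments.

```python
import numpy as np

# Sketch of G_{t_k:T} from Algorithm 1 on one episode:
# G_k = exp(-beta*(T - t_k))*h(x_T) - J^theta(t_k, x_{t_k})
#       + sum_{i=k}^{K-1} exp(-beta*(t_i - t_k)) * (r_i - q_i) * dt
# All arrays are illustrative stand-ins for an observed trajectory.

beta, dt = 0.1, 0.05
K = 4
T = K * dt
r  = np.array([1.0, 0.8, 0.6, 0.4])      # observed rewards r_{t_i}
qv = np.array([0.2, 0.1, 0.0, -0.1])     # q^psi(t_i, x_{t_i}, a_{t_i})
Jk = np.array([1.8, 1.6, 1.5, 1.45])     # J^theta(t_k, x_{t_k})
hT = 2.0                                 # terminal payoff h(x_{t_K})

G = np.empty(K)
for k in range(K):
    tk = k * dt
    disc = np.exp(-beta * (np.arange(k, K)*dt - tk))   # e^{-beta(t_i - t_k)}
    G[k] = np.exp(-beta*(T - tk))*hT - Jk[k] + np.sum(disc * (r[k:] - qv[k:]) * dt)
print(G)
```

Each `G[k]` is the martingale-loss residual used in the gradient updates of $\theta$ and $\psi$ for grid time $t_k$.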
Continuous-Time q-Learning Algorithm 3 Online-Incremental q-Learning Algorithm Inputs: initial state x0, horizon T, time step t, number of mesh grids K, initial learning rates αθ, αψ and learning rate schedule function l( ) (a function of the number of episodes), functional forms of parameterized value function Jθ( , ) and q-function qψ( , , ) satisfying (31), functional forms of test functions ξ(t, x t, a t) and ζ(t, x t, a t), and temperature parameter γ. Required program (on-policy): environment simulator (x , r) = Environment t(t, x, a) that takes current time state pair (t, x) and action a as inputs and generates state x at time t + t and instantaneous reward r at time t as outputs. Policy πψ(a|t, x) = exp{ 1 γ qψ(t, x, a)}. Required program (off-policy): observations {a, r, x } = Observation(t, x; t) including the observed actions, rewards, and state when the current time-state pair is (t, x) under the given behavior policy at the sampling time grids with step size t. Learning procedure: Initialize θ, ψ. for episode j = 1 to do Initialize k = 0. Observe initial state x0 and store xtk x0. while k < K do {On-policy case Generate action atk πψ( |tk, xtk). Apply atk to environment simulator (x, r) = Environment t(tk, xtk, atk), and observe new state x and reward r as outputs. Store xtk+1 x and rtk r. } {Off-policy case Obtain one observation atk, rtk, xtk+1 = Observation(tk, xtk; t). } Compute test functions ξtk = ξ(tk, xt0, , xtk, at0, , atk), ζtk = ζ(tk, xt0, , xtk, at0, , atk). Compute δ = Jθ(tk+1, xtk+1) Jθ(tk, xtk) + rtk t qψ(tk, xtk, atk) t βJθ(tk, xtk) t, Update θ and ψ by θ θ + l(j)αθ θ. ψ ψ + l(j)αψ ψ. Update k k + 1 end while end for Jia and Zhou However, as pointed out already, as much as the two approaches may lead to certain similar algorithms, they are different conceptually. The ones in Baird (1994) and Tallec et al. (2019) are still based on discrete-time MDPs and hence depend on the size of time-discretization. 
By contrast, the value function and q-function in q-learning are well-defined quantities in continuous time and independent of any time-discretization. On the other hand, approximating the value function by Jθ and the q-function by qψ separately yields a specific approximator of the Q-function by Qθ,ψ t (t, x, a) Jθ(t, x) + qψ(t, x, a) t. (32) With this Q-function approximator Qθ,ψ, one of our q-learning based algorithms actually recovers (a modification of) SARSA, one of the most well-known conventional Q-learning algorithm. To see this, consider state Xt at time t. Then the conventional SARSA with the approximator Qθ,ψ updates parameters (θ, ψ) by Qθ,ψ t (t + t, Xaπψ t t+ t, aπψ t+ t) γ log πψ(aπψ t+ t|t + t, Xaπψ t t+ t) t Qθ,ψ t (t, Xt, aπψ t ) + r(t, Xt, aπψ t ) t βQθ,ψ t (t, Xt, aπψ t ) t Qθ,ψ t θ (t, Xt, aπψ t ), Qθ,ψ t (t + t, Xaπψ t t+ t, aπψ t+ t) γ log πψ(aπψ t+ t|t + t, Xaπψ t t+ t) t Qθ,ψ t (t, Xt, aπψ t ) + r(t, Xt, aπψ t ) t βQθ,ψ t (t, Xt, aπψ t ) t Qθ,ψ t ψ (t, Xt, aπψ t ), (33) where aπψ t πψ( |t, Xt) and aπψ t+ t πψ( |t + t, Xat t+ t). Note that by (32), we have Qθ,ψ t θ Jθ t θ , Qθ,ψ t ψ qψ t ψ t. This means that Qθ,ψ t ψ has an additional order of t when compared with Qθ,ψ t θ , resulting in a much slower rate of update on ψ than θ. This observation is also made by Tallec et al. (2019), who suggest as a remedy modifying the update rule for Q-function by replacing the partial derivative of the Q-function with that of the advantage function. In our setting, this means we change Qθ,ψ t ψ to qψ t ψ in the second update rule of (33). 
Continuous-Time q-Learning Now, the terms in the brackets in (33) can be further rewritten as: Qθ,ψ t (t + t, Xaπψ t t+ t, aπψ t+ t) γ log πψ(aπψ t+ t|t + t, Xaπψ t t+ t) t Qθ,ψ t (t, Xt, aπψ t ) + r(t, Xt, aπψ t ) t βQθ,ψ t (t, Xt, aπψ t ) t =Jθ(t + t, Xaπψ t t+ t) Jθ(t, Xt) + [qψ(t + t, Xaπψ t t+ t, aπψ t+ t) qψ(t, Xt, aπψ t )] t + r(t, Xt, aπψ t ) t βJθ(t, Xt) t βqψ(t, Xt, aπψ t )( t)2 γ log πψ(aπψ t+ t|t + t, Xaπψ t t+ t) t =Jθ(t + t, Xaπψ t t+ t) Jθ(t, Xt) qψ(t, Xt, aπψ t ) t + r(t, Xt, aπψ t ) t βJθ(t, Xt) t + qψ(t + t, Xaπψ t t+ t, aπψ t+ t) γ log πψ(aπψ t+ t|t + t, Xaπψ t t+ t) t βqψ(t, Xt, aπψ t )( t)2 d Jθ(t, Xt) qψ(t, Xt, aπψ t )dt + r(t, Xt, aπψ t )dt βJθ(t, Xt)dt + qψ(t + t, Xaπψ t t+ t, aπψ t+ t) γ log πψ(aπψ t+ t|t + t, Xaπψ t t+ t) | {z } mean-0 term due to (29) where in the last step we drop the higher-order small term ( t)2 while replacing the difference terms with differential ones. Comparing (33) with the updating rule in Algorithm 3 using test functions ξt = Jθ θ (t, Xπψ t ) and ζt = qψ ψ (t, Xπψ t , aπψ t ), we find that the modified SARSA in Tallec et al. (2019) and Algorithm 3 differ only by a term whose mean is 0 due to the constraint (29) on the qfunction. This term is solely driven by the action randomization at time t + t. Hence, the algorithm in Tallec et al. (2019) is noisier and may be slower in convergence speed compared with the q-learning one. In Section 7, we will numerically compare the results of the above Q-learning SARSA and q-learning. 5. q-Learning Algorithms When Normalizing Constant Is Unavailable Theorem 7 assumes that the normalizing constant in the Gibbs measure is available so that one can enforce the constraint (21). So does Theorem 9 because the second constraint in (27) is exactly to verify the normalizing constant to be 1. But it is well known that in most high-dimensional cases computing this constant is a daunting, and often impossible task. 
5.1 A stronger policy improvement theorem

To overcome the difficulty arising from an unavailable normalizing constant in soft Q-learning, Haarnoja et al. (2018b) propose a general method of using a family of stochastic policies whose densities can be easily computed to approximate the target policy. In our current setting, denote by {π^φ(·|t, x)}_{φ∈Φ} the family of density functions of some tractable distributions, e.g., the multivariate normal distribution π^φ(·|t, x) = N(µ^φ(t, x), Σ^φ(t, x)), where µ^φ(t, x) ∈ R^m and Σ^φ(t, x) ∈ S^m_{++}. The learning procedure then proceeds as follows: starting from a policy π^φ within this family, project the desired policy

    exp{(1/γ) q(t, x, ·; π^φ)} / ∫_A exp{(1/γ) q(t, x, a; π^φ)} da

onto the set {π^{φ′}(·|t, x)}_{φ′∈Φ} by solving

    min_{φ′∈Φ} D_KL( π^{φ′}(·|t, x) ‖ exp{(1/γ) q(t, x, ·; π^φ)} / ∫_A exp{(1/γ) q(t, x, a; π^φ)} da ),

which, because the normalizing constant does not depend on φ′, has the same minimizer as

    min_{φ′∈Φ} D_KL( π^{φ′}(·|t, x) ‖ exp{(1/γ) q(t, x, ·; π^φ)} ),

where D_KL(f ‖ g) := ∫_A f(a) log (f(a)/g(a)) da is the Kullback-Leibler (KL) divergence of two positive functions f, g with the same support on A, with f ∈ P(A) a probability density function on A. This procedure may not eventually lead to the desired target policy, but the newly defined policy still provably improves the previous one, as indicated in the following stronger version of the policy improvement theorem that generalizes Theorem 2.

Theorem 10 Given (t, x) ∈ [0, T] × R^d, if two policies π′ ∈ Π and π ∈ Π satisfy

    D_KL( π′(·|t, x) ‖ exp{(1/γ) H(t, x, ·, ∂J/∂x(t, x; π), ∂²J/∂x²(t, x; π))} ) ≤ D_KL( π(·|t, x) ‖ exp{(1/γ) H(t, x, ·, ∂J/∂x(t, x; π), ∂²J/∂x²(t, x; π))} ),

then J(t, x; π′) ≥ J(t, x; π).

Theorem 10 in itself is a general result comparing two given policies, not necessarily within any tractable family of densities. However, an implication of the result is that, given a current tractable policy π^φ, as long as we update it to a new tractable policy π^{φ′} with

    D_KL( π^{φ′}(·|t, x) ‖ exp{(1/γ) q(t, x, ·; π^φ)} ) ≤ D_KL( π^φ(·|t, x) ‖ exp{(1/γ) q(t, x, ·; π^φ)} ),

then π^{φ′} improves upon π^φ.
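To make the projection concrete, the following sketch minimizes the KL divergence to the Gibbs target over a one-dimensional Gaussian family by sampled gradient descent. The quadratic q-function and all numerical values here are hypothetical, chosen so that the target Gibbs policy is exactly N(a*, γ) and convergence can be checked.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.1          # temperature
alpha_phi = 0.05     # learning rate
dt = 0.01

# Hypothetical target: q(a) = -(a - a_star)^2 / 2, whose Gibbs policy
# exp{q/gamma}/Z is N(a_star, gamma). We project onto the Gaussian family
# N(mu, gamma) parameterized by phi = mu.
a_star = 0.7

def q(a):
    return -(a - a_star) ** 2 / 2.0

mu = 0.0  # phi, initialized away from the target
for _ in range(20000):
    a = mu + np.sqrt(gamma) * rng.standard_normal()
    # log pi_phi(a) up to a phi-independent constant, and its phi-gradient;
    # dropping the constant leaves the gradient estimate unbiased
    log_pi = -(a - mu) ** 2 / (2 * gamma)
    grad_log_pi = (a - mu) / gamma
    # stochastic KL-descent step:
    # phi <- phi - gamma * alpha * dt * [log pi - q/gamma] * grad log pi
    mu -= gamma * alpha_phi * dt * (log_pi - q(a) / gamma) * grad_log_pi

print(mu)  # approaches a_star = 0.7
```

Because the family contains the target here, the projection recovers it exactly; in general the projection only yields the improved-but-not-optimal policy that Theorem 10 guarantees.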
Thus, to update π^φ it suffices to solve the optimization problem

    min_{φ′∈Φ} D_KL( π^{φ′}(·|t, x) ‖ exp{(1/γ) q(t, x, ·; π^φ)} ) = min_{φ′∈Φ} ∫_A [ log π^{φ′}(a|t, x) − (1/γ) q(t, x, a; π^φ) ] π^{φ′}(a|t, x) da.

The gradient of the above objective function in φ′ is

    ∂/∂φ′ ∫_A [ log π^{φ′}(a|t, x) − (1/γ) q(t, x, a; π^φ) ] π^{φ′}(a|t, x) da
    = ∫_A [ log π^{φ′}(a|t, x) − (1/γ) q(t, x, a; π^φ) ] ∂π^{φ′}/∂φ′ (a|t, x) da + ∫_A (∂/∂φ′ log π^{φ′}(a|t, x)) π^{φ′}(a|t, x) da
    = ∫_A [ log π^{φ′}(a|t, x) − (1/γ) q(t, x, a; π^φ) ] (∂/∂φ′ log π^{φ′}(a|t, x)) π^{φ′}(a|t, x) da,

where we have noted

    ∫_A (∂/∂φ′ log π^{φ′}(a|t, x)) π^{φ′}(a|t, x) da = ∫_A ∂π^{φ′}/∂φ′ (a|t, x) da = ∂/∂φ′ ∫_A π^{φ′}(a|t, x) da = 0.

Therefore, we may update φ incrementally at each step by

    φ ← φ − γ α_φ dt [ log π^φ(a^{π^φ}_t | t, X_t) − (1/γ) q(t, X_t, a^{π^φ}_t; π^φ) ] ∂/∂φ log π^φ(a^{π^φ}_t | t, X_t),    (34)

where we choose the step size to be specifically γ α_φ dt, for a reason that will become evident momentarily. The above updating rule seems to indicate that we need to learn the q-function q(t, x, a; π^φ) associated with the policy π^φ. This is doable in view of Theorem 7, noting that {π^φ(·|t, x)}_{φ∈Φ} is a tractable family of distributions so that the constraint (21) or (29) can be computed and enforced. However, we do not need to do so, given that (34) only requires the q-value along the trajectory {(t, X_t, a^{π^φ}_t); 0 ≤ t ≤ T}, instead of the full functional form q(·, ·, ·; π^φ). The former is easier to compute by temporal-difference learning, as will be evident in the next subsection.

5.2 Connections with policy gradient

Applying Itô's lemma to J(·, ·; π) and recalling Definition 4, we have the following important relation between the q-function and the temporal difference of the value function:

    q(t, X^π_t, a^π_t; π) dt = dJ(t, X^π_t; π) + r(t, X^π_t, a^π_t) dt − β J(t, X^π_t; π) dt + {···} dW̃_t.    (35)

Plugging the above into (34) and ignoring the martingale difference term {···} dW̃_t whose mean is 0, the gradient-based updating rule for φ becomes

    φ ← φ + α_φ [ −γ log π^φ(a^{π^φ}_t | t, X^{π^φ}_t) dt + dJ(t, X^{π^φ}_t; π^φ) + r(t, X^{π^φ}_t, a^{π^φ}_t) dt − β J(t, X^{π^φ}_t; π^φ) dt ] ∂/∂φ log π^φ(a^{π^φ}_t | t, X^{π^φ}_t).
(36)

In this updating rule, we only need to approximate J(·, ·; π^φ), which is a PE problem. Specifically, denote by J^θ(·, ·), θ ∈ Θ, the function approximator of J(·, ·; π^φ); we may apply any PE method developed in Jia and Zhou (2022a) to learn J^θ. Given this value function approximation, the updating rule (36) further specializes to

    φ ← φ + α_φ [ −γ log π^φ(a^{π^φ}_t | t, X^{π^φ}_t) dt + dJ^θ(t, X^{π^φ}_t) + r(t, X^{π^φ}_t, a^{π^φ}_t) dt − β J^θ(t, X^{π^φ}_t) dt ] ∂/∂φ log π^φ(a^{π^φ}_t | t, X^{π^φ}_t).    (37)

Expression (37) coincides with the updating rule in the policy gradient method established in Jia and Zhou (2022b) when the regularizer therein is taken to be the entropy. Moreover, since we learn the value function and the policy simultaneously, the updating rule leads to actor-critic type algorithms. Finally, with this method, while learning the value function may involve martingale conditions as in Jia and Zhou (2022a), learning the policy does not. Schulman et al. (2017) note the equivalence between soft Q-learning and PG in discrete time. Here we present the continuous-time counterpart of that equivalence, now with a theoretical justification, which recovers the PG-based algorithms in Jia and Zhou (2022b) when combined with suitable PE methods.

6. Extension to Ergodic Tasks

We now extend the previous study to ergodic tasks, in which the objective is to maximize the long-run average over the infinite time horizon [0, ∞), the functions b, σ and r do not depend on time t explicitly, and h = 0. The set of admissible (stationary) policies can be similarly defined in a straightforward manner. The regularized ergodic objective function is the long-run average

    liminf_{T→∞} (1/T) E^{P^{W̃}}[ ∫_t^T ∫_A [r(X̃^π_s, a) − γ log π(a|X̃^π_s)] π(a|X̃^π_s) da ds | X̃^π_t = x ]
    = liminf_{T→∞} (1/T) E^P[ ∫_t^T [r(X^π_s, a^π_s) − γ log π(a^π_s|X^π_s)] ds | X^π_t = x ],

where γ > 0 is the temperature parameter. We first present the ergodic version of the Feynman-Kac formula and the corresponding policy improvement theorem.
Lemma 11 Let π = π(·|·) be a given admissible (stationary) policy.^14 Suppose there is a function J(·; π) ∈ C²(R^d) with polynomial growth and a scalar V(π) ∈ R satisfying

    ∫_A [ H(x, a, ∂J/∂x(x; π), ∂²J/∂x²(x; π)) − γ log π(a|x) ] π(a|x) da − V(π) = 0,  ∀x ∈ R^d.    (38)

Then for any t ≥ 0,

    V(π) = liminf_{T→∞} (1/T) E^{P^{W̃}}[ ∫_t^T ∫_A [r(X̃^π_s, a) − γ log π(a|X̃^π_s)] π(a|X̃^π_s) da ds | X̃^π_t = x ]
         = liminf_{T→∞} (1/T) E^P[ ∫_t^T [r(X^π_s, a^π_s) − γ log π(a^π_s|X^π_s)] ds | X^π_t = x ].    (39)

Moreover, if two admissible policies π′ and π are such that for all x ∈ R^d,

    D_KL( π′(·|x) ‖ exp{(1/γ) H(x, ·, ∂J/∂x(x; π), ∂²J/∂x²(x; π))} ) ≤ D_KL( π(·|x) ‖ exp{(1/γ) H(x, ·, ∂J/∂x(x; π), ∂²J/∂x²(x; π))} ),

then V(π′) ≥ V(π).

We emphasize that the solution to (38) is a pair (J, V), where J is a function of the state and V ∈ R is a scalar. The long-run average of the payoff does not depend on the initial state x or the initial time t due to ergodicity, and hence is a constant, as (39) implies. The function J, on the other hand, represents the first-order approximation of the long-run average and is not unique: indeed, for any constant c, (J + c, V) is also a solution to (38). We refer to V as the value of the underlying problem and still refer to J as the value function. Lastly, since the value does not depend on the initial time, we will fix the latter as 0 in the following discussions and applications of ergodic learning tasks.

14. For such ergodic tasks, admissibility typically also requires that the process X^π be ergodic (cf. Meyn and Tweedie 1993).
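The non-uniqueness of J can be verified in one line: (38) depends on J only through its first and second derivatives in x, so substituting J + c for any constant c leaves every term unchanged,

```latex
\int_A \Big[ H\Big(x, a, \tfrac{\partial (J+c)}{\partial x}(x;\pi),
  \tfrac{\partial^2 (J+c)}{\partial x^2}(x;\pi)\Big)
  - \gamma \log \pi(a|x) \Big] \pi(a|x)\,\mathrm{d}a - V(\pi)
= \int_A \Big[ H\Big(x, a, \tfrac{\partial J}{\partial x}(x;\pi),
  \tfrac{\partial^2 J}{\partial x^2}(x;\pi)\Big)
  - \gamma \log \pi(a|x) \Big] \pi(a|x)\,\mathrm{d}a - V(\pi) = 0,
```

since ∂(J+c)/∂x = ∂J/∂x and ∂²(J+c)/∂x² = ∂²J/∂x². Hence (J + c, V(π)) solves (38) whenever (J, V(π)) does.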
As with the episodic case, for any admissible policy π we can define the Δt-parameterized Q-function as

    Q_Δt(x, a; π)
    = E^{P^{W̃}}[ ∫_t^{t+Δt} [r(X^a_s, a) − V(π)] ds + lim_{T→∞} E^P[ ∫_{t+Δt}^T [r(X^π_s, a^π_s) − γ log π(a^π_s|X^π_s) − V(π)] ds | X^a_{t+Δt} ] | X^π_t = x ]
    = E^{P^{W̃}}[ ∫_t^{t+Δt} [r(X^a_s, a) − V(π)] ds + J(X^a_{t+Δt}; π) | X^π_t = x ] − E^{P^{W̃}}[ lim_{T→∞} E^P[J(X^π_T; π) | X^a_{t+Δt}] | X^π_t = x ]
    = E^{P^{W̃}}[ ∫_t^{t+Δt} [r(X^a_s, a) − V(π)] ds + J(X^a_{t+Δt}; π) − J(x; π) | X^π_t = x ] + J(x; π) − E^{P^{W̃}}[ lim_{T→∞} E^P[J(X^π_T; π) | X^a_{t+Δt}] | X^π_t = x ]
    = E^{P^{W̃}}[ ∫_t^{t+Δt} [ H(X^a_s, a, ∂J/∂x(X^a_s; π), ∂²J/∂x²(X^a_s; π)) − V(π) ] ds | X^π_t = x ] + J(x; π) − E^{P^{W̃}}[ lim_{T→∞} E^P[J(X^π_T; π) | X^a_{t+Δt}] | X^π_t = x ]
    = J(x; π) + [ H(x, a, ∂J/∂x(x; π), ∂²J/∂x²(x; π)) − V(π) ] Δt − J̄ + O((Δt)²),

where we have assumed that lim_{T→∞} E^P[J(X^π_T; π) | X^π_t = x] = J̄ is a constant for any (t, x) ∈ [0, +∞) × R^d. The corresponding q-function is then defined as

    q(x, a; π) = H(x, a, ∂J/∂x(x; π), ∂²J/∂x²(x; π)) − V(π).

As a counterpart to Theorem 7, we have the following theorem characterizing the value function, the value, and the q-function for both on-policy and off-policy ergodic learning problems. A counterpart to Theorem 9 can be similarly established, and we leave the details to the reader.

Theorem 12 Let an admissible policy π, a number V̂, a function Ĵ ∈ C²(R^d) with polynomial growth, and a continuous function q̂ : R^d × A → R be given, satisfying

    lim_{T→∞} (1/T) E[Ĵ(X^π_T)] = 0,    ∫_A [ q̂(x, a) − γ log π(a|x) ] π(a|x) da = 0,  ∀x ∈ R^d.    (40)

(i) V̂, Ĵ and q̂ are respectively the value, the value function and the q-function associated with π if and only if for all x ∈ R^d, the following process

    Ĵ(X^π_t) + ∫_0^t [ r(X^π_s, a^π_s) − q̂(X^π_s, a^π_s) − V̂ ] ds    (41)

is an ({F_t}_{t≥0}, P)-martingale, where {X^π_t, 0 ≤ t < ∞} is the solution to (6) with X^π_0 = x.
(ii) If V̂, Ĵ and q̂ are respectively the value, the value function and the q-function associated with π, then for any admissible π′ and any x ∈ R^d, the following process

    Ĵ(X^{π′}_t) + ∫_0^t [ r(X^{π′}_u, a^{π′}_u) − q̂(X^{π′}_u, a^{π′}_u) − V̂ ] du    (42)

is an ({F_t}_{t≥0}, P)-martingale, where {X^{π′}_t, 0 ≤ t < ∞} is the solution to (6) under π′ with initial condition X^{π′}_0 = x.

(iii) If there exists an admissible π′ such that for all x ∈ R^d, (42) is an ({F_t}_{t≥0}, P)-martingale where X^{π′}_0 = x, then V̂, Ĵ and q̂ are respectively the value, the value function and the q-function associated with π.

Moreover, in any of the three cases above, if it holds further that π(a|x) = exp{(1/γ) q̂(x, a)} / ∫_A exp{(1/γ) q̂(x, a)} da, then π is the optimal policy and V̂ is the optimal value.

Based on Theorem 12, we can design corresponding q-learning algorithms for ergodic problems, in which we learn V, J^θ and q^ψ at the same time. These algorithms have natural connections with SARSA as well as with the PG-based algorithms developed in Jia and Zhou (2022b). The discussions are similar to those in Subsections 4.2 and 5.2 and hence are omitted here. An online algorithm for ergodic tasks is presented as an example; see Algorithm 4.

7. Applications

7.1 Mean-variance portfolio selection

We first review the formulation of the exploratory mean-variance portfolio selection problem, originally proposed by Wang and Zhou (2020) and later revisited by Jia and Zhou (2022b).^15 The investment universe consists of one risky asset (e.g., a stock index) and one risk-free asset (e.g., a savings account) whose risk-free interest rate is r. The price of the risky asset {S_t, 0 ≤ t ≤ T} is governed by a geometric Brownian motion with drift µ and volatility σ > 0, defined on a filtered probability space (Ω, F, P^W; {F^W_t}_{0≤t≤T}). An agent has a fixed investment horizon 0 < T < ∞ and an initial endowment x_0. A self-financing portfolio is represented by the real-valued adapted process a = {a_t, 0 ≤ t ≤ T}, where a_t
15. There is a vast literature on classical model-based (non-exploratory) continuous-time mean-variance models formulated as stochastic control problems; see, e.g., Zhou and Li (2000); Lim and Zhou (2002); Zhou and Yin (2003) and the references therein.

Algorithm 4 q-Learning Algorithm for Ergodic Tasks
Inputs: initial state x_0, time step Δt, initial learning rates α_θ, α_ψ, α_V and learning rate schedule function l(·) (a function of time), functional forms of the parameterized value function J^θ(·) and q-function q^ψ(·, ·) satisfying (40), functional forms of test functions ξ and ζ (functions of time and the historical path of states and actions), and temperature parameter γ.
Required program (on-policy): an environment simulator (x′, r) = Environment_Δt(x, a) that takes the current state x and action a as inputs and generates a new state x′ after Δt and an instantaneous reward r as outputs. Policy π^ψ(a|x) = exp{(1/γ) q^ψ(x, a)} / ∫_A exp{(1/γ) q^ψ(x, a)} da.
Required program (off-policy): observations {a, r, x′} = Observation(x; Δt), including the observed actions, rewards, and states when the current state is x under the given behavior policy, at the sampling time grid with step size Δt.
Learning procedure:
  Initialize θ, ψ, V. Initialize k = 0. Observe the initial state x_0 and store x_{t_k} ← x_0.
  loop
    {On-policy case: Generate action a ∼ π^ψ(·|x). Apply a to the environment simulator (x′, r) = Environment_Δt(x, a), and observe the new state x′ and reward r. Store x_{t_{k+1}} ← x′.}
    {Off-policy case: Obtain one observation (a_{t_k}, r_{t_k}, x_{t_{k+1}}) = Observation(x_{t_k}; Δt).}
    Compute the test functions ξ_{t_k} = ξ(t_k, x_{t_0}, …, x_{t_k}, a_{t_0}, …, a_{t_k}) and ζ_{t_k} = ζ(t_k, x_{t_0}, …, x_{t_k}, a_{t_0}, …, a_{t_k}).
    Compute δ = J^θ(x′) − J^θ(x) + r Δt − q^ψ(x, a) Δt − V Δt and the corresponding increments Δθ, ΔV, Δψ.
    Update θ, V and ψ by θ ← θ + l(kΔt) α_θ Δθ, V ← V + l(kΔt) α_V ΔV, ψ ← ψ + l(kΔt) α_ψ Δψ.
    Update x ← x′ and k ← k + 1.
  end loop

Here a_t is the discounted dollar value invested in the risky asset at time t. Then her discounted wealth process follows

    dX̃_t = a_t[(µ − r) dt + σ dW̃_t] = a_t d(e^{−rt} S_t) / (e^{−rt} S_t),    X̃_0 = x_0.
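The market simulator required by the episodic algorithm can be realized by a single Euler-Maruyama step of the discounted wealth SDE above. The function below is a hypothetical stand-in for that simulator (the numerical values in the usage example are illustrative, not the paper's calibration):

```python
import numpy as np

def market_step(x, a, mu, r, sigma, dt, rng):
    """One Euler-Maruyama step of the discounted wealth SDE
    dX_t = a_t * [(mu - r) dt + sigma dW_t]; a hypothetical stand-in
    for the Market_dt simulator used by the episodic algorithm."""
    dW = np.sqrt(dt) * rng.standard_normal()
    return x + a * ((mu - r) * dt + sigma * dW)

# Simulate one episode of length T = 1 with dt = 1/25 under a constant action:
rng = np.random.default_rng(0)
x = 1.0
for _ in range(25):
    x = market_step(x, 0.5, 0.1, 0.02, 0.2, 1 / 25, rng)
print(x)  # terminal discounted wealth of this sample path
```

Note that only the wealth increment, not the stock price itself, needs to be simulated, since the SDE is driven directly by the excess return.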
The agent aims to minimize the variance of the discounted value of the portfolio at time T while maintaining the expected return at a given level; that is,

    min_a Var(X^a_T), subject to E[X^a_T] = z,    (43)

where z is the target expected terminal wealth, and the variance and expectation throughout this subsection are with respect to the probability measure P^W. The exploratory formulation of this problem with an entropy regularizer is equivalent to

    J(t, x; w) = max_π E^{P^{W̃}}[ −(X̃^π_T − w)² − γ ∫_t^T log π(a^π_s | s, X̃^π_s) ds | X̃^π_t = x ] + (w − z)²,    (44)
    dX̃^π_s = (µ − r) ( ∫_R a π(a|s, X̃^π_s) da ) ds + σ √( ∫_R a² π(a|s, X̃^π_s) da ) dW̃_s,  X̃^π_t = x,

where w is the Lagrange multiplier introduced to relax the expected return constraint; see Wang and Zhou (2020) for a derivation of this formulation. Note that here we artificially add the minus sign to transform the variance minimization into a maximization problem, to be consistent with our general formulation. We now follow Wang and Zhou (2020) and Jia and Zhou (2022b) to parameterize the value function by

    J^θ(t, x; w) = −(x − w)² e^{−θ₃(T−t)} + θ₂(t² − T²) + θ₁(t − T) + (w − z)².

Moreover, we parameterize the q-function by

    q^ψ(t, x, a; w) = −(e^{−ψ₁−ψ₃(T−t)}/2) (a + ψ₂(x − w))² − (γ/2)[log 2πγ + ψ₁ + ψ₃(T − t)],

which is derived to satisfy the constraint (29), as explained in Example 1. The policy associated with this parametric q-function is π^ψ(·|x; w) = N(−ψ₂(x − w), γ e^{ψ₁+ψ₃(T−t)}). In addition to θ = (θ₁, θ₂, θ₃) and ψ = (ψ₁, ψ₂, ψ₃), we also learn the Lagrange multiplier w by stochastic approximation in the same way as in Wang and Zhou (2020) and Jia and Zhou (2022b). The full algorithm is summarized in Algorithm 5.^16 We then compare by simulation the performance of three learning algorithms: the q-learning based Algorithm 5 in this paper, the PG-based Algorithm 4 in Jia and Zhou (2022b), and the Δt-parameterized Q-learning algorithm presented in Appendix B. We conduct simulations with the following configurations: µ ∈ {0, ±0.1, ±0.3, ±0.5}, σ ∈ {0.1, 0.2, 0.3, 0.4}, T = 1, x₀ = 1, z = 1.4.
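The parametric q-function above satisfies the normalization constraint (29) by construction: exp{q^ψ/γ} is exactly a Gaussian density, so q^ψ − γ log π^ψ vanishes pointwise. A quick numerical check (the values of ψ, w, x and t below are arbitrary illustrative choices; only γ = 0.1 and T = 1 come from the experiment setup):

```python
import numpy as np

gamma, T, t = 0.1, 1.0, 0.3
psi1, psi2, psi3, w, x = 0.2, 0.5, -0.1, 1.4, 1.1   # illustrative values

s = psi1 + psi3 * (T - t)
mean, var = -psi2 * (x - w), gamma * np.exp(s)       # policy N(mean, var)

def q(a):
    # parametric q-function q^psi(t, x, a; w)
    return (-0.5 * np.exp(-s) * (a + psi2 * (x - w)) ** 2
            - 0.5 * gamma * (np.log(2 * np.pi * gamma) + s))

def log_pi(a):
    # Gaussian log-density of pi^psi(.|t, x; w)
    return -0.5 * np.log(2 * np.pi * var) - (a - mean) ** 2 / (2 * var)

# q - gamma * log_pi is identically zero, so (29) holds for every (t, x):
a = np.linspace(mean - 3, mean + 3, 7)
print(np.max(np.abs(q(a) - gamma * log_pi(a))))  # ~ 0 up to rounding
```

This identity is what allows the Gibbs policy to be sampled in closed form without ever computing a normalizing constant numerically.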
Other tuning parameters in all the algorithms are chosen as γ = 0.1, m = 10, α_θ = α_ψ = 0.001, α_w = 0.005, and l(j) = 1/j^{0.51}. To have more realistic scenarios, we generate 20 years of training data and compare the three algorithms on the same data set for N = 20,000 episodes with a batch size of 32. More precisely, in each update 32 trajectories of length T are drawn from the training set to update the parameters to be learned, N times in total.^16

16. Here we present an offline algorithm as an example. Online algorithms can also be devised following the general study in the previous sections.

Algorithm 5 Offline Episodic q-Learning Mean-Variance Algorithm
Inputs: initial state x₀, horizon T, time step Δt, number of episodes N, number of time grids K, initial learning rates α_θ, α_ψ, α_w and learning rate schedule function l(·) (a function of the number of episodes), and temperature parameter γ.
Required program: a market simulator x′ = Market_Δt(t, x, a) that takes the current time-state pair (t, x) and action a as inputs and generates the state x′ at time t + Δt.
Learning procedure:
  Initialize θ, ψ, w.
  for episode j = 1 to N do
    Initialize k = 0. Observe the initial state x and store x_{t_k} ← x.
    while k < K do
      Generate action a_{t_k} ∼ π^ψ(·|t_k, x_{t_k}; w).
      Compute and store the test functions ξ_{t_k} = ∂J^θ/∂θ (t_k, x_{t_k}; w) and ζ_{t_k} = ∂q^ψ/∂ψ (t_k, x_{t_k}, a_{t_k}; w).
      Apply a_{t_k} to the market simulator x′ = Market_Δt(t_k, x_{t_k}, a_{t_k}), and observe the output new state x′. Store x_{t_{k+1}} ← x′.
      Update k ← k + 1.
    end while
    Store the terminal wealth X^{(j)}_T ← x_{t_K}.
    Compute
      Δθ = Σ_{i=0}^{K−1} ξ_{t_i} [ J^θ(t_{i+1}, x_{t_{i+1}}; w) − J^θ(t_i, x_{t_i}; w) − q^ψ(t_i, x_{t_i}, a_{t_i}; w) Δt ],
      Δψ = Σ_{i=0}^{K−1} ζ_{t_i} [ J^θ(t_{i+1}, x_{t_{i+1}}; w) − J^θ(t_i, x_{t_i}; w) − q^ψ(t_i, x_{t_i}, a_{t_i}; w) Δt ].
    Update θ and ψ by θ ← θ + l(j) α_θ Δθ and ψ ← ψ + l(j) α_ψ Δψ.
    Update w (Lagrange multiplier) every m episodes, as in Wang and Zhou (2020): if j ≡ 0 mod m then
      w ← w − α_w ( (1/m) Σ_{i=j−m+1}^{j} X^{(i)}_T − z ).
    end if
  end for
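The episode increments Δθ and Δψ computed in the inner loop of the algorithm above can be sketched as a single pass over the stored trajectory. The callables below (J, q and their test-function gradients) are user-supplied assumptions of this sketch, not the paper's implementation:

```python
def episode_increments(ts, xs, actions, J, q, grad_theta_J, grad_psi_q, dt):
    """Accumulate the episode increments used in Algorithm 5 (schematic):
    Delta_theta = sum_i xi_{t_i} * [J(t_{i+1}, x_{i+1}) - J(t_i, x_i)
                                    - q(t_i, x_i, a_i) * dt],
    and Delta_psi analogously with test function zeta."""
    d_theta, d_psi = 0.0, 0.0
    for i in range(len(actions)):
        # temporal-difference term of the martingale condition
        td = J(ts[i + 1], xs[i + 1]) - J(ts[i], xs[i]) - q(ts[i], xs[i], actions[i]) * dt
        d_theta += grad_theta_J(ts[i], xs[i]) * td          # xi test function
        d_psi += grad_psi_q(ts[i], xs[i], actions[i]) * td  # zeta test function
    return d_theta, d_psi

# Toy usage with scalar parameters and trivial test functions equal to 1:
d_theta, d_psi = episode_increments(
    [0.0, 0.5, 1.0], [1.0, 1.2, 0.9], [0.1, 0.2],
    lambda t, x: x, lambda t, x, a: 0.0,
    lambda t, x: 1.0, lambda t, x, a: 1.0, 0.5)
print(d_theta, d_psi)
```

With vector-valued parameters, the same loop applies componentwise, the test functions being the gradients ∂J^θ/∂θ and ∂q^ψ/∂ψ as in the algorithm.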
After training completes, for each market configuration we apply the learned policy out-of-sample 100 times independently and compute the average mean, variance and Sharpe ratio of the terminal wealth for each algorithm. We are particularly interested in the impact of time discretization, so we experiment with three different time discretization steps: Δt ∈ {1/25, 1/250, 1/2500}. Note that all the algorithms rely on time discretization at the implementation stage, where Δt is the sampling interval required for execution or for computing the numerical integrals in PG and q-learning. For Q-learning, Δt plays a dual role: it is both a parameter in the definition of the Δt-parameterized Q-function and the time discretization size in implementation. The numerical results are reported in Tables 1-3, each table corresponding to a different value of Δt. We observe that for any market configuration, Q-learning is almost always the worst performer in terms of all the metrics, and even diverges in certain high-volatility, low-return environments (when keeping the same learning rate as in the other cases). The other two algorithms have very close overall performance, although q-learning tends to outperform in highly volatile market environments. On the other hand, notably, the results change only slightly with different time discretizations, for all three algorithms. This observation does not align with the claim in Tallec et al. (2019) that classical Q-learning is very sensitive to time discretization, at least for this specific application problem. In addition, we notice that the q-learning and PG-based algorithms produce comparable results under most market scenarios, while the terminal variance produced by q-learning tends to be significantly higher than that of PG when |µ| is small or σ is large. In such cases, the ground-truth solution requires higher leverage in the risky asset in order to reach the high target expected return (40% annual return).
However, it appears that overall q-learning strives to keep up with the high target return and as a result ends up with high terminal variance in those cases. By contrast, PG tends to maintain a lower variance at the cost of significantly underachieving the target return.

7.2 Ergodic linear-quadratic control

Consider the ergodic LQ control problem in which the state responds to actions linearly:

    dX_t = (AX_t + Ba_t) dt + (CX_t + Da_t) dW_t,  X_0 = x_0,

and the goal is to maximize the long-run average quadratic payoff

    liminf_{T→∞} (1/T) E[ ∫_0^T r(X_t, a_t) dt | X_0 = x_0 ],  with r(x, a) = −( (M/2) x² + R x a + (N/2) a² + P x + Q a ).

The exploratory formulation of this problem with an entropy regularizer is equivalent to

    liminf_{T→∞} (1/T) E[ ∫_0^T [ r(X^π_t, a^π_t) − γ log π(a^π_t | X^π_t) ] dt | X^π_0 = x_0 ].

This problem falls into the general formulation of an ergodic task studied in Section 6, so we directly implement Algorithm 4 in our simulation. In particular, our function approximators for J and q are quadratic functions:

    J^θ(x) = θ₁x² + θ₂x,    q^ψ(x, a) = −(e^{−ψ₃}/2)(a − ψ₁x − ψ₂)² − (γ/2)(log 2πγ + ψ₃),

Table 1: Out-of-sample performance comparison in terms of mean, variance and Sharpe ratio when data are generated by a geometric Brownian motion with Δt = 1/25. We compare three algorithms: Policy Gradient, described in Algorithm 4 of Jia and Zhou (2022b); Q-Learning, described in Appendix B; and q-Learning, described in Algorithm 5. The first two columns specify market configurations (parameters of the simulated geometric Brownian motion). "Mean", "Variance", and "SR" refer respectively to the average mean, variance, and Sharpe ratio of the terminal wealth under the corresponding learned policy over 100 independent runs. We highlight the largest average Sharpe ratio in bold. "NA" refers to the case where the algorithm diverges.
The discretization size is Δt = 1/25.

µ     σ   | Policy Gradient      | Q-Learning           | q-Learning
          | Mean  Variance  SR   | Mean  Variance  SR   | Mean  Variance  SR
-0.5  0.1 | 1.409  0.003  7.134  | 1.409  0.004  6.663  | 1.408  0.004  6.805
-0.3  0.1 | 1.421  0.012  3.880  | 1.409  0.012  3.707  | 1.412  0.012  3.762
-0.1  0.1 | 1.382  0.094  1.285  | 1.327  0.070  1.263  | 1.351  0.081  1.272
 0.0  0.1 | 1.090  0.216  0.193  | 1.098  0.302  0.201  | 1.108  0.321  0.202
 0.1  0.1 | 1.312  0.147  0.835  | 1.265  0.108  0.827  | 1.292  0.133  0.831
 0.3  0.1 | 1.420  0.016  3.312  | 1.402  0.016  3.181  | 1.407  0.016  3.224
 0.5  0.1 | 1.405  0.004  6.422  | 1.403  0.005  6.026  | 1.403  0.004  6.147
-0.5  0.2 | 1.417  0.014  3.526  | 1.416  0.016  3.308  | 1.416  0.016  3.377
-0.3  0.2 | 1.445  0.060  1.915  | 1.428  0.059  1.841  | 1.434  0.059  1.867
-0.1  0.2 | 1.335  0.341  0.608  | 1.402  0.801  0.628  | 1.405  0.589  0.631
 0.0  0.2 | 1.053  0.539  0.070  | 1.168  7.995  0.100  | 1.136  3.422  0.100
 0.1  0.2 | 1.248  0.437  0.380  | 1.349  2.046  0.411  | 1.341  1.005  0.412
 0.3  0.2 | 1.445  0.083  1.634  | 1.422  0.078  1.580  | 1.431  0.081  1.600
 0.5  0.2 | 1.413  0.018  3.173  | 1.412  0.020  2.992  | 1.411  0.019  3.050
-0.5  0.3 | 1.433  0.039  2.305  | 1.432  0.043  2.180  | 1.432  0.041  2.223
-0.3  0.3 | 1.458  0.162  1.247  | 1.468  0.196  1.214  | 1.484  0.217  1.229
-0.1  0.3 | 1.226  0.569  0.335  | 1.624  8.959  0.415  | 1.556  5.558  0.415
 0.0  0.3 | 1.030  0.777  0.036  | NA     NA     NA     | 1.187  23.269  0.066
 0.1  0.3 | 1.151  0.670  0.202  | 1.539  13.741  0.271 | 1.482  13.321  0.271
 0.3  0.3 | 1.448  0.231  1.046  | 1.475  0.322  1.042  | 1.485  0.313  1.053
 0.5  0.3 | 1.431  0.048  2.073  | 1.428  0.052  1.972  | 1.429  0.050  2.008
-0.5  0.4 | 1.457  0.093  1.679  | 1.460  0.102  1.608  | 1.462  0.102  1.638
-0.3  0.4 | 1.417  0.346  0.877  | 1.670  2.363  0.898  | 1.602  1.176  0.905
-0.1  0.4 | 1.147  0.805  0.205  | NA     NA     NA     | 1.711  20.852  0.306
 0.0  0.4 | 1.017  0.893  0.016  | NA     NA     NA     | 1.211  76.040  0.048
 0.1  0.4 | 1.084  0.825  0.109  | NA     NA     NA     | 1.539  25.371  0.196
 0.3  0.4 | 1.408  0.379  0.747  | NA     NA     NA     | 1.609  1.992  0.776
 0.5  0.4 | 1.446  0.102  1.510  | 1.456  0.125  1.455  | 1.461  0.128  1.479

Table 2: Out-of-sample performance comparison in terms of mean, variance and Sharpe ratio when data are generated by a geometric Brownian motion with Δt = 1/250. We compare three algorithms: Policy Gradient, described in Algorithm 4 of Jia and Zhou (2022b); Q-Learning, described in Appendix B; and q-Learning, described in Algorithm 5. The first two columns specify market configurations (parameters of the simulated geometric Brownian motion). "Mean", "Variance", and "SR" refer respectively to the average mean, variance, and Sharpe ratio of the terminal wealth under the corresponding learned policy over 100 independent runs. We highlight the largest average Sharpe ratio in bold. "NA" refers to the case where the algorithm diverges. The discretization size is Δt = 1/250.

µ     σ   | Policy Gradient      | Q-Learning           | q-Learning
          | Mean  Variance  SR   | Mean  Variance  SR   | Mean  Variance  SR
-0.5  0.1 | 1.409  0.003  7.130  | 1.407  0.004  6.671  | 1.407  0.004  6.805
-0.3  0.1 | 1.419  0.012  3.878  | 1.407  0.012  3.710  | 1.410  0.012  3.762
-0.1  0.1 | 1.380  0.092  1.285  | 1.325  0.069  1.264  | 1.347  0.079  1.272
 0.0  0.1 | 1.092  0.218  0.201  | 1.096  0.253  0.201  | 1.106  0.306  0.202
 0.1  0.1 | 1.314  0.148  0.835  | 1.267  0.109  0.827  | 1.293  0.133  0.831
 0.3  0.1 | 1.424  0.017  3.311  | 1.406  0.017  3.183  | 1.411  0.017  3.224
 0.5  0.1 | 1.411  0.004  6.421  | 1.409  0.005  6.032  | 1.409  0.004  6.147
-0.5  0.2 | 1.415  0.014  3.524  | 1.414  0.016  3.312  | 1.413  0.015  3.377
-0.3  0.2 | 1.439  0.057  1.915  | 1.423  0.056  1.843  | 1.428  0.056  1.867
-0.1  0.2 | 1.343  0.329  0.620  | 1.369  0.460  0.628  | 1.390  0.493  0.631
 0.0  0.2 | 1.054  0.546  0.070  | 1.173  9.036  0.100  | 1.128  2.519  0.100
 0.1  0.2 | 1.241  0.413  0.372  | 1.350  1.356  0.411  | 1.340  0.925  0.412
 0.3  0.2 | 1.450  0.084  1.634  | 1.426  0.079  1.581  | 1.435  0.081  1.600
 0.5  0.2 | 1.420  0.018  3.173  | 1.418  0.020  2.995  | 1.418  0.019  3.050
-0.5  0.3 | 1.428  0.037  2.304  | 1.426  0.041  2.183  | 1.426  0.039  2.223
-0.3  0.3 | 1.467  0.171  1.248  | 1.454  0.169  1.215  | 1.466  0.178  1.229
-0.1  0.3 | 1.230  0.552  0.326  | 1.661  10.321  0.415 | 1.495  2.852  0.415
 0.0  0.3 | 1.031  0.845  0.036  | 1.236  38.868  0.066 | 1.184  21.710  0.066
 0.1  0.3 | 1.142  0.716  0.181  | 1.541  16.335  0.266 | 1.456  6.433  0.271
 0.3  0.3 | 1.466  0.230  1.064  | 1.470  0.260  1.043  | 1.486  0.281  1.053
 0.5  0.3 | 1.438  0.049  2.073  | 1.434  0.053  1.974  | 1.435  0.051  2.008
-0.5  0.4 | 1.451  0.085  1.679  | 1.448  0.090  1.611  | 1.449  0.087  1.638
-0.3  0.4 | 1.424  0.281  0.890  | 1.535  0.638  0.898  | 1.544  0.612  0.905
-0.1  0.4 | 1.153  0.777  0.192  | NA     NA     NA     | 1.747  22.299  0.306
 0.0  0.4 | 1.019  1.000  0.021  | NA     NA     NA     | 1.233  77.875  0.048
 0.1  0.4 | 1.091  1.006  0.110  | NA     NA     NA     | 1.696  48.274  0.200
 0.3  0.4 | 1.400  0.384  0.704  | 1.685  2.737  0.771  | 1.595  1.189  0.776
 0.5  0.4 | 1.463  0.114  1.510  | 1.463  0.122  1.457  | 1.466  0.121  1.479

Table 3: Out-of-sample performance comparison in terms of mean, variance and Sharpe ratio when data are generated by a geometric Brownian motion with Δt = 1/2500. We compare three algorithms: Policy Gradient, described in Algorithm 4 of Jia and Zhou (2022b); Q-Learning, described in Appendix B; and q-Learning, described in Algorithm 5. The first two columns specify market configurations (parameters of the simulated geometric Brownian motion). "Mean", "Variance", and "SR" refer respectively to the average mean, variance, and Sharpe ratio of the terminal wealth under the corresponding learned policy over 100 independent runs. We highlight the largest average Sharpe ratio in bold. "NA" refers to the case where the algorithm diverges. The discretization size is Δt = 1/2500.

µ     σ   | Policy Gradient      | Q-Learning           | q-Learning
          | Mean  Variance  SR   | Mean  Variance  SR   | Mean  Variance  SR
-0.5  0.1 | 1.408  0.003  7.130  | 1.406  0.004  6.677  | 1.406  0.004  6.805
-0.3  0.1 | 1.418  0.012  3.878  | 1.406  0.012  3.713  | 1.409  0.012  3.762
-0.1  0.1 | 1.376  0.090  1.285  | 1.324  0.069  1.264  | 1.345  0.077  1.272
 0.0  0.1 | 1.092  0.218  0.201  | 1.096  0.259  0.201  | 1.105  0.300  0.202
 0.1  0.1 | 1.316  0.149  0.835  | 1.269  0.111  0.827  | 1.294  0.133  0.831
 0.3  0.1 | 1.425  0.017  3.311  | 1.407  0.017  3.185  | 1.412  0.017  3.224
 0.5  0.1 | 1.412  0.004  6.420  | 1.410  0.005  6.037  | 1.410  0.004  6.147
-0.5  0.2 | 1.413  0.014  3.524  | 1.411  0.016  3.315  | 1.411  0.015  3.377
-0.3  0.2 | 1.435  0.056  1.915  | 1.419  0.055  1.844  | 1.424  0.056  1.867
-0.1  0.2 | 1.328  0.295  0.632  | 1.366  0.455  0.629  | 1.384  0.486  0.631
 0.0  0.2 | 1.059  0.610  0.074  | 1.174  9.894  0.100  | 1.125  2.521  0.100
 0.1  0.2 | 1.254  0.468  0.380  | 1.360  1.867  0.411  | 1.340  0.952  0.412
 0.3  0.2 | 1.450  0.084  1.634  | 1.428  0.080  1.582  | 1.436  0.081  1.600
 0.5  0.2 | 1.422  0.018  3.173  | 1.419  0.020  2.998  | 1.419  0.019  3.050
-0.5  0.3 | 1.425  0.036  2.304  | 1.422  0.040  2.185  | 1.422  0.039  2.223
-0.3  0.3 | 1.459  0.169  1.248  | 1.449  0.167  1.216  | 1.459  0.177  1.229
-0.1  0.3 | 1.225  0.597  0.327  | 1.690  11.223  0.415 | 1.487  2.943  0.415
 0.0  0.3 | 1.039  0.857  0.044  | 1.206  36.278  0.066 | 1.183  23.798  0.066
 0.1  0.3 | 1.169  0.745  0.202  | 1.457  14.390  0.266 | 1.462  8.590  0.271
 0.3  0.3 | 1.455  0.215  1.064  | 1.488  0.356  1.044  | 1.486  0.296  1.053
 0.5  0.3 | 1.438  0.049  2.073  | 1.435  0.053  1.975  | 1.436  0.051  2.008
-0.5  0.4 | 1.446  0.084  1.679  | 1.443  0.089  1.612  | 1.444  0.086  1.638
-0.3  0.4 | 1.395  0.231  0.873  | 1.529  0.619  0.899  | 1.537  0.622  0.905
-0.1  0.4 | 1.175  0.846  0.229  | NA     NA     NA     | 1.754  24.745  0.306
 0.0  0.4 | 1.026  0.967  0.025  | NA     NA     NA     | 1.170  1.66E+09  0.048
 0.1  0.4 | 1.110  0.973  0.129  | NA     NA     NA     | 1.661  59.872  0.200
 0.3  0.4 | 1.402  0.416  0.720  | 1.646  3.402  0.771  | 1.607  1.743  0.776
 0.5  0.4 | 1.456  0.108  1.510  | 1.469  0.135  1.458  | 1.467  0.128  1.479

Figure 1: Running average rewards of three RL algorithms.
A single state trajectory is generated with length T = 10⁶ and discretized at Δt = 0.1, to which three online algorithms apply: Policy Gradient, described in Algorithm 3 of Jia and Zhou (2022b); Q-Learning, described in Appendix B; and q-Learning, described in Algorithm 4. We repeat the experiment 100 times for each method and plot the average reward over time, with the shaded area indicating the standard deviation. The two dashed horizontal lines are respectively the omniscient optimal average reward without exploration when the model parameters are known, and that optimal average reward less the exploration cost.

where the form of the q-function is derived to satisfy the constraint (29), as discussed in Example 1. The policy associated with this parameterized q-function is π^ψ(·|x) = N(ψ₁x + ψ₂, γe^{ψ₃}). In addition, we have another parameter V that stands for the long-run average. We then compare our online q-learning algorithm against two theoretical benchmarks and two other online algorithms, in terms of the running average reward during the learning process. The first benchmark is the omniscient optimal level, namely the maximum long-run average reward achievable with perfect knowledge of the environment and reward (in which case exploration is unnecessary and only deterministic policies are considered). The second benchmark is the omniscient optimal level less the exploration cost, which is the maximum long-run average reward achievable by an agent who is nevertheless forced to explore under entropy regularization. The other two algorithms are the PG-based ergodic algorithm proposed in (Jia and Zhou, 2022b, Algorithm 3) and the Δt-based Q-learning algorithm presented in Appendix B. In our simulation, to ensure the stationarity of the controlled state process, we set A = −1, B = C = 0 and D = 1. Moreover, we set x₀ = 0, M = N = Q = 2, R = P = 1, and γ = 0.1.
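A minimal on-policy step of this ergodic LQ experiment can be sketched as follows. Only the dynamics, the reward coefficients and γ follow the setup above; the critic, the q-approximator values and the policy parameters in the usage example are illustrative placeholders.

```python
import numpy as np

def lq_step(x, a, dt, rng, A=-1.0, B=0.0, C=0.0, D=1.0):
    """Euler-Maruyama step of dX = (AX + Ba)dt + (CX + Da)dW, with the
    experiment's configuration A = -1, B = C = 0, D = 1 as defaults."""
    return (x + (A * x + B * a) * dt
              + (C * x + D * a) * np.sqrt(dt) * rng.standard_normal())

def reward(x, a, M=2.0, N=2.0, R=1.0, P=1.0, Q=2.0):
    """r(x, a) = -(M/2 x^2 + R x a + N/2 a^2 + P x + Q a), per Section 7.2."""
    return -(M / 2 * x**2 + R * x * a + N / 2 * a**2 + P * x + Q * a)

def td_residual(J, q, V, x, a, r, x_next, dt):
    """Temporal-difference residual delta behind the martingale condition
    of Theorem 12, as used in the ergodic q-learning algorithm (sketch)."""
    return J(x_next) - J(x) + r * dt - q(x, a) * dt - V * dt

# One on-policy step under pi_psi(.|x) = N(psi1*x + psi2, gamma*exp(psi3)):
rng = np.random.default_rng(0)
gamma, dt = 0.1, 0.1
psi1, psi2, psi3 = -0.3, -0.6, 0.0   # illustrative policy parameters
x = 0.0
a = psi1 * x + psi2 + np.sqrt(gamma * np.exp(psi3)) * rng.standard_normal()
x_next = lq_step(x, a, dt, rng)
delta = td_residual(lambda x: -0.4 * x**2, lambda x, a: 0.0, -1.0,
                    x, a, reward(x, a), x_next, dt)
print(delta)
```

In the full algorithm, δ multiplies the test functions to form the increments for θ, ψ and V; averaging δ to zero along the trajectory is precisely the sample analogue of the martingale condition.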
Learning rates for all algorithms are initialized as α_ψ = 0.001 and decay according to l(t) = 1/max{1, log t}. We also repeat the experiment 100 times under different seeds to generate samples. First, we fix Δt = 0.1 and run the three learning algorithms for a sufficiently long time T = 10⁶. All the parameters to be learned are initialized as 0, except for the initial variance of the policies, which is set to 1.

Figure 2: Paths of learned parameters of three RL algorithms. Panels: (a) the path of learned ψ₁; (b) the path of learned ψ₂; (c) the path of learned e^{ψ₃}. A single state trajectory is generated with length T = 10⁶ and discretized at Δt = 0.1, to which three online algorithms apply: Policy Gradient, described in Algorithm 3 of Jia and Zhou (2022b); Q-Learning, described in Appendix B; and q-Learning, described in Algorithm 4. All the policies are restricted to the parametric form π^ψ(·|x) = N(ψ₁x + ψ₂, e^{ψ₃}). The omniscient optimal policy is ψ₁* ≈ −0.354, ψ₂* ≈ −0.708, e^{ψ₃*} ≈ 0.035, shown by the dashed line. We repeat the experiment 100 times for each method and plot as the shaded area the standard deviation of the learned parameters; the width of each shaded area is twice the corresponding standard deviation.

Figure 3: Running average rewards of three RL algorithms with different time discretization sizes. Panels: (a) Δt = 1; (b) Δt = 0.1; (c) Δt = 0.01. A single state trajectory is generated with length T = 10⁵ and discretized at the three step sizes. For each step size, we apply three online algorithms: Policy Gradient, described in Algorithm 3 of Jia and Zhou (2022b); Q-Learning, described in Appendix B; and q-Learning, described in Algorithm 4. We repeat the experiment 100 times for each method and plot the average reward over time, with the shaded area indicating the standard deviation.
Figure 1 plots the running average reward trajectories, along with their standard deviations (which are all too small to be visible in the figure), as the learning process proceeds. Among the three methods, both the PG and q-learning algorithms perform significantly better and converge to the optimal level much faster than Q-learning. The q-learning algorithm is indeed slightly better than the PG one. Figure 2 shows the iterated values of the learned parameters for the three algorithms. Evidently, Q-learning has the slowest convergence. The other two algorithms, while eventually converging at similar speeds, exhibit quite different behaviors on their way to convergence in learning ψ₁ and ψ₂. PG seems to learn these parameters faster than q-learning at the initial stage, but then quickly overshoots the optimal level and takes a long time to correct itself, while q-learning appears much smoother and does not have this issue.

Finally, we vary Δt to examine its impact on the performance of the three algorithms. We take Δt ∈ {1, 0.1, 0.01} and set T = 10⁵ for each experiment. The learning rates and parameter initializations are fixed and the same for all three methods. We repeat each experiment 100 times and present the results in terms of the running average reward in Figure 3, using the shaded area to denote the standard deviation. It is clear that the Q-learning algorithm is very sensitive to Δt. In particular, its performance worsens significantly as Δt becomes smaller; when Δt = 0.01, the algorithm shows almost no improvement at all over time. This sensitivity to Δt is consistent with the observations made in Tallec et al. (2019). The other two algorithms, by contrast, show remarkable robustness across different discretization sizes.

7.3 Off-policy ergodic linear-quadratic control

The experiments reported in Subsection 7.2 are for on-policy learning.
In this subsection we revisit the problem, but assuming that we now have to work off-policy. Specifically, we have a sufficiently long observation of the trajectories of state, action, and reward generated under a behavior policy that is not optimal. In our experiment, we take the behavior policy to be a_t ∼ N(0, 1). We still examine the three algorithms: Policy Gradient described in Algorithm 3 in Jia and Zhou (2022b), Q-Learning described in Appendix B, and q-Learning described in Algorithm 4, with different time-discretization sizes. Because off-policy learning is often necessitated in cases where online interactions are costly (Uehara et al., 2022), we restrict ourselves to an offline dataset generated under the behavior policy. We take ∆t ∈ {1, 0.1, 0.01} and set T = 10^6 for each experiment. We repeat the experiments 100 times, with the results presented in Figure 4. Because it is generally hard or impossible to implement the learned policy and observe the resulting state and reward data in an off-policy setting, we focus on the learned parameters of the policy to see whether they converge to the desired optimal values. The horizontal axis in Figure 4 stands for the number of iterations used to update the policy parameters. In general, policy gradient methods are not applicable in the off-policy setting. This is confirmed by our experiments, which show that the corresponding algorithm diverges quickly in all cases. Q-learning still suffers from sensitivity with respect to the time discretization. The q-learning algorithm is the most stable one and converges in all scenarios. It is worth noting that in Figure 4-(a), the Q-learning and q-learning algorithms converge to the same limits, which are however different from the true optimal parameter values of the continuous-time problem.
This is because the theoretical integrals involved in the value functions are not well approximated by finite sums under coarse time discretization (∆t = 1); hence the corresponding martingale conditions under the learned parameters are not close to the theoretical continuous-time martingale conditions.

Figure 4: Paths of learned parameters of three RL algorithms with different time-discretization sizes. Panels: (a) ∆t = 1; (b) ∆t = 0.1; (c) ∆t = 0.01. A single state trajectory is generated with length T = 10^6 and discretized at each step size, under the behavior policy a_t ∼ N(0, 1). From top to bottom are the paths of learned ψ1, ψ2, and e^{ψ3}, respectively. For each step size, we apply three algorithms: Policy Gradient described in Algorithm 3 in Jia and Zhou (2022b), Q-Learning described in Appendix B, and q-Learning described in Algorithm 4. We repeat the experiments 100 times for each method and plot the learned parameters over time, with the shaded area indicating the standard deviation.

8. Conclusion

The previous trilogy on continuous-time RL under the entropy-regularized exploratory diffusion process framework (Wang et al., 2020; Jia and Zhou, 2022a,b) has respectively studied three key elements: optimal samplers for exploration, PE, and PG. PG is a special instance of policy improvement, and in the discrete-time MDP literature it can be done through the Q-function. However, there was no satisfying Q-learning theory in continuous time, so a different route was taken in Jia and Zhou (2022b) to turn the PG task into a PE problem. This paper fills this gap and adds yet another essential building block to the theoretical foundation of continuous-time RL, by developing a q-learning theory commensurate with the continuous-time setting, attacking the general policy improvement problem, and covering both on- and off-policy learning.
Although the conventional Q-function provides no information about actions in continuous time, its first-order approximation, which we call the q-function, does. Moreover, it turns out that the essential component of the q-function relevant to policy updating is the Hamiltonian associated with the problem, the latter having already played a vital role in classical model-based stochastic control theory. This fact, together with the explicit expression of the optimal stochastic policies derived in Wang et al. (2020), in turn explains and justifies the widely used Boltzmann exploration in classical RL for MDPs. We characterize the q-function as the compensator maintaining the martingality of a process consisting of the value function and the cumulative reward, in both the on-policy and off-policy settings. This characterization enables us to use the martingale-based PE techniques developed in Jia and Zhou (2022a) to simultaneously learn the q-function and the value function. Interestingly, the martingality is on the same process as in PE (Jia and Zhou, 2022a), but with respect to an enlarged filtration containing the policy randomization. The TD version of the resulting algorithm links to the well-known SARSA algorithm in classical Q-learning. A significant and outstanding research question is an analysis of the convergence rates of q-learning algorithms. The existing convergence rate results for discrete-time Q-learning cannot be extended to continuous-time q-learning, due to the continuous state space and general nonlinear function approximations, along with the fact that the behaviors of the Q-function and the q-function can be fundamentally different. A possible remedy is to carry out the convergence analysis within the general framework of stochastic approximation, which is poised to be the subject of a substantial future study. We make a final observation to conclude this paper.
The classical model-based approach typically separates estimation and optimization, à la Wonham's separation theorem (Wonham, 1968) or the plug-in method. This approach first learns a model (including formulating and estimating its coefficients/parameters) and then optimizes over the learned model. Model-free (up to some basic dynamic structure such as diffusion processes or Markov chains) RL takes a different route: it skips estimating a model and learns optimizing policies directly via PG or Q/q-learning. The q-learning theory in this paper suggests that, to learn policies, one needs to learn the q-function or the Hamiltonian. In other words, it is the Hamiltonian, which is a specific aggregation of the model coefficients, rather than each and every individual model coefficient, that needs to be learned/estimated for optimization. Clearly, from a purely computational standpoint, estimating a single function (the Hamiltonian) is much more efficient and robust than estimating multiple functions (b, σ, r, h in the present paper), in terms of avoiding or reducing over-parameterization, sensitivity to errors, and accumulation of errors. More importantly, however, (35) hints that the Hamiltonian/q-function can be learned through temporal differences of the value function, so the tasks of learning and optimizing can be accomplished in a data-driven way. This would not be the case if we chose to learn individual model coefficients separately. This observation, we believe, highlights the fundamental differences between the model-based and model-free approaches.

Acknowledgments

Zhou is supported by a start-up grant and the Nie Center for Intelligent Asset Management at Columbia University. His work is also part of a Columbia-CityU/HK collaborative project that is supported by the InnoHK Initiative, The Government of the HKSAR, and the AIFT Lab.
We are grateful for comments from seminar and conference participants at the Chinese University of Hong Kong, University of Iowa, Columbia University, Seoul National University, POSTECH, Ritsumeikan University, Shanghai University of Finance and Economics, Fudan University, The 2022 INFORMS Annual Meeting in Indianapolis, The 11th Annual Meeting of the FE Branch of the OR Society of China in Shijiazhuang, the Conference in Memory of Tomas Björk in Stockholm, and the Post-Bachelier Congress Workshop in Hong Kong. In particular, we benefited from discussions with Jiro Akahori, Xuefeng Gao, Bong-Gyu Jang, Hyeng Keun Koo, Lingfei Li, Hideo Nagai, Jun Sekine, Wenpin Tang, and David Yao. We are especially indebted to the three anonymous referees for their constructive and detailed comments that have led to an improved version of the paper.

Appendix A. Soft Q-learning in Discrete Time

We review soft Q-learning for discrete-time Markov decision processes (MDPs) here and present the analogy to some of the major results developed in the main text. It is interesting that, to the best of our knowledge, the martingale perspective for MDPs has not been explicitly presented. This in turn suggests that a continuous-time study may provide new aspects and insights even for discrete-time RL. For simplicity, we consider a time-homogeneous MDP X = {X_t, t = 0, 1, 2, ...} with a state space X, an action space A, and transition probabilities P(X_1 = x′ | X_0 = x, a_0 = a) =: p(x′|x, a). Both X and A are finite sets. The expected reward at (x, a) is r(x, a), with a discount factor β ∈ (0, 1). The agent's total expected reward is E[ Σ_{t=0}^∞ β^t r(X_t, a_t) ]. A (stochastic) policy is denoted by π(·|x) ∈ P(A), which is a probability mass function on A.

Appendix A1.
Q-function associated with an arbitrary policy

Define the value function associated with a given policy π by

J(x; π) = E[ Σ_{t=0}^∞ β^t ( r(X_t^π, a_t^π) − γ log π(a_t^π | X_t^π) ) | X_0^π = x ]
        = E[ r(x, a_0^π) − γ log π(a_0^π | x) ] + β E[ J(X_1^π; π) | X_0^π = x ],    (45)

and the Q-function associated with π by

Q(x, a; π) = r(x, a) + E[ Σ_{t=1}^∞ β^t ( r(X_t^π, a_t^π) − γ log π(a_t^π | X_t^π) ) | X_0^π = x, a_0^π = a ]
           = r(x, a) + β E[ J(X_1^a; π) | X_0^π = x, a_0^π = a ].    (46)

Adding the entropy term −γ log π(a|x) to both sides of (46), integrating over a, and noting (45), we obtain a relation between the Q-function and the value function:

E[ Q(x, a^π; π) − J(x; π) − γ log π(a^π | x) ] = 0,  where a^π ∼ π(·|x).    (47)

Moreover, substituting J(X_1^a; π) by means of (47) into the right-hand side of (46), we further obtain

Q(x, a; π) = r(x, a) + β E[ Q(X_1^a, a_1^π; π) − γ log π(a_1^π | X_1^a) | X_0^π = x, a_0^π = a ].    (48)

Applying suitable algorithms (e.g., stochastic approximation) to solve equation (48) leads to the classical SARSA algorithm. On the other hand, rewrite (46) as

J(x; π) = r(x, a) − [ Q(x, a; π) − J(x; π) ] + β E[ J(X_1^a; π) | X_0^π = x, a_0^π = a ].    (49)

Recall that A(x, a; π) := Q(x, a; π) − J(x; π) is called the advantage function for MDPs. The q-function in the main text (see (17)) is hence an advantage rate function in the continuous-time setting. In particular, (47) is analogous to the second constraint on the q-function in (29). Further, (49) implies that

M_s^{π′} := β^s J(X_s^{π′}; π) + Σ_{t=0}^{s−1} β^t [ r(X_t^{π′}, a_t^{π′}) − A(X_t^{π′}, a_t^{π′}; π) ]

is an ({F_s}_{s≥0}, P)-martingale under any policy π′, where F_s is the σ-algebra generated by X_0^{π′}, a_0^{π′}, ..., X_s^{π′}, a_s^{π′}. To see this, we compute

E[ β^s J(X_s^{π′}; π) + Σ_{t=0}^{s−1} β^t [ r(X_t^{π′}, a_t^{π′}) − A(X_t^{π′}, a_t^{π′}; π) ] | F_{s−1} ]
= β^s E[ J(X_s^{π′}; π) | X_{s−1}^{π′}, a_{s−1}^{π′} ] + Σ_{t=0}^{s−1} β^t [ r(X_t^{π′}, a_t^{π′}) − A(X_t^{π′}, a_t^{π′}; π) ]
= β^{s−1} [ J(X_{s−1}^{π′}; π) − r(X_{s−1}^{π′}, a_{s−1}^{π′}) + A(X_{s−1}^{π′}, a_{s−1}^{π′}; π) ] + Σ_{t=0}^{s−1} β^t [ r(X_t^{π′}, a_t^{π′}) − A(X_t^{π′}, a_t^{π′}; π) ]
= β^{s−1} J(X_{s−1}^{π′}; π) + Σ_{t=0}^{s−2} β^t [ r(X_t^{π′}, a_t^{π′}) − A(X_t^{π′}, a_t^{π′}; π) ],

where the second equality follows from (49).
(50)

Note that here F_s is larger than the usual historical information set up to time s, denoted by H_s, which is the σ-algebra generated by X_0^{π′}, a_0^{π′}, ..., X_s^{π′} without the last action a_s^{π′} (i.e., H_s is F_s excluding a_s^{π′}). Because M_s^{π′} is measurable with respect to the smaller σ-algebra H_s, it is automatically an ({H_s}_{s≥0}, P)-martingale. The above analysis shows that M_s^{π′} is a martingale for any π′, whether the target policy π or a behavior one, so long as the advantage function A is taken as the compensator in defining M_s^{π′}. This is the essential reason why Q-learning works for both on- and off-policy learning. However, the conclusion is not necessarily true for other types of learning methods. For example, let us take a different compensator and investigate the process β^s J(X_s^{π′}; π) + Σ_{t=0}^{s−1} β^t [ r(X_t^{π′}, a_t^{π′}) − γ log π(a_t^{π′} | X_t^{π′}) ]. The continuous-time counterpart of this process is the martingale required in the policy gradient method; see (Jia and Zhou, 2022b, Theorem 4). Now, conditioned on H_{s−1} and when π′ = π, we have

E[ β^s J(X_s^π; π) + Σ_{t=0}^{s−1} β^t ( r(X_t^π, a_t^π) − γ log π(a_t^π | X_t^π) ) | H_{s−1} ]
= β^{s−1} E[ J(X_{s−1}^π; π) − r(X_{s−1}^π, a_{s−1}^π) + A(X_{s−1}^π, a_{s−1}^π; π) | X_{s−1}^π ] + Σ_{t=0}^{s−2} β^t ( r(X_t^π, a_t^π) − γ log π(a_t^π | X_t^π) ) + β^{s−1} E[ r(X_{s−1}^π, a_{s−1}^π) − γ log π(a_{s−1}^π | X_{s−1}^π) | X_{s−1}^π ]
= β^{s−1} J(X_{s−1}^π; π) + Σ_{t=0}^{s−2} β^t ( r(X_t^π, a_t^π) − γ log π(a_t^π | X_t^π) ),

where the last equality is due to (47), which can be applied because a_{s−1}^π is generated under the policy π. This shows that the process is an ({H_s}_{s≥0}, P)-martingale when π′ = π, which in turn underpins on-policy learning.
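To illustrate, the recursion (48) and the identity (47) can be checked numerically. Below is a minimal sketch in Python on a made-up 2-state, 2-action MDP (all transition probabilities, rewards, discount, temperature, and the policy are illustrative assumptions, not taken from the paper); it iterates the soft Bellman operator of (46)/(48) to a fixed point and verifies the consistency of the resulting pair (J, Q):

```python
import numpy as np

# Hypothetical MDP data: P[x, a, x'] transition probabilities, r[x, a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])
beta, gamma = 0.9, 0.5                    # discount factor and temperature
pi = np.array([[0.6, 0.4], [0.3, 0.7]])   # a fixed stochastic policy pi(a|x)

Q = np.zeros((2, 2))
for _ in range(500):
    # identity (47): J(x) = E_{a~pi}[Q(x, a) - gamma * log pi(a|x)]
    J = (pi * (Q - gamma * np.log(pi))).sum(axis=1)
    # recursion (46): Q(x, a) = r(x, a) + beta * E_{x'~p(.|x,a)}[J(x')]
    Q = r + beta * (P @ J)

# Bellman consistency of the converged pair (J, Q)
J = (pi * (Q - gamma * np.log(pi))).sum(axis=1)
assert np.allclose(Q, r + beta * (P @ J), atol=1e-8)
```

The fixed-point iteration converges geometrically because the soft Bellman operator is a β-contraction in the sup norm.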
However, when conditioned on F_{s−1}, we have

E[ β^s J(X_s^{π′}; π) + Σ_{t=0}^{s−1} β^t ( r(X_t^{π′}, a_t^{π′}) − γ log π(a_t^{π′} | X_t^{π′}) ) | F_{s−1} ]
= β^s E[ J(X_s^{π′}; π) | X_{s−1}^{π′}, a_{s−1}^{π′} ] + Σ_{t=0}^{s−1} β^t ( r(X_t^{π′}, a_t^{π′}) − γ log π(a_t^{π′} | X_t^{π′}) )
= β^{s−1} [ J(X_{s−1}^{π′}; π) − r(X_{s−1}^{π′}, a_{s−1}^{π′}) + A(X_{s−1}^{π′}, a_{s−1}^{π′}; π) ] + Σ_{t=0}^{s−1} β^t ( r(X_t^{π′}, a_t^{π′}) − γ log π(a_t^{π′} | X_t^{π′}) )
= β^{s−1} J(X_{s−1}^{π′}; π) + Σ_{t=0}^{s−2} β^t ( r(X_t^{π′}, a_t^{π′}) − γ log π(a_t^{π′} | X_t^{π′}) ) + β^{s−1} [ A(X_{s−1}^{π′}, a_{s−1}^{π′}; π) − γ log π(a_{s−1}^{π′} | X_{s−1}^{π′}) ].

So the same process is in general not an ({F_s}_{s≥0}, P)-martingale, unless A(X_{s−1}^{π′}, a_{s−1}^{π′}; π) = γ log π(a_{s−1}^{π′} | X_{s−1}^{π′}) for any X_{s−1}^{π′}, a_{s−1}^{π′}. Two conclusions can be drawn from this example: on one hand, policy gradient works for on-policy but not off-policy learning; on the other hand, it is important to choose the correct filtration to ensure martingality, and hence the correct test functions, in designing learning algorithms. The analysis in (50) yields a joint martingale characterization of (J, A), analogous to Theorem 7. It is curious that, to the best of our knowledge, such a martingale condition, while not hard to derive in the MDP setting, has not been explicitly introduced or utilized to study Q-learning in the existing RL literature.

Appendix A2. Q-function associated with the optimal policy

Now consider the value function and Q-function associated with the optimal policy, denoted by J* and Q* respectively. Recall that the Bellman equation implies

J*(x) = sup_π E_{a∼π} { r(x, a) − γ log π(a|x) + β E[ J*(X_1^a) | X_0 = x, a_0 = a ] }.    (51)

It follows from (46) that

Q*(x, a) = r(x, a) + β E[ J*(X_1^a) | X_0 = x, a_0 = a ].    (52)

Hence (51) becomes J*(x) = sup_π E_{a∼π} [ Q*(x, a) − γ log π(a|x) ]. Solving the optimization on the right-hand side, we get the optimal policy

π*(a|x) ∝ exp{ (1/γ) Q*(x, a) },  or  π*(a|x) = exp{ (1/γ) Q*(x, a) } / Σ_{a′∈A} exp{ (1/γ) Q*(x, a′) }.

Denote

soft_γ max_a Q*(x, a) := E_{a∼π*} [ Q*(x, a) − γ log π*(a|x) ] = γ log Σ_{a∈A} exp{ (1/γ) Q*(x, a) }.
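The soft-max operator and the Gibbs policy above can be computed stably with the usual max-shift trick. The following sketch (with made-up Q-values and temperature) also verifies the normalization identity Σ_a exp{(1/γ)[Q*(x, a) − J*(x)]} = 1 and the γ → 0 limit:

```python
import numpy as np

def soft_max(q_row, gamma):
    """gamma * log(sum_a exp(q(a)/gamma)), computed with a max shift."""
    m = q_row.max()
    return m + gamma * np.log(np.exp((q_row - m) / gamma).sum())

def gibbs_policy(q_row, gamma):
    """pi(a) proportional to exp(q(a)/gamma)."""
    w = np.exp((q_row - q_row.max()) / gamma)
    return w / w.sum()

Q_row = np.array([1.0, 0.5, -0.2])   # hypothetical Q*(x, .) for a fixed state x
gamma = 0.5
J_star = soft_max(Q_row, gamma)
pi_star = gibbs_policy(Q_row, gamma)

# normalization identity: sum_a exp{(Q* - J*)/gamma} = 1
assert abs(np.exp((Q_row - J_star) / gamma).sum() - 1.0) < 1e-10
# entropy-adjusted expectation under pi* equals the soft-max value
assert abs((pi_star * (Q_row - gamma * np.log(pi_star))).sum() - J_star) < 1e-10
# gamma -> 0 recovers the ordinary max
assert abs(soft_max(Q_row, 1e-4) - Q_row.max()) < 1e-3
```

The max shift leaves the result unchanged algebraically but avoids overflow when Q-values are large relative to γ.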
Then the Bellman equation becomes J*(x) = soft_γ max_a Q*(x, a), and (52) reduces to

Q*(x, a) = r(x, a) + β E[ soft_γ max_{a′} Q*(X_1^a, a′) | X_0 = x, a_0 = a ],    (53)

which is the foundation for (off-policy) Q-learning algorithms; see, e.g., (Sutton and Barto, 2018, p. 131). On the other hand, because J*(x) = soft_γ max_a Q*(x, a) = γ log Σ_{a∈A} exp{ (1/γ) Q*(x, a) }, we have

Σ_{a∈A} exp{ (1/γ) [ Q*(x, a) − J*(x) ] } = 1.    (54)

Recall the advantage function A*(x, a) = Q*(x, a) − J*(x). Thus (54) is analogous to (25). Moreover, rearranging (52) yields

J*(x) = r(x, a) − [ Q*(x, a) − J*(x) ] + β E[ J*(X_1^a) | X_0 = x, a_0 = a ].

Based on a derivation similar to that in Appendix A1, we obtain that

β^s J*(X_s^π) + Σ_{t=0}^{s−1} β^t [ r(X_t^π, a_t^π) − A*(X_t^π, a_t^π) ]

is an ({F_s}_{s≥0}, P)-martingale for any policy π; hence it is analogous to the martingale characterization of the optimal q-function in Theorem 9.

Appendix B. ∆t-Parameterized Q-Learning

Instead of the (little) q-learning approach developed in this paper, one can also apply the conventional (big) Q-learning method to the Q-function Q_{∆t}(t, x, a; π) defined in (15). Note that ∆t > 0 becomes a parameter in the latter approach. In this Appendix, we review this ∆t-parameterized Q-learning and the associated Q-learning algorithms, which are used in the simulation experiments of the paper for comparison purposes. Given a policy π and a time step ∆t > 0, Q-learning focuses on learning Q_{∆t}(t, x, a; π). As with q-learning, there are two distinct scenarios: when the normalizing constant is easily available and when it is not. In the following, we denote by Q^ψ_{∆t} the function approximator of the Q-function, where ψ ∈ Ψ ⊆ R^{L_ψ} is the finite-dimensional parameter vector to be learned, and by π^ψ the associated policy, where π^ψ(a|t, x) ∝ exp{ (1/(γ∆t)) Q^ψ_{∆t}(t, x, a) }.

Appendix B1.
When the normalizing constant is available

When the normalizing constant ∫_A exp{ (1/(γ∆t)) Q^ψ_{∆t}(t, x, a) } da can be explicitly calculated, we can obtain the exact expression of π^ψ as

π^ψ(a|t, x) = exp{ (1/(γ∆t)) Q^ψ_{∆t}(t, x, a) } / ∫_A exp{ (1/(γ∆t)) Q^ψ_{∆t}(t, x, a′) } da′.

Then the conventional Q-learning method (e.g., SARSA; Sutton and Barto 2018, Chapter 10) leads to the following updating rule for the Q-function parameters ψ:

ψ ← ψ + α_ψ [ Q^ψ_{∆t}(t+∆t, X_{t+∆t}^{a_t}, a_{t+∆t}) − γ log π^ψ(a_{t+∆t} | t+∆t, X_{t+∆t}^{a_t}) ∆t − Q^ψ_{∆t}(t, X_t, a_t) + r(t, X_t, a_t) ∆t − β Q^ψ_{∆t}(t, X_t, a_t) ∆t ] ∂Q^ψ_{∆t}/∂ψ (t, X_t, a_t),    (55)

where a_t ∼ π^ψ(·|t, X_t) and a_{t+∆t} ∼ π^ψ(·|t+∆t, X_{t+∆t}).

Appendix B2. When the normalizing constant is unavailable

When the normalizing constant is not available, we cannot compute the exact expression of π^ψ. In this case we follow an analysis analogous to that in Subsection 5.1, using a family of policies {π^φ(a|t, x)}_{φ∈Φ} whose densities can be easily calculated to approximate π^ψ. Theorem 10 implies that if we start from a policy π^φ, φ ∈ Φ, and evaluate its corresponding Q-function

(1/(γ∆t)) Q^ψ_{∆t}(t, x, a) ≈ (1/γ) H( t, x, a, ∂J/∂x (t, x; π^φ), ∂²J/∂x² (t, x; π^φ) ) + (1/(γ∆t)) J(t, x; π^φ),

then a new policy π^{φ′}, φ′ ∈ Φ, improves π^φ so long as

D_KL( π^{φ′}(·|t, x) ∥ exp{ (1/(γ∆t)) Q^ψ_{∆t}(t, x, ·) } ) ≤ D_KL( π^φ(·|t, x) ∥ exp{ (1/(γ∆t)) Q^ψ_{∆t}(t, x, ·) } ).

Thus, following the same line of argument as in Subsection 5.1, we derive the updating rule of φ at every step as

φ ← φ − α_φ [ log π^φ(a_t|t, X_t) − (1/(γ∆t)) Q^ψ_{∆t}(t, X_t, a_t) ] ∂/∂φ log π^φ(a_t|t, X_t),

where a_t ∼ π^φ(·|t, X_t).

Appendix B3. Ergodic tasks

For the general regularized ergodic problem formulated in Section 6, we define the Q-function:

Q_{∆t}(x, a; π) = J(x; π) + [ H( x, a, ∂J/∂x (x; π), ∂²J/∂x² (x; π) ) − V(π) ] ∆t.    (56)

Then we have all the parallel results regarding the relations among the Q-function, the value function, and the intertemporal increment of the Q-function, and can devise algorithms accordingly. For illustration, we only present the results when

π^ψ(a|x) = exp{ (1/(γ∆t)) Q^ψ_{∆t}(x, a) } / ∫_A exp{ (1/(γ∆t)) Q^ψ_{∆t}(x, a′) } da′

is explicitly available.
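The ∆t-scaling in the updating rule (55) of Appendix B1 can be sketched in code. Below is a minimal Python illustration with a hypothetical linear function approximator Q^ψ(t, x, a) = ψ·φ(t, x, a); the feature map, step sizes, and the sampled transition are all illustrative assumptions, not part of the paper's algorithms:

```python
import numpy as np

def phi(t, x, a):
    # hypothetical feature map for a linear approximator Q_psi = psi . phi
    return np.array([1.0, x, a, x * a])

def sarsa_update(psi, t, x, a, r, x_next, a_next, dt, beta, gamma, lr, logp_next):
    """One step of the dt-parameterized SARSA-style update in the spirit of (55).

    logp_next stands for log pi^psi(a_next | t+dt, x_next), assumed given.
    """
    q_now = psi @ phi(t, x, a)
    q_next = psi @ phi(t + dt, x_next, a_next)
    # TD error: every "rate" term (reward, entropy, discounting) is scaled
    # by dt, while the difference of Q-values is not.
    delta = q_next - gamma * logp_next * dt - q_now + r * dt - beta * q_now * dt
    return psi + lr * delta * phi(t, x, a)

psi = np.zeros(4)
psi = sarsa_update(psi, t=0.0, x=1.0, a=0.5, r=1.0, x_next=1.1, a_next=0.4,
                   dt=0.1, beta=0.1, gamma=0.5, lr=0.01, logp_next=-0.9)
```

Because the Q-value difference is not scaled by ∆t while the rate terms are, the relative weight of the two parts of the TD error changes with ∆t, which is one way to see the ∆t-sensitivity of conventional Q-learning observed in the experiments.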
The updating rules of the parameters ψ and V are

ψ ← ψ + α_ψ [ Q^ψ_{∆t}(X_{t+∆t}^{a_t}, a_{t+∆t}) − γ log π^ψ(a_{t+∆t} | X_{t+∆t}^{a_t}) ∆t − Q^ψ_{∆t}(X_t, a_t) + r(X_t, a_t) ∆t − V ∆t ] ∂Q^ψ_{∆t}/∂ψ (X_t, a_t),
V ← V + α_V [ Q^ψ_{∆t}(X_{t+∆t}^{a_t}, a_{t+∆t}) − γ log π^ψ(a_{t+∆t} | X_{t+∆t}^{a_t}) ∆t − Q^ψ_{∆t}(X_t, a_t) + r(X_t, a_t) ∆t − V ∆t ],    (57)

where a_t ∼ π^ψ(·|X_t) and a_{t+∆t} ∼ π^ψ(·|X_{t+∆t}).

Appendix B4. Q-learning for mean–variance portfolio selection

Motivated by the form of the solution in Wang and Zhou (2020), we parameterize the Q-function by

Q^ψ_{∆t}(t, x, a; w) = −e^{−ψ3(T−t)} [ (x−w)² + ψ2 a(x−w) + (1/2) e^{−ψ1} a² ] + ψ4(t² − T²) + ψ5(t − T) + (w − z)².

Here the coefficient of a² is −(1/2) e^{−ψ3(T−t)−ψ1} < 0, which guarantees the concavity of the Q-function in a so that the function can be maximized. This Q-function in turn gives rise to an explicit form of policy, which is Gaussian:

π^ψ(·|t, x; w) = N( −ψ2 e^{ψ1}(x−w), γ∆t e^{ψ3(T−t)+ψ1} ) ∝ exp{ (1/(γ∆t)) Q^ψ_{∆t}(t, x, ·) }.

We need the following derivative in the corresponding Q-learning algorithms:

∂Q^ψ_{∆t}/∂ψ (t, x, a; w) = ( (1/2) e^{−ψ3(T−t)−ψ1} a², −e^{−ψ3(T−t)} a(x−w), (T−t) e^{−ψ3(T−t)} [ (x−w)² + ψ2 a(x−w) + (1/2) e^{−ψ1} a² ], t² − T², t − T )⊤.

The parameters in the Q-function can then be updated based on (55), and the Lagrange multiplier can be updated similarly as in Algorithm 5.

Appendix B5. Q-learning for ergodic LQ control

For the ergodic LQ control problem formulated in Subsection 7.2, we can prove that the value function and the Hamiltonian are both quadratic in (x, a); see, e.g., Wang et al. (2020). Therefore we parameterize the Q-function by a general quadratic function:

Q^ψ_{∆t}(x, a) = −(e^{−ψ3}/2) (a − ψ1 x − ψ2)² + ψ4 x² + ψ5 x.

Here we omit the constant term, since the value function is unique only up to a constant. The coefficient −e^{−ψ3}/2 < 0 ensures that the Q-function is concave in a. The optimal value V is an extra parameter. The Q-function leads to

π^ψ(·|x) = N( ψ1 x + ψ2, γ∆t e^{ψ3} ) ∝ exp{ (1/(γ∆t)) Q^ψ_{∆t}(x, ·) }.

Finally, we compute the following derivative for use in Q-learning algorithms:

∂Q^ψ_{∆t}/∂ψ (x, a) = ( e^{−ψ3}(a − ψ1 x − ψ2)x, e^{−ψ3}(a − ψ1 x − ψ2), (e^{−ψ3}/2)(a − ψ1 x − ψ2)², x², x )⊤.

Then the parameters can be updated based on (57).

Appendix C.
Proofs of Statements

Proof of Theorem 2

We first prove a preliminary result about the entropy-maximizing density.

Lemma 13 Let γ > 0 and a measurable function q : A → R with ∫_A exp{ (1/γ) q(a) } da < ∞ be given. Then

π*(a) = exp{ (1/γ) q(a) } / ∫_A exp{ (1/γ) q(a′) } da′ ∈ P(A)

is the unique maximizer of the following problem:

max_{π(·)∈P(A)} ∫_A [ q(a) − γ log π(a) ] π(a) da.    (58)

Proof Set ν = 1 − log ∫_A exp{ (1/γ) q(a) } da and consider a new problem:

max_{π(·)∈P(A)} ∫_A { [ q(a) − γ log π(a) ] π(a) + γν π(a) } da − γν.    (59)

Since ∫_A π(a) da = 1 due to π(·) ∈ P(A), problems (58) and (59) are equivalent; hence we focus on problem (59). Relaxing the constraint of being a density function, we deduce

max_{π(·)∈P(A)} ∫_A { [ q(a) − γ log π(a) ] π(a) + γν π(a) } da − γν ≤ ∫_A max_{π>0} { [ q(a) − γ log π ] π + γν π } da − γν.    (60)

The unique maximizer of the inner optimization on the right-hand side of (60) is

π(a) = exp{ (1/γ) q(a) + ν − 1 } = exp{ (1/γ) q(a) } / ∫_A exp{ (1/γ) q(a′) } da′ =: π*(a).

As π* ∈ P(A), it is the optimal solution to (59) or, equivalently, (58). The uniqueness follows from that of the inner optimization in (60).

Now we are ready to prove Theorem 2. To ease notation we use the q-function q, even though it had not been introduced prior to Theorem 2 in the main text. We also employ the results of Theorem 6, whose proof relies on neither Theorem 2 nor its proof here. Recall from (16) that q(t, x, a; π) is defined to be

q(t, x, a; π) = ∂J/∂t (t, x; π) + H( t, x, a, ∂J/∂x (t, x; π), ∂²J/∂x² (t, x; π) ) − βJ(t, x; π),

with the Hamiltonian H defined in (3).

Proof [Proof of Theorem 2] We work on the equivalent formulation (8)–(9).
For two given admissible policies π and π′, and any 0 ≤ t ≤ T, apply Itô's lemma to the process e^{−βs} J(s, X̃_s^{π′}; π), s ∈ [t, T], which involves the value function under π but the state process under π′:

e^{−βT} J(T, X̃_T^{π′}; π) − e^{−βt} J(t, X̃_t^{π′}; π) + ∫_t^T e^{−βs} ∫_A [ r(s, X̃_s^{π′}, a) − γ log π′(a|s, X̃_s^{π′}) ] π′(a|s, X̃_s^{π′}) da ds
= ∫_t^T e^{−βs} ∫_A [ ∂J/∂t (s, X̃_s^{π′}; π) + H( s, X̃_s^{π′}, a, ∂J/∂x (s, X̃_s^{π′}; π), ∂²J/∂x² (s, X̃_s^{π′}; π) ) − βJ(s, X̃_s^{π′}; π) − γ log π′(a|s, X̃_s^{π′}) ] π′(a|s, X̃_s^{π′}) da ds + ∫_t^T e^{−βs} ∂J/∂x (s, X̃_s^{π′}; π) · σ̃( s, X̃_s^{π′}, π′(·|s, X̃_s^{π′}) ) dW_s
= ∫_t^T e^{−βs} ∫_A [ q(s, X̃_s^{π′}, a; π) − γ log π′(a|s, X̃_s^{π′}) ] π′(a|s, X̃_s^{π′}) da ds + ∫_t^T e^{−βs} ∂J/∂x (s, X̃_s^{π′}; π) · σ̃( s, X̃_s^{π′}, π′(·|s, X̃_s^{π′}) ) dW_s.

Because π′ = Iπ, it follows from Lemma 13 that for any (s, y) ∈ [0, T] × R^d, we have

∫_A [ q(s, y, a; π) − γ log π′(a|s, y) ] π′(a|s, y) da ≥ ∫_A [ q(s, y, a; π) − γ log π(a|s, y) ] π(a|s, y) da = 0,

where the equality is due to (20) in Theorem 6. Thus,

e^{−βT} J(T, X̃_T^{π′}; π) − e^{−βt} J(t, X̃_t^{π′}; π) + ∫_t^T e^{−βs} ∫_A [ r(s, X̃_s^{π′}, a) − γ log π′(a|s, X̃_s^{π′}) ] π′(a|s, X̃_s^{π′}) da ds ≥ ∫_t^T e^{−βs} ∂J/∂x (s, X̃_s^{π′}; π) · σ̃( s, X̃_s^{π′}, π′(·|s, X̃_s^{π′}) ) dW_s.

The above argument and the resulting inequalities remain valid when T is replaced by T ∧ u_n, where u_n = inf{ s ≥ t : |X̃_s^{π′}| ≥ n } is a sequence of stopping times. Therefore,

J(t, x; π) ≤ E^{P^W}[ e^{−β(T∧u_n−t)} h(X̃_{T∧u_n}^{π′}) + ∫_t^{T∧u_n} e^{−β(s−t)} ∫_A [ r(s, X̃_s^{π′}, a) − γ log π′(a|s, X̃_s^{π′}) ] π′(a|s, X̃_s^{π′}) da ds | X̃_t^{π′} = x ].

It follows from Assumption 1-(iv), Definition 1-(iii), and the moment estimate in Jia and Zhou (2022b, Lemma 2) that there exist constants µ, C > 0 such that

E^{P^W}[ ∫_t^{T∧u_n} e^{−β(s−t)} ∫_A | r(s, X̃_s^{π′}, a) − γ log π′(a|s, X̃_s^{π′}) | π′(a|s, X̃_s^{π′}) da ds | X̃_t^{π′} = x ] ≤ C ∫_t^T E^{P^W}[ 1 + |X̃_s^{π′}|^µ | X̃_t^{π′} = x ] ds ≤ C(1 + |x|^µ).

By the dominated convergence theorem, we have as n → ∞,

E^{P^W}[ ∫_t^{T∧u_n} e^{−β(s−t)} ∫_A [ r(s, X̃_s^{π′}, a) − γ log π′(a|s, X̃_s^{π′}) ] π′(a|s, X̃_s^{π′}) da ds | X̃_t^{π′} = x ] → E^{P^W}[ ∫_t^T e^{−β(s−t)} ∫_A [ r(s, X̃_s^{π′}, a) − γ log π′(a|s, X̃_s^{π′}) ] π′(a|s, X̃_s^{π′}) da ds | X̃_t^{π′} = x ].

In addition, E^{P^W}[ e^{−β(T∧u_n−t)} h(X̃_{T∧u_n}^{π′}) | X̃_t^{π′} = x ] ... E^{P^W}[ h(X̃_T^{π′}) 1{ max_{t≤s≤T} |X̃_s^{π′}| ... } ] ... Suppose there exist (t*, x*, a*) and ϵ > 0 such that f(t*, x*, a*) > ϵ.
Because f is continuous, there exists δ > 0 such that f(u, x′, a′) > ϵ/2 for all (u, x′, a′) with |u − t*| ∨ |x′ − x*| ∨ |a′ − a*| < δ. Here ∨ means taking the larger one, i.e., u ∨ v = max{u, v}. Now consider the state process, still denoted by X^π, starting from (t*, x*), namely, {X_s^π, t* ≤ s ≤ T} follows (6) with X_{t*}^π = x* and a_{t*}^π = a*. Define

τ = inf{ u ≥ t* : |u − t*| > δ or |X_u^π − x*| > δ } = inf{ u ≥ t* : |X_u^π − x*| > δ } ∧ (t* + δ).

The continuity of X^π implies that τ > t*, P-almost surely. Here ∧ means taking the smaller one, i.e., u ∧ v = min{u, v}. We have already proved that there exists Ω_0 ∈ F with P(Ω_0) = 0 such that for all ω ∈ Ω \ Ω_0, ∫_{t*}^s e^{−βu} f(u, X_u^π(ω), a_u^π(ω)) du = 0 for all s ∈ [t*, T]. It follows from Lebesgue's differentiation theorem that for any ω ∈ Ω \ Ω_0, f(s, X_s^π(ω), a_s^π(ω)) = 0 for a.e. s ∈ [t*, τ(ω)]. Consider the set

Z(ω) = { s ∈ [t*, τ(ω)] : a_s^π(ω) ∈ B_δ(a*) } ⊆ [t*, τ(ω)],

where B_δ(a*) = { a ∈ A : |a − a*| < δ } is the δ-neighborhood of a*. Because f(s, X_s^π(ω), a_s^π(ω)) > ϵ/2 when s ∈ Z(ω), we conclude that Z(ω) has Lebesgue measure zero for any ω ∈ Ω \ Ω_0. That is,

∫_{[t*,T]} 1{ s ≤ τ(ω) } 1{ a_s^π(ω) ∈ B_δ(a*) } ds = 0.

Integrating over ω with respect to P and applying Fubini's theorem, we obtain

0 = ∫_Ω ∫_{[t*,T]} 1{ s ≤ τ(ω) } 1{ a_s^π(ω) ∈ B_δ(a*) } ds P(dω) = ∫_{[t*,T]} E[ 1{ s ≤ τ } 1{ a_s^π ∈ B_δ(a*) } ] ds
= ∫_{t*}^T E[ 1{ s ≤ τ } P( a_s^π ∈ B_δ(a*) | F_s ) ] ds = ∫_{t*}^T E[ 1{ s ≤ τ } ∫_{B_δ(a*)} π(a|s, X_s^π) da ] ds
≥ min_{|x′−x*|≤δ, |u−t*|≤δ} { ∫_{B_δ(a*)} π(a|u, x′) da } ∫_{t*}^T E[ 1{ s ≤ τ } ] ds = min_{|x′−x*|≤δ, |u−t*|≤δ} { ∫_{B_δ(a*)} π(a|u, x′) da } E[ (τ ∧ T) − t* ] ≥ 0.

Since τ > t* P-almost surely, the above implies min_{|x′−x*|≤δ, |u−t*|≤δ} { ∫_{B_δ(a*)} π(a|u, x′) da } = 0. However, this contradicts Definition 1 on admissible policies. Indeed, Definition 1-(i) stipulates supp π(·|t, x) = A for any (t, x); hence ∫_{B_δ(a*)} π(a|t, x) da > 0. Then the continuity in Definition 1-(ii) yields min_{|x′−x*|≤δ, |u−t*|≤δ} { ∫_{B_δ(a*)} π(a|u, x′) da } > 0, a contradiction. Hence we conclude q(t, x, a; π) = q̂(t, x, a) for every (t, x, a).
(ii) Applying Itô's lemma to e^{−βs} J(s, X_s^π; π), we get

e^{−βs} J(s, X_s^π; π) − e^{−βt} J(t, x; π) + ∫_t^s e^{−βu} [ r(u, X_u^π, a_u^π) − q̂(u, X_u^π, a_u^π) ] du
= ∫_t^s e^{−βu} [ q(u, X_u^π, a_u^π; π) − q̂(u, X_u^π, a_u^π) ] du + ∫_t^s e^{−βu} ∂J/∂x (u, X_u^π; π) · σ(u, X_u^π, a_u^π) dW_u.

So, when q̂ ≡ q(·, ·, ·; π), the above process is an ({F_s}_{s≥0}, P)-martingale on [t, T].

(iii) Let π′ ∈ Π be given satisfying the assumption in this part. It then follows from (ii) that ∫_t^s e^{−βu} [ q̂(u, X_u^{π′}, a_u^{π′}) − q(u, X_u^{π′}, a_u^{π′}; π) ] du is an ({F_s}_{s≥0}, P)-martingale. If the desired conclusion were not true, then the same argument as in (i) would apply to conclude that min_{|x′−x*|≤δ, |u−t*|≤δ} { ∫_{B_δ(a*)} π′(a|u, x′) da } = 0, which is a contradiction because π′ is admissible.

Proof of Theorem 7 (i) We only prove the "only if" part, because the "if" part is straightforward following the same argument as in the proof of Theorem 6. So, we assume e^{−βt} Ĵ(t, X_t^π) + ∫_0^t e^{−βs} [ r(s, X_s^π, a_s^π) − q̂(s, X_s^π, a_s^π) ] ds is an ({F_s}_{s≥0}, P)-martingale. Hence for any initial state (t, x), we have

E^P[ e^{−βT} Ĵ(T, X_T^π) + ∫_t^T e^{−βs} [ r(s, X_s^π, a_s^π) − q̂(s, X_s^π, a_s^π) ] ds | F_t ] = e^{−βt} Ĵ(t, x).

We integrate over the action randomization with respect to the policy π, and then obtain

E^{P^W}[ e^{−βT} Ĵ(T, X̃_T^π) + ∫_t^T e^{−βs} ∫_A [ r(s, X̃_s^π, a) − q̂(s, X̃_s^π, a) ] π(a|s, X̃_s^π) da ds | F_t^W ] = e^{−βt} Ĵ(t, x).

This, together with the terminal condition Ĵ(T, x) = h(x) and the constraint (21), yields

Ĵ(t, x) = E^{P^W}[ e^{−β(T−t)} h(X̃_T^π) + ∫_t^T e^{−β(s−t)} ∫_A [ r(s, X̃_s^π, a) − γ log π(a|s, X̃_s^π) ] π(a|s, X̃_s^π) da ds | F_t^W ].

Hence Ĵ(t, x) = J(t, x; π) for all (t, x) ∈ [0, T] × R^d. Furthermore, based on Theorem 6, the martingale condition implies that q̂(t, x, a) = q(t, x, a; π) for all (t, x, a) ∈ [0, T] × R^d × A.

(ii) This follows immediately from Theorem 6-(ii).

(iii) Let π′ ∈ Π be given satisfying the assumption in this part. Define

r̂(t, x, a) := ∂Ĵ/∂t (t, x) + b(t, x, a) · ∂Ĵ/∂x (t, x) + (1/2) σσ⊤(t, x, a) ∘ ∂²Ĵ/∂x² (t, x) − βĴ(t, x).
Then e^{−βs} Ĵ(s, X_s^{π′}) − ∫_t^s e^{−βu} r̂(u, X_u^{π′}, a_u^{π′}) du is an ({F_s}_{s≥0}, P)-local martingale, which follows from applying Itô's lemma to this process and arguing similarly to the proof of Theorem 6-(ii). As a result, ∫_t^s e^{−βu} [ r(u, X_u^{π′}, a_u^{π′}) − q̂(u, X_u^{π′}, a_u^{π′}) + r̂(u, X_u^{π′}, a_u^{π′}) ] du is an ({F_s}_{s≥0}, P)-martingale. The same argument as in the proof of Theorem 6 applies, and we conclude

q̂(t, x, a) = r(t, x, a) + r̂(t, x, a)
= ∂Ĵ/∂t (t, x) + b(t, x, a) · ∂Ĵ/∂x (t, x) + (1/2) σσ⊤(t, x, a) ∘ ∂²Ĵ/∂x² (t, x) − βĴ(t, x) + r(t, x, a)
= ∂Ĵ/∂t (t, x) + H( t, x, a, ∂Ĵ/∂x (t, x), ∂²Ĵ/∂x² (t, x) ) − βĴ(t, x)

for every (t, x, a). Now the constraint (21) reads

∫_A [ ∂Ĵ/∂t (t, x) + H( t, x, a, ∂Ĵ/∂x (t, x), ∂²Ĵ/∂x² (t, x) ) − βĴ(t, x) − γ log π(a|t, x) ] π(a|t, x) da = 0, for all (t, x),

which, together with the terminal condition Ĵ(T, x) = h(x), is the Feynman–Kac PDE (11) for Ĵ. Therefore, it follows from the uniqueness of the solution to (11) that Ĵ ≡ J(·, ·; π). Moreover, it follows from Theorem 6-(iii) that q̂ ≡ q(·, ·, ·; π). Finally, if π(a|t, x) = exp{ (1/γ) q̂(t, x, a) } / ∫_A exp{ (1/γ) q̂(t, x, a′) } da′, then π = Iπ, where I is the map defined in Theorem 2. This in turn implies that π is the optimal policy and, hence, Ĵ is the optimal value function.

Proof of Proposition 8 Proof Divide both sides of (24) by γ, take exponentials, and then integrate over A. Comparing the resulting equation with the exploratory HJB equation (14), we get

γ log ∫_A exp{ (1/γ) q*(t, x, a) } da = 0.

This yields (25). The expression (26) then follows from (13).

Proof of Theorem 9 (i) Let Ĵ = J* and q̂ = q* be the optimal value function and optimal q-function, respectively.
For any π ∈ Π, applying Itô's lemma to the process e^{−βs} J*(s, X_s^π), we obtain for 0 ≤ t < s ≤ T:

e^{−βs} J*(s, X_s^π) − e^{−βt} J*(t, x) + ∫_t^s e^{−βu} [ r(u, X_u^π, a_u^π) − q*(u, X_u^π, a_u^π) ] du
= ∫_t^s e^{−βu} [ ∂J*/∂t (u, X_u^π) + H( u, X_u^π, a_u^π, ∂J*/∂x (u, X_u^π), ∂²J*/∂x² (u, X_u^π) ) − βJ*(u, X_u^π) − q*(u, X_u^π, a_u^π) ] du + ∫_t^s e^{−βu} ∂J*/∂x (u, X_u^π) · σ(u, X_u^π, a_u^π) dW_u
= ∫_t^s e^{−βu} ∂J*/∂x (u, X_u^π) · σ(u, X_u^π, a_u^π) dW_u,

where the du term vanishes due to the definition of q* in (24). Hence (28) is a martingale.

(ii) The second constraint in (27) implies that π̂(a|t, x) := exp{ (1/γ) q̂(t, x, a) } is a probability density function, and q̂(t, x, a) = γ log π̂(a|t, x). So q̂ satisfies the second constraint in (21) with respect to the policy π̂. When (28) is an ({F_s}_{s≥0}, P)-martingale under the given admissible policy π, it follows from Theorem 7-(iii) that Ĵ and q̂ are respectively the value function and the q-function associated with π̂. Then the improved policy is

Iπ̂(a|t, x) := exp{ (1/γ) q̂(t, x, a) } / ∫_A exp{ (1/γ) q̂(t, x, a′) } da′ = exp{ (1/γ) q̂(t, x, a) } = π̂(a|t, x).

Theorem 2 then yields that π̂ is optimal, completing the proof.

Proof of Theorem 10 Proof Let us compute the KL-divergence:

∫_A [ log π′(a|t, x) − (1/γ) H( t, x, a, ∂J/∂x (t, x; π), ∂²J/∂x² (t, x; π) ) ] π′(a|t, x) da
= D_KL( π′(·|t, x) ∥ exp{ (1/γ) H( t, x, ·, ∂J/∂x (t, x; π), ∂²J/∂x² (t, x; π) ) } )
≤ D_KL( π(·|t, x) ∥ exp{ (1/γ) H( t, x, ·, ∂J/∂x (t, x; π), ∂²J/∂x² (t, x; π) ) } )
= ∫_A [ log π(a|t, x) − (1/γ) H( t, x, a, ∂J/∂x (t, x; π), ∂²J/∂x² (t, x; π) ) ] π(a|t, x) da.

Multiplying by −γ and adding the a-independent terms ∂J/∂t (t, x; π) − βJ(t, x; π) to both sides (which is possible because π′ and π are both probability densities), this is equivalent to

∫_A [ q(t, x, a; π) − γ log π′(a|t, x) ] π′(a|t, x) da ≥ ∫_A [ q(t, x, a; π) − γ log π(a|t, x) ] π(a|t, x) da.

Following the same argument as in the proof of Theorem 2, we conclude J(t, x; π′) ≥ J(t, x; π).

Proof of Lemma 11 Proof The first statement has been proved in Jia and Zhou (2022b, Lemma 7). We focus on the proof of the second statement, which is similar to that of Theorem 2.
For the two admissible policies π and π′, we apply Itô's lemma to the value function under π along the process X̃^{π′} to obtain

J(X̃_u^{π′}; π) − J(X̃_t^{π′}; π) + ∫_t^u ∫_A [ r(X̃_s^{π′}, a) − γ log π′(a|X̃_s^{π′}) ] π′(a|X̃_s^{π′}) da ds
= ∫_t^u ∫_A [ H( X̃_s^{π′}, a, ∂J/∂x (X̃_s^{π′}; π), ∂²J/∂x² (X̃_s^{π′}; π) ) − γ log π′(a|X̃_s^{π′}) ] π′(a|X̃_s^{π′}) da ds + ∫_t^u ∂J/∂x (X̃_s^{π′}; π) · σ̃( X̃_s^{π′}, π′(·|X̃_s^{π′}) ) dW_s
≥ ∫_t^u ∫_A [ H( X̃_s^{π′}, a, ∂J/∂x (X̃_s^{π′}; π), ∂²J/∂x² (X̃_s^{π′}; π) ) − γ log π(a|X̃_s^{π′}) ] π(a|X̃_s^{π′}) da ds + ∫_t^u ∂J/∂x (X̃_s^{π′}; π) · σ̃( X̃_s^{π′}, π′(·|X̃_s^{π′}) ) dW_s
= V(π)(u − t) + ∫_t^u ∂J/∂x (X̃_s^{π′}; π) · σ̃( X̃_s^{π′}, π′(·|X̃_s^{π′}) ) dW_s,

where the inequality is due to the same argument as in the proof of Theorem 10, and the last equality is due to the Feynman–Kac formula (38). Therefore, after a localization argument, we have

(1/T) E[ J(X̃_T^{π′}; π) − J(x; π) + ∫_0^T ∫_A [ r(X̃_s^{π′}, a) − γ log π′(a|X̃_s^{π′}) ] π′(a|X̃_s^{π′}) da ds | X̃_0^{π′} = x ] ≥ V(π).

Taking the limit T → ∞, we obtain V(π′) ≥ V(π).

Proof of Theorem 12 Proof The proof is parallel to that of Theorems 6 and 7, and is omitted.

References

L. C. Baird. Advantage updating. Technical report, Wright Laboratory, Wright-Patterson Air Force Base, OH 45433-7301, USA, 1993.

L. C. Baird. Reinforcement learning in continuous time: Advantage updating. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN'94), volume 4, pages 2448–2453. IEEE, 1994.

C. Beck, M. Hutzenthaler, and A. Jentzen. On nonlinear Feynman–Kac formulas for viscosity solutions of semilinear parabolic partial differential equations. Stochastics and Dynamics, page 2150048, 2021.

G. Brunick and S. Shreve. Mimicking an Itô process by a solution of a stochastic differential equation. The Annals of Applied Probability, 23(4):1584–1628, 2013.

K. Doya. Reinforcement learning in continuous time and space. Neural Computation, 12(1):219–245, 2000.

Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pages 1329–1338. PMLR, 2016.

J. Fan, Z. Wang, Y. Xie, and Z.
Yang. A theoretical analysis of deep Q-learning. In Learning for Dynamics and Control, pages 486 489. PMLR, 2020. W. H. Fleming and H. M. Soner. Controlled Markov Processes and Viscosity Solutions, volume 25. Springer Science & Business Media, 2006. X. Gao, Z. Q. Xu, and X. Y. Zhou. State-dependent temperature control for Langevin diffusions. SIAM Journal on Control and Optimization, pages 1 26, 2020. S. Gu, T. Lillicrap, I. Sutskever, and S. Levine. Continuous deep Q-learning with modelbased acceleration. In International Conference on Machine Learning, pages 2829 2838. PMLR, 2016. T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861 1870. PMLR, 2018a. T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel, et al. Soft actor-critic algorithms and applications. ar Xiv preprint ar Xiv:1812.05905, 2018b. Jia and Zhou X. Han, R. Wang, and X. Y. Zhou. Choquet regularization for reinforcement learning. ar Xiv preprint ar Xiv:2208.08497, 2022. Y. Jia and X. Y. Zhou. Policy evaluation and temporal-difference learning in continuous time and space: A martingale approach. Journal of Machine Learning Research, 23(198): 1 55, 2022a. Y. Jia and X. Y. Zhou. Policy gradient and actor-critic learning in continuous time and space: Theory and algorithms. Journal of Machine Learning Research, 23(275):1 50, 2022b. I. Karatzas and S. Shreve. Brownian Motion and Stochastic Calculus, volume 113. Springer, 2014. J. Kim, J. Shin, and I. Yang. Hamilton-Jacobi deep Q-learning for deterministic continuoustime systems with Lipschitz continuous controls. Journal of Machine Learning Research, 22:206 1, 2021. A. E. Lim and X. Y. Zhou. Mean-variance portfolio selection with random parameters in a complete market. Mathematics of Operations Research, 27(1):101 120, 2002. X. Mao. 
Stochastic Differential Equations and Applications. Woodhead, 2 edition, 2007. ISBN 978-1-904275-34-3. F. S. Melo, S. P. Meyn, and M. I. Ribeiro. An analysis of reinforcement learning with function approximation. In Proceedings of the 25th International Conference on Machine Learning, pages 664 671, 2008. S. P. Meyn and R. L. Tweedie. Stability of Markovian processes III: Foster Lyapunov criteria for continuous-time processes. Advances in Applied Probability, 25(3):518 548, 1993. V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529 533, 2015. V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928 1937. PMLR, 2016. B. O Donoghue. Variational Bayesian reinforcement learning with regret bounds. Advances in Neural Information Processing Systems, 34:28208 28221, 2021. I. Osband, B. Van Roy, D. J. Russo, and Z. Wen. Deep exploration via randomized value functions. Journal of Machine Learning Research, 20(124):1 62, 2019. J. Schulman, X. Chen, and P. Abbeel. Equivalence between policy gradients and soft Qlearning. ar Xiv preprint ar Xiv:1704.06440, 2017. D. Shah and Q. Xie. Q-learning with nearest neighbors. Advances in Neural Information Processing Systems, 31, 2018. Continuous-Time q-Learning Y. Sun. The exact law of large numbers via Fubini extension and characterization of insurable risks. Journal of Economic Theory, 126(1):31 69, 2006. R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 2018. C. Tallec, L. Blier, and Y. Ollivier. Making deep Q-learning methods robust to time discretization. In International Conference on Machine Learning, pages 6096 6104. PMLR, 2019. W. Tang, Y. P. Zhang, and X. Y. Zhou. 
Exploratory HJB equations and their convergence. SIAM Journal on Control and Optimization, 60(6):3191 3216, 2022. M. Uehara, C. Shi, and N. Kallus. A review of off-policy evaluation in reinforcement learning. ar Xiv preprint ar Xiv:2212.06355, 2022. H. Wang and X. Y. Zhou. Continuous-time mean variance portfolio selection: A reinforcement learning framework. Mathematical Finance, 30(4):1273 1308, 2020. H. Wang, T. Zariphopoulou, and X. Y. Zhou. Reinforcement learning in continuous time and space: A stochastic control approach. Journal of Machine Learning Research, 21 (198):1 34, 2020. C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3):279 292, 1992. C. J. C. H. Watkins. Learning from delayed rewards. Ph D thesis, Cambridge University, 1989. W. M. Wonham. On the separation theorem of stochastic control. SIAM Journal on Control, 6(2):312 326, 1968. J. Yong and X. Y. Zhou. Stochastic Controls: Hamiltonian Systems and HJB Equations. New York, NY: Spinger, 1999. X. Y. Zhou. Curse of optimality, and how do we break it. SSRN preprint SSRN 3845462, 2021. X. Y. Zhou and D. Li. Continuous-time mean-variance portfolio selection: A stochastic LQ framework. Applied Mathematics and Optimization, 42(1):19 33, 2000. X. Y. Zhou and G. Yin. Markowitz s mean-variance portfolio selection with regime switching: A continuous-time model. SIAM Journal on Control and Optimization, 42(4):1466 1482, 2003. S. Zou, T. Xu, and Y. Liang. Finite-sample analysis for SARSA with linear function approximation. Advances in Neural Information Processing Systems, 32, 2019.