# POMDPs in Continuous Time and Discrete Spaces

Bastian Alt, Matthias Schultheis, Heinz Koeppl
Department of Electrical Engineering and Information Technology
Centre for Cognitive Science
Technische Universität Darmstadt
{bastian.alt, matthias.schultheis, heinz.koeppl}@bcs.tu-darmstadt.de

## Abstract

Many processes, such as discrete event systems in engineering or population dynamics in biology, evolve in discrete space and continuous time. We consider the problem of optimal decision making in such discrete state and action space systems under partial observability. This places our work at the intersection of optimal filtering and optimal control. At the current state of research, a mathematical description for simultaneous decision making and filtering in continuous time with finite state and action spaces is still missing. In this paper, we give a mathematical description of a continuous-time partially observable Markov decision process (POMDP). By leveraging optimal filtering theory we derive a Hamilton-Jacobi-Bellman (HJB) type equation that characterizes the optimal solution. Using techniques from deep learning we approximately solve the resulting partial integro-differential equation. We present (i) an approach solving the decision problem offline by learning an approximation of the value function and (ii) an online algorithm which provides a solution in belief space using deep reinforcement learning. We show the applicability on a set of toy examples which pave the way for future methods providing solutions for high-dimensional problems.

## 1 Introduction

Continuous-time models have been studied extensively in machine learning and control. They are especially beneficial for reasoning about latent variables at time points which are not included in the data. In a broad range of topics such as natural language processing [49], social media dynamics [31], or biology [18], to name just a few, the underlying process naturally evolves continuously in time.

In many applications the control of such time-continuous models is of interest. Numerous approaches already tackle the control problem for continuous state space systems; however, for many processes a discrete state space formulation is more suitable. This class of systems is discussed in the area of discrete event systems [10]. Decision making in these systems has a long history, yet, if the state is not fully observed, acting optimally in such systems is notoriously hard. Many approaches resort to heuristics such as applying a separation principle between inference and control. Unfortunately, this can lead to weak performance, as the agent does not incorporate the effects of its decisions on future inference. In the past, this problem was also approached by using a discrete-time formulation such as a POMDP model [22]. Nevertheless, it is not always straightforward to discretize the problem, as this requires adding pseudo-observations for time points without observations. Additionally, the time discretization can lead to problems when learning optimal controllers in the continuous-time setting [44]. A more principled way to approach this problem is to define the model in continuous time with a proper observation model and to solve the resulting model formulation. Still, it is not clear a priori how to design such a model, and even less how to control it in an optimal way.
In this paper, we provide a formulation of this problem by introducing a continuous-time analogue to the POMDP framework. We additionally show how methods from deep learning can be used to find approximate solutions for control under the continuous-time assumption. Our work can be seen as a first step towards scalable control laws for high-dimensional problems by making use of further approximation methods. An implementation of our proposed method is publicly available.¹

¹ https://git.rwth-aachen.de/bcs/pomdps_continuous_time_discrete_spaces

**Related Work.** Historically, optimal control in continuous time and space is a classical problem and has been addressed ever since the early works of Pontryagin [35] and Kalman [23]. Continuous-time reinforcement learning formulations have been studied [13, 5, 48] and resulted in works such as the advantage updating and advantage learning algorithms [3, 17] and, more recently, in function approximation methods [44]. Continuous-time formulations in the case of full observability and discrete spaces are considered in the context of semi-Markov decision processes (SMDPs) [4, 7], with applications, e.g., in e-commerce [53] or as part of the options framework [43, 21].

An important area of application for time-continuous discrete-state space models is discussed in the queueing theory literature [6]. Here, the state space describes the dynamics of the discrete number of customers in the queue. These models are used, for example, for internet communication protocols, such as in TCP congestion control. More generally, the topic is also considered within the area of stochastic hybrid systems [11], where mixtures of continuous and discrete state spaces are discussed. However, control under partial observability is often only considered under very specific assumptions on the dynamics.

A classical example of simultaneous optimal filtering and control in continuous space is the linear quadratic Gaussian (LQG) controller [41]. In the case of partial observability and discrete spaces, the widely used POMDP framework [22] builds a sound foundation for optimal decision making in discrete time and is the basis of many modern methods, e.g., [37, 54, 19]. By applying discretizations, it was also used to solve continuous state space problems, as discussed in [12]. Another existing framework which is close to our work is the partially observable semi-Markov decision process (POSMDP) [30], which has applications in fields such as robotics [34] or maintenance [40]. One major drawback of this framework is that observations are only allowed to occur at time points where the latent continuous-time Markov chain (CTMC) jumps. This assumption is very limiting, as in many applications observations are received at arbitrary time points or knowledge about jump times is not available.

The development of numerical solution methods for high-dimensional partial differential equations (PDEs), such as the HJB equation, is an ongoing topic of research. Popular approaches include techniques based on linearization such as differential dynamic programming [20, 46], stochastic simulation techniques as in the path integral formalism [24, 47], and collocation-based approaches as in [38]. The latter have been discussed extensively due to the recent advances in function approximation by neural networks, which have achieved huge empirical success [26]. Approaches solving HJB equations, among other PDEs, using deep learning can be found in [45, 29, 15, 39].
In the following section we introduce several results from optimal filtering and optimal control, which help to put our work into perspective.

## 2 Background

**Continuous-Time Markov Chains.** First, we discuss the basic notion of the continuous-time Markov chain (CTMC) $\{X(t)\}_{t \geq 0}$, which is a Markov model on a discrete state space $\mathcal{X} \subseteq \mathbb{N}$ and in continuous time $t \in \mathbb{R}_{\geq 0}$. Its evolution for a small time step $h$ is characterized by
$$P(X(t+h) = x \mid X(t) = x') = \delta_{x x'} + \Lambda(x', x)\,h + o(h),$$
with $\lim_{h \to 0} o(h)/h = 0$. Here, $\Lambda : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is the rate function, where $\Lambda(x', x)$ denotes the rate of switching from state $x'$ to state $x$. We define the exit rate of a state $x \in \mathcal{X}$ as $\Lambda(x) := \sum_{x' \neq x} \Lambda(x, x')$ and set $\Lambda(x, x) := -\Lambda(x)$. Due to the memoryless property of the Markov process, the waiting times between two jumps of the chain are exponentially distributed. The definition can also be extended to time-dependent rate functions, which correspond to a time-inhomogeneous CTMC; for further details see [33].

**Optimal Filtering of Latent CTMCs.** In the case of partial observability, we do not observe the CTMC $X(t)$ directly, but observe a stochastic process $Y(t)$ depending on $X(t)$. In this setting the goal of optimal filtering [2] is to calculate the filtering distribution, also referred to as the belief state at time $t$,
$$\pi(x, t) := P(X(t) = x \mid y_{[0,t)}),$$
where $y_{[0,t)} := \{y(s)\}_{s \in [0,t)}$ is the set of observations in the time interval $[0, t)$. The filtering distribution provides a sufficient statistic for the latent state $X(t)$. We denote the filtering distribution in vector form with components $\{\pi(x, t)\}_{x \in \mathcal{X}}$ by $\pi(t) \in \Delta^{|\mathcal{X}|}$ and use $\Delta^{|\mathcal{X}|}$ for the continuous belief space, which is the $|\mathcal{X}|$-dimensional probability simplex. In general, a large class of filter dynamics follow the description of a jump diffusion process; see [16] for an introduction and the control thereof.

The filter dynamics depend on an observation model, of which several have been discussed in the literature. A continuous-discrete filter [18], for example, uses a model with covariates $\{y_i\}_{i=1,\dots,n}$ observed at discrete time points $t_1 < t_2 < \dots < t_n$, i.e., $y_i := y(t_i)$, distributed according to $y_i \sim p(y_i \mid x)$, where $p(y_i \mid x) := p(y_i \mid X(t_i) = x)$ is a probability density or mass function. In this setting the filtering distribution $\pi(x, t) = P(X(t) = x \mid y_1, \dots, y_n)$, with $t_n < t$, follows the usual forward equation
$$\frac{\mathrm{d}}{\mathrm{d}t}\pi(x, t) = \sum_{x'} \Lambda(x', x)\,\pi(x', t) \tag{1}$$
between observation times $t_i$ and obeys the following reset conditions at observation times:
$$\pi(x, t_i) = \frac{p(y_i \mid x)\,\pi(x, t_i^-)}{\sum_{x'} p(y_i \mid x')\,\pi(x', t_i^-)}, \tag{2}$$
where $\pi(x, t_i^-)$ denotes the belief just before the observation. The reset conditions at the observation time points represent the posterior distribution of $X(t)$, which combines the likelihood model $p(y \mid x)$ with the previous filtering distribution $\pi(x', t_i^-)$ as prior. The filter equations, Eqs. (1) and (2), can be written in differential form as
$$\mathrm{d}\pi(x, t) = \sum_{x'} \Lambda(x', x)\,\pi(x', t)\,\mathrm{d}t + \left[\pi(x, t_{N(t)}) - \pi(x, t)\right] \mathrm{d}N(t), \tag{3}$$
which is a special case of a jump diffusion without a diffusion part. The observations enter the filtering equation through the counting process $N(t) = \sum_{i=1}^{n} \mathbb{1}(t_i \leq t)$ and the jump amplitude $\pi(x, t_{N(t)}) - \pi(x, t)$, which sets the filter to the new posterior distribution $\pi(x, t_{N(t)})$, see Eq. (2). Other optimal filters can be derived by solving the corresponding Kushner-Stratonovich equation [25, 36]. One instance is the Wonham filter [52, 18], which is discussed in Appendix A.1.
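The continuous-discrete filter admits a compact implementation. The following Python sketch illustrates Eqs. (1) and (2) for a two-state chain: the master equation is integrated with a simple Euler step between observations, and the Bayes reset is applied at each observation time. The rate matrix, the Gaussian emission model, and all parameter values are assumptions made for illustration only and are not taken from the paper.

```python
import numpy as np

# Minimal sketch of the continuous-discrete filter, Eqs. (1)-(2): Euler
# integration of the master equation between observations and a Bayes reset
# at each observation time. The two-state rate matrix and the Gaussian
# emission model below are illustrative assumptions.

Lam = np.array([[-1.0,  1.0],
                [ 2.0, -2.0]])            # Lam[x', x]: rate of switching x' -> x

def likelihood(y, sigma=0.5):
    """p(y | x): hypothetical Gaussian emission centered at the state index."""
    x = np.arange(Lam.shape[0])
    return np.exp(-0.5 * ((y - x) / sigma) ** 2)

def filter_trajectory(pi0, observations, t_end, dt=1e-3):
    """observations: list of (t_i, y_i) with increasing t_i."""
    pi, t, out = pi0.astype(float).copy(), 0.0, []
    pending = list(observations) + [(np.inf, None)]
    t_next, y_next = pending.pop(0)
    while t < t_end:
        if t >= t_next:                    # reset condition, Eq. (2)
            w = likelihood(y_next) * pi
            pi = w / w.sum()
            t_next, y_next = pending.pop(0)
        pi = pi + dt * (Lam.T @ pi)        # forward (master) equation, Eq. (1)
        pi = np.clip(pi, 0.0, None)
        pi = pi / pi.sum()                 # guard against numerical drift
        t += dt
        out.append((t, pi.copy()))
    return out

beliefs = filter_trajectory(np.array([0.5, 0.5]), [(0.4, 1.0), (0.8, 0.0)], t_end=1.0)
print(beliefs[-1][1])                      # belief over the two states at t = 1
```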
An important observation is that for the case of finite state spaces the filtering distribution is described by a finite-dimensional object, as opposed to the case of, e.g., uncountable state spaces, where the filtering distribution is described by a probability density, which is infinite-dimensional. This is helpful, as the filtering distribution can be used as a state in a control setting.

**Optimal Control and Reinforcement Learning.** Next, we review some results from continuous-time optimal control [41] and reinforcement learning [42], which can be applied to continuous state spaces. Consider a stochastic differential equation (SDE) of the form
$$\mathrm{d}X(t) = f(X(t), u(t))\,\mathrm{d}t + G(X(t), u(t))\,\mathrm{d}W(t),$$
where the dynamical evolution of the state $X(t)$ is corrupted by noise following standard Brownian motion $W(t)$. The control is a function $u : \mathbb{R}_{\geq 0} \to \mathcal{U}$, where $\mathcal{U}$ is an appropriate action space. In traditional stochastic optimal control the optimal value function, also named cost-to-go, of a controlled dynamical system is defined as the expected cumulative reward under the optimal policy,
$$V^*(x) := \max_{u_{[t,\infty)}} \mathbb{E}\left[\frac{1}{\tau}\int_t^\infty e^{-(s-t)/\tau}\, R(X(s), u(s))\,\mathrm{d}s \,\middle|\, X(t) = x\right],$$
where the maximization is carried out over all control trajectories $u_{[t,\infty)} := \{u(s)\}_{s \in [t,\infty)}$. The performance measure or reward function is given by $R : \mathcal{X} \times \mathcal{U} \to \mathbb{R}$ and $\tau$ denotes the discount factor.²

² Note that the discount factor is sometimes defined using certain transformations of $\tau$. We use a normalization by $1/\tau$ for the value function, as it was found to stabilize the learning process when function approximations are used [45].

The optimal value function is given by a PDE, the stochastic HJB equation
$$\frac{1}{\tau} V^*(x) = \max_{u \in \mathcal{U}} \left\{\frac{1}{\tau} R(x, u) + \frac{\partial V^*(x)}{\partial x} f(x, u) + \frac{1}{2}\operatorname{tr}\!\left(\frac{\partial^2 V^*(x)}{\partial x^2}\, G(x, u)G(x, u)^\top\right)\right\}.$$
An optimal policy $\mu^* : \mathcal{X} \to \mathcal{U}$ can be found by solving the PDE and maximizing the right-hand side of the HJB equation for every $x \in \mathcal{X}$. For a deterministic policy $\mu : \mathcal{X} \to \mathcal{U}$ we define its value as the expected cumulative reward
$$V^\mu(x) := \mathbb{E}\left[\frac{1}{\tau}\int_t^\infty e^{-(s-t)/\tau}\, R(X(s), U(s))\,\mathrm{d}s \,\middle|\, X(t) = x\right], \tag{4}$$
where the control $U(s) = \mu(X(s))$, $s \in [t, \infty)$, follows the policy $\mu$. If the state dynamics are deterministic and the state is differentiable in time, which is the case if there is no diffusion, i.e., $G(x, u) = 0$, one can derive a differential equation for the value function [13]. This is achieved by evaluating the value function $V^\mu(x(t))$ at the function $x(t)$ and then differentiating both sides of Eq. (4) with respect to time, resulting in
$$\frac{\mathrm{d}}{\mathrm{d}t} V^\mu(x(t)) = \frac{1}{\tau}\left(V^\mu(x(t)) - R(x(t), \mu(x(t)))\right). \tag{5}$$
The residuum of Eq. (5) can be identified as a continuous-time analogue of the temporal difference (TD) error
$$\delta(t) := \frac{1}{\tau}\left(r(t) - V^\mu(x(t))\right) + \frac{\mathrm{d}}{\mathrm{d}t} V^\mu(x(t)),$$
which an agent should minimize using an observed reward signal $r(t) = R(x(t), \mu(x(t)))$ to solve the optimal control problem using reinforcement learning in an online fashion.

In the presented case of stochastic optimal control we have assumed that the state is directly available to the controller. In the partially observable setting, in contrast, this is not the case, and the controller can only rely on observations of the state. Hence, the sufficient statistic, the belief state, has to be used for calculating the optimal action. Next, we describe how to model the discrete state space version of the latter case.
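To make the continuous-time TD error concrete, the following sketch evaluates $\delta(t)$ along a trajectory of a deterministic toy system ($\dot{x} = -x$) under the normalized value-function convention used above. The dynamics, reward, and candidate value function are illustrative assumptions, and $\frac{\mathrm{d}}{\mathrm{d}t}V^\mu$ is approximated by finite differences; if the candidate value function is consistent with Eq. (5), the residual stays close to zero.

```python
import numpy as np

# Illustrative sketch (not from the paper): the continuous-time TD error
#   delta(t) = (r(t) - V(x(t))) / tau + d/dt V(x(t))
# for the deterministic toy system dx/dt = -x under a fixed policy, with the
# reward R(x) = -x^2 and the matching normalized value function
# V(x) = -x^2 / (1 + 2 tau). d/dt V is approximated by finite differences.

tau = 1.0                                         # discount time constant (assumed symbol)
reward = lambda x: -x ** 2
value = lambda x: -x ** 2 / (1.0 + 2.0 * tau)     # consistent candidate V

def td_errors(x0=1.0, dt=1e-3, steps=2000):
    x, errs = x0, []
    for _ in range(steps):
        x_next = x + dt * (-x)                    # Euler step of dx/dt = -x
        dV_dt = (value(x_next) - value(x)) / dt   # finite-difference d/dt V
        errs.append((reward(x) - value(x)) / tau + dV_dt)
        x = x_next
    return np.array(errs)

print(np.abs(td_errors()).max())                  # near zero for a consistent V
```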
## 3 The Continuous-Time POMDP Model

In this paper we consider a continuous-time equivalent of a POMDP model with time index $t \in \mathbb{R}_{\geq 0}$, defined by the tuple $\langle \mathcal{X}, \mathcal{U}, \mathcal{Y}, \Lambda, P_{Y|X,u}, R, \tau \rangle$. We assume a finite state space $\mathcal{X}$ and a finite action space $\mathcal{U}$. The observation space $\mathcal{Y}$ is either a discrete space or an uncountable space. The latent controlled Markov process follows a CTMC with rate function $\Lambda : \mathcal{X} \times \mathcal{X} \times \mathcal{U} \to \mathbb{R}$, i.e.,
$$P(X(t+h) = x \mid X(t) = x', u(t) = u) = \delta_{x x'} + \Lambda(x', x \mid u)\,h + o(h),$$
with exit rate $\Lambda(x \mid u) = \sum_{x' \neq x} \Lambda(x, x' \mid u)$. Note that the rate function is a function of the control variable. Therefore, the system dynamics are described by a time-inhomogeneous CTMC. The underlying process $X(t)$ cannot be directly observed, but an observation process $Y(t)$ is available which provides information about $X(t)$. The observation model is specified by the conditional probability measure $P_{Y|X,u}$. The reward function is given by $R : \mathcal{X} \times \mathcal{U} \to \mathbb{R}$. Throughout, we consider an infinite horizon problem with discount $\tau$.

We denote the filtering distribution for the latent state $X(t)$ at time $t$ by $\pi(x, t) = P(X(t) = x \mid y_{[0,t)}, u_{[0,t)})$ with belief state $\pi(t) \in \Delta^{|\mathcal{X}|}$. This filtering distribution is used as the state of the partially observable system. Additionally, we define $\mu : \Delta^{|\mathcal{X}|} \to \mathcal{U}$ as a deterministic policy, which maps a belief state to an action. The performance of a policy $\mu$ with control $U(s) = \mu(\pi(s))$, $s \in [0, \infty)$, in the infinite horizon case is given analogously to the standard optimal control case by
$$V^\mu = \mathbb{E}\left[\frac{1}{\tau}\int_0^\infty e^{-s/\tau}\, R(X(s), U(s))\,\mathrm{d}s\right], \tag{6}$$
where the expectation is carried out w.r.t. both the latent state and observation processes.

### 3.1 Simulating the Model

First, we describe a generative process to draw trajectories from a continuous-time POMDP. Consider the stochastic simulation with a given policy $\mu$ for a trajectory starting at time $t$ with state $X(t) = x'$ and belief $\pi(t)$. The first step is to draw a waiting time of the latent CTMC after which the state switches. This CTMC is time-inhomogeneous with exit rate $\Lambda(x' \mid \mu(\pi(t)))$, since the control modulates the transition function. Therefore, we sample the waiting time $\xi$ from the time-dependent exponential distribution with CDF
$$F(\xi) = 1 - \exp\left(-\int_t^{t+\xi} \Lambda(x' \mid \mu(\pi(s)))\,\mathrm{d}s\right),$$
for which one needs to solve the filtering trajectory $\{\pi(s)\}_{s \in [t, t+\xi)}$ using a numerical (stochastic) differential equation solver beforehand. There are multiple ways to sample from time-dependent exponential distributions. A convenient method to jointly calculate the filtering distribution and the latent state trajectory is provided by the thinning algorithm [27]; for an adaptation to the continuous-time POMDP see Appendix B.1. As a second step, after having sampled the waiting time $\xi$, we can draw the next state $X(t+\xi)$ from the categorical distribution
$$P(X(t+\xi) = x \mid x', \xi) = \begin{cases} \dfrac{\Lambda(x', x \mid \mu(\pi(t+\xi)))}{\Lambda(x' \mid \mu(\pi(t+\xi)))} & \text{if } x \neq x', \\[4pt] 0 & \text{otherwise.} \end{cases}$$
The described process is executed for the initial belief and state at $t = 0$ and repeated until the total simulation time has been reached.
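As a rough illustration of this generative process, the following sketch samples the control-modulated waiting time by thinning [27], using candidate times from a homogeneous bound on the exit rate, and propagates the belief alongside. The helper callables (`rates`, `exit_rate`, `policy`, `propagate_belief`) and the bound `rate_bound` are hypothetical placeholders; in particular, observation arrivals and the corresponding belief resets are assumed to be handled inside `propagate_belief`, and the bound must dominate all exit rates.

```python
import numpy as np

# Sketch of the generative process in Section 3.1 (hypothetical helper names):
# the waiting time of the control-modulated latent CTMC is sampled by thinning
# with candidate times from a homogeneous Poisson process of rate `rate_bound`,
# while the belief is propagated alongside. Observation arrivals and Bayes
# resets are assumed to happen inside `propagate_belief`.

rng = np.random.default_rng(0)

def simulate(x0, pi0, policy, rates, exit_rate, propagate_belief, T, rate_bound):
    """rates(x', x, u) and exit_rate(x', u) define the controlled CTMC,
    policy(pi) -> u, propagate_belief(pi, u, dt) -> belief dt later."""
    t, x, pi, traj = 0.0, x0, pi0.copy(), []
    while t < T:
        # --- thinning: propose candidate jump times from the bound ---
        while True:
            dt = rng.exponential(1.0 / rate_bound)        # candidate gap
            pi = propagate_belief(pi, policy(pi), dt)     # belief on [t, t+dt)
            t += dt
            if t >= T:
                return traj
            if rng.random() * rate_bound < exit_rate(x, policy(pi)):
                break                                      # accept: state jumps at t
        # --- draw the next state from the categorical jump distribution ---
        u = policy(pi)
        probs = np.array([rates(x, x2, u) for x2 in range(len(pi))])
        probs[x] = 0.0
        probs = probs / probs.sum()                        # = rates / exit_rate(x, u)
        x = int(rng.choice(len(pi), p=probs))
        traj.append((t, x, pi.copy()))
    return traj
```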
### 3.2 The HJB Equation in Belief Space

Next, we derive an equation for the value function, which can be solved to obtain an optimal policy. The infinite horizon optimal value function is given by the expected discounted cumulative reward
$$V^*(\pi) = \max_{u_{[t,\infty)}} \mathbb{E}\left[\frac{1}{\tau}\int_t^\infty e^{-(s-t)/\tau}\, R(X(s), u(s))\,\mathrm{d}s \,\middle|\, \pi(t) = \pi\right].$$
The value function depends on the belief state, which provides a sufficient statistic for the state. By splitting the integral into two terms, from $t$ to $t+h$ and from $t+h$ to $\infty$, we have
$$V^*(\pi) = \max_{u_{[t,\infty)}} \mathbb{E}\left[\frac{1}{\tau}\int_t^{t+h} e^{-(s-t)/\tau}\, R(X(s), u(s))\,\mathrm{d}s + \frac{1}{\tau}\int_{t+h}^{\infty} e^{-(s-t)/\tau}\, R(X(s), u(s))\,\mathrm{d}s \,\middle|\, \pi(t) = \pi\right],$$
and by identifying the second integral as $e^{-h/\tau} V^*(\pi(t+h))$, we find the stochastic principle of optimality as
$$V^*(\pi) = \max_{u_{[t,t+h]}} \mathbb{E}\left[\frac{1}{\tau}\int_t^{t+h} e^{-(s-t)/\tau}\, R(X(s), u(s))\,\mathrm{d}s + e^{-h/\tau}\, V^*(\pi(t+h)) \,\middle|\, \pi(t) = \pi\right]. \tag{7}$$
Here, we consider the class of filtering distributions which follow a jump diffusion process
$$\mathrm{d}\pi(t) = f(\pi(t), u(t))\,\mathrm{d}t + G(\pi(t), u(t))\,\mathrm{d}W(t) + h(\pi(t), u(t))\,\mathrm{d}N(t), \tag{8}$$
where $f : \Delta^{|\mathcal{X}|} \times \mathcal{U} \to \mathbb{R}^{|\mathcal{X}|}$ denotes the drift function, $G : \Delta^{|\mathcal{X}|} \times \mathcal{U} \to \mathbb{R}^{|\mathcal{X}| \times m}$ the dispersion matrix, with $W(t) \in \mathbb{R}^m$ being an $m$-dimensional standard Brownian motion, and $h : \Delta^{|\mathcal{X}|} \times \mathcal{U} \to \mathbb{R}^{|\mathcal{X}|}$ denotes the jump amplitude. We assume a Poisson counting process $N(t) \sim \mathrm{PP}(N(t) \mid \lambda)$ with rate $\lambda$ for the observation times $\{t_i\}_{i \in \mathbb{N}}$, which implies that $t_i - t_{i-1} \sim \mathrm{Exp}(t_i - t_{i-1} \mid \lambda)$. By applying Itô's formula for jump diffusion processes in Eq. (8) to Eq. (7), dividing both sides by $h$ and taking the limit $h \to 0$, we find the PDE
$$\frac{1}{\tau} V^*(\pi) = \max_{u \in \mathcal{U}} \left\{ \frac{1}{\tau} \mathbb{E}\left[R(x, u) \mid \pi\right] + \frac{\partial V^*(\pi)}{\partial \pi} f(\pi, u) + \frac{1}{2}\operatorname{tr}\!\left(\frac{\partial^2 V^*(\pi)}{\partial \pi^2}\, G(\pi, u)G(\pi, u)^\top\right) + \lambda\left(\mathbb{E}_y\left[V^*(\pi + h(\pi, u))\right] - V^*(\pi)\right)\right\}, \tag{9}$$
where $\mathbb{E}[R(x, u) \mid \pi] = \sum_x R(x, u)\,\pi(x)$. For a detailed derivation see Appendix A.2. We will focus mainly on the case of a controlled continuous-discrete filter
$$\mathrm{d}\pi(x, t) = \sum_{x'} \Lambda(x', x \mid u(t))\,\pi(x', t)\,\mathrm{d}t + \left[\frac{p(y_{N(t)} \mid x, u(t))\,\pi(x, t)}{\sum_{x'} p(y_{N(t)} \mid x', u(t))\,\pi(x', t)} - \pi(x, t)\right]\mathrm{d}N(t).$$
For this filter, the resulting HJB equation is given by the partial integro-differential equation
$$\frac{1}{\tau} V^*(\pi) = \max_{u \in \mathcal{U}} \left\{ \frac{1}{\tau}\sum_x R(x, u)\,\pi(x) + \sum_x \frac{\partial V^*(\pi)}{\partial \pi(x)} \sum_{x'} \Lambda(x', x \mid u)\,\pi(x') + \lambda\left(\int \sum_x p(y \mid x, u)\,\pi(x)\, V^*(\pi^+)\,\mathrm{d}y - V^*(\pi)\right)\right\},$$
with $\pi^+(x) = \frac{p(y \mid x, u)\,\pi(x)}{\sum_{x'} p(y \mid x', u)\,\pi(x')}$. A derivation of the HJB equation for the continuous-discrete filter and for a controlled Wonham filter is provided in Appendix A.3.

To find the optimal policy $\mu^*$ corresponding to the optimal value function $V^*(\pi)$, it is useful to define the optimal advantage function coinciding with Eq. (9) as
$$A^*(\pi, u) := \frac{1}{\tau}\left(\mathbb{E}\left[R(x, u) \mid \pi\right] - V^*(\pi)\right) + \frac{\partial V^*(\pi)}{\partial \pi} f(\pi, u) + \frac{1}{2}\operatorname{tr}\!\left(\frac{\partial^2 V^*(\pi)}{\partial \pi^2}\, G(\pi, u)G(\pi, u)^\top\right) + \lambda\left(\mathbb{E}_y\left[V^*(\pi + h(\pi, u))\right] - V^*(\pi)\right). \tag{10}$$
In order to satisfy Eq. (9), the consistency equation
$$\max_{u \in \mathcal{U}} A^*(\pi, u) = 0 \tag{11}$$
is required. The optimal policy can then be obtained as $\mu^*(\pi) = \arg\max_{u \in \mathcal{U}} A^*(\pi, u)$.

## 4 Algorithms

For the calculation of an optimal policy we have to find the advantage function. As it depends on the value function, we have to calculate both of them. Since the PDE for the value function in Eq. (9) does not have a closed-form solution, we next present two methods to learn the functions approximately.

### 4.1 Solving the HJB Equation Using a Collocation Method

For solving the HJB equation (9) we first apply a collocation method [38, 45, 29] with a parameterized value function $V_\phi(\pi)$, which is modeled by means of a neural network. We define the residual of the HJB equation without the maximum operator under the parameterization as the advantage
$$A_\phi(\pi, u) := \frac{1}{\tau}\left(\mathbb{E}\left[R(x, u) \mid \pi\right] - V_\phi(\pi)\right) + \frac{\partial V_\phi(\pi)}{\partial \pi} f(\pi, u) + \frac{1}{2}\operatorname{tr}\!\left(\frac{\partial^2 V_\phi(\pi)}{\partial \pi^2}\, G(\pi, u)G(\pi, u)^\top\right) + \lambda\left(\mathbb{E}_y\left[V_\phi(\pi + h(\pi, u))\right] - V_\phi(\pi)\right).$$
For learning, we sample beliefs $\{\pi^{(i)}\}_{i=1,\dots,N}$, with $\pi^{(i)} \in \Delta^{|\mathcal{X}|}$, from some base distribution $\pi^{(i)} \sim p(\pi)$ and estimate the optimal parameters by minimizing the squared loss
$$\hat{\phi} = \arg\min_\phi \sum_{i=1}^N \left(\max_{u \in \mathcal{U}} A_\phi(\pi^{(i)}, u)\right)^2.$$
If needed, we can approximate the expectation $\mathbb{E}_y\big[V_\phi(\pi^{(i)} + h(\pi^{(i)}, u))\big]$ over the observation space by sampling. An algorithm for this procedure is given in Appendix B.2. For learning it is required to calculate the gradient $\frac{\partial V_\phi(\pi)}{\partial \pi}$ and Hessian $\frac{\partial^2 V_\phi(\pi)}{\partial \pi^2}$ of the value function w.r.t. the input $\pi$. Generally, this can be achieved by automatic differentiation, but for a fully connected multi-layer feedforward network the analytic expressions are given in Appendix A.4. The analytic expression makes it possible to calculate the gradient, Hessian and value function in one single forward pass [28].

Given the learned value function $V_{\hat{\phi}}$, we learn an approximation of the advantage function $A^*(\pi, u)$ to obtain an approximately optimal policy. To this end we use a parameterized advantage function $A_\psi(\pi, u)$ and employ a collocation method to solve Eq. (10). To ensure the consistency Eq. (11) during learning, we apply a reparameterization [44, 50] as $\bar{A}_\psi(\pi, u) = A_\psi(\pi, u) - \max_{u' \in \mathcal{U}} A_\psi(\pi, u')$. The optimal parameters are found by minimizing the squared loss
$$\hat{\psi} = \arg\min_\psi \sum_{i=1}^N \sum_{u \in \mathcal{U}} \left(A_\psi(\pi^{(i)}, u) - \max_{u' \in \mathcal{U}} A_\psi(\pi^{(i)}, u') - A_{\hat{\phi}}(\pi^{(i)}, u)\right)^2.$$
The corresponding policy can then be easily determined as $\mu(\pi) = \arg\max_{u \in \mathcal{U}} A_{\hat{\psi}}(\pi, u)$ using a single forward pass through the learned advantage function.
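A minimal sketch of the first stage of this collocation scheme (learning $V_\phi$) for the continuous-discrete filter case, where the HJB residual contains no diffusion term, is given below in PyTorch. The toy rate matrix, reward table, discrete observation likelihood, and the constants $\tau$ and $\lambda$ are illustrative placeholders rather than the paper's benchmark problems, and the expectation over observations is computed exactly since the observation space is finite here.

```python
import torch

# Minimal PyTorch sketch of the value-learning stage of the collocation method
# (Section 4.1) for the continuous-discrete filter, whose HJB residual has no
# diffusion term. Rates, rewards, likelihood, tau and lambda are toy choices.

n_states, n_actions, n_obs = 3, 2, 4
tau, lam = 1.0, 2.0
Lam = torch.rand(n_actions, n_states, n_states) * (1 - torch.eye(n_states))
Lam = Lam - torch.diag_embed(Lam.sum(-1))              # Lam[u, x', x], rows sum to 0
R = torch.randn(n_states, n_actions)                   # reward R(x, u)
lik = torch.softmax(torch.randn(n_states, n_obs), -1)  # p(y | x), action-independent here

V = torch.nn.Sequential(torch.nn.Linear(n_states, 64), torch.nn.Tanh(),
                        torch.nn.Linear(64, 1))        # parameterized value V_phi

def advantage(pi):
    """HJB residual A_phi(pi, u) without the max operator; pi: (batch, |X|)."""
    pi = pi.requires_grad_(True)
    v = V(pi).squeeze(-1)                                        # (batch,)
    dV = torch.autograd.grad(v.sum(), pi, create_graph=True)[0]  # dV/dpi
    drift = torch.einsum('uyx,by->bux', Lam, pi)                 # sum_x' Lam(x',x|u) pi(x')
    post = pi.unsqueeze(1) * lik.T.unsqueeze(0)                  # (batch, y, x) ~ p(y|x) pi(x)
    p_y = post.sum(-1, keepdim=True)                             # p(y | pi)
    exp_jump = (p_y.squeeze(-1) * V(post / p_y).squeeze(-1)).sum(-1)  # E_y[V(pi^+)]
    return ((pi @ R) - v.unsqueeze(-1)) / tau \
        + torch.einsum('bx,bux->bu', dV, drift) \
        + lam * (exp_jump - v).unsqueeze(-1)                     # (batch, |U|)

opt = torch.optim.Adam(V.parameters(), lr=1e-3)
for _ in range(200):
    pi = torch.distributions.Dirichlet(torch.ones(n_states)).sample((128,))
    loss = advantage(pi).max(dim=-1).values.pow(2).mean()        # squared HJB residual
    opt.zero_grad(); loss.backward(); opt.step()
```

The subsequent fit of the reparameterized advantage network $A_\psi$ would then proceed analogously, regressing onto $A_{\hat{\phi}}$ at the sampled beliefs.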
### 4.2 Advantage Updating

The HJB equation (9) can also be solved online using reinforcement learning techniques. We apply the advantage updating algorithm [3] and solve Eq. (10) by employing neural network function approximators for both the value function $V_\phi(\pi)$ and the advantage function $A_\psi(\pi, u)$. The expectations in Eq. (10) are estimated using sample runs of the POMDP. Hence, the residual error can be calculated as
$$\mathcal{E}_{\phi,\psi}(t) = A_\psi(\pi(t), u(t)) - \frac{1}{\tau}\left(r(t) - V_\phi(\pi(t))\right) - \frac{\partial V_\phi(\pi(t))}{\partial \pi} f(\pi(t), u(t)) - \frac{1}{2}\operatorname{tr}\!\left(\frac{\partial^2 V_\phi(\pi(t))}{\partial \pi^2}\, G(\pi(t), u(t))G(\pi(t), u(t))^\top\right) - \lambda\left(V_\phi(\pi(t^+)) - V_\phi(\pi(t))\right),$$
which can also be seen in terms of a continuous-time TD error $\delta_\phi(t)$ as $\mathcal{E}_{\phi,\psi}(t) = A_\psi(\pi(t), u(t)) - \delta_\phi(t)$. Again we apply the reparameterization $\bar{A}_\psi(\pi, u) = A_\psi(\pi, u) - \max_{u' \in \mathcal{U}} A_\psi(\pi, u')$ to satisfy Eq. (11). For estimating the optimal parameters $\hat{\phi}$ and $\hat{\psi}$ we minimize the sum of squared residual errors. The data for learning is generated by simulating episodes under an exploration policy as in [44]. As exploration policy, we employ a time-variant policy $\mu(\pi, t) = \arg\max_{u \in \mathcal{U}} \{A_\psi(\pi, u) + \epsilon(u, t)\}$, where $\epsilon(u, t)$ is a stochastic exploration process which we choose as the Ornstein-Uhlenbeck process $\mathrm{d}\epsilon(u, t) = -\epsilon(u, t)\,\mathrm{d}t + \sigma\,\mathrm{d}W(t)$. Generated trajectories are subsampled and saved to a replay buffer [32], which is used to provide data for the training procedure.
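The exploration policy can be sketched as follows: an independent Ornstein-Uhlenbeck perturbation is maintained per action and discretized with an Euler-Maruyama step. The mean-reversion rate `theta`, the noise scale `sigma`, and the step size are assumed values for illustration, and `advantage_fn` stands for the learned $A_\psi$ evaluated over all actions.

```python
import numpy as np

# Sketch of the exploration noise of Section 4.2: one Ornstein-Uhlenbeck
# process per action, discretized with Euler-Maruyama. theta, sigma and dt
# are illustrative choices, not values taken from the paper.

class OUExploration:
    def __init__(self, n_actions, theta=1.0, sigma=0.3, dt=1e-2, seed=0):
        self.eps = np.zeros(n_actions)
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.rng = np.random.default_rng(seed)

    def step(self):
        # d eps = -theta * eps dt + sigma dW
        dW = self.rng.normal(scale=np.sqrt(self.dt), size=self.eps.shape)
        self.eps += -self.theta * self.eps * self.dt + self.sigma * dW
        return self.eps

def exploration_policy(advantage_fn, pi, noise):
    """mu(pi, t) = argmax_u { A_psi(pi, u) + eps(u, t) } (Section 4.2)."""
    return int(np.argmax(advantage_fn(pi) + noise.step()))
```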
## 5 Experiments

**Experimental Tasks.** We tested our derived methods on several toy tasks of continuous-time POMDPs with discrete state and action space: an adaptation of the popular tiger problem [9], a decentralized multi-agent network transmission problem [8] implementing the slotted Aloha protocol, and a grid world problem. All problems are adapted to continuous time and observations at discrete time points. In the following, we provide a brief overview of the considered problems. A more detailed description containing the defined spaces and parameters can be found in Appendix C.

In the tiger problem, the state consists of the position of a tiger (left/right), and the agent has to decide between three actions for either improving his belief (listen) or exploiting his knowledge to avoid the tiger (left/right). While listening, the agent can wait for an observation.

In the transmission problem, a set of stations has to adjust their sending rate in order to successfully transmit packages over a single channel as in the slotted Aloha protocol. Each station might have a package to send or not, and the only observation is the past state of the channel, which can be either idle, transmission, or collision. New packages arrive at the stations with a fixed rate. As there is a unique optimal action for each number of available packages (with the exception of no packages being available), a finite number of transmission rates as actions is sufficient.

In the grid world problem, an agent has to navigate through a grid world by choosing the directions in which to move next. The agent transitions after an exponentially distributed amount of time and, while doing so, can slip with some probability so that he instead moves into another direction. The agent receives only noisy information about his position from time to time.

**Results.** Both the offline collocation method and the online advantage updating method manage to learn reasonable approximations of the value and advantage functions for the considered problems. For the tiger problem, the learned value and advantage functions over the whole belief space are visualized in Fig. 1. The parabolic shape of the value function correctly indicates that higher certainty of the belief about the tiger's position results in a higher expected discounted cumulative reward, as the agent can exploit this knowledge to avoid the tiger. Note that the value function learned by the online advantage updating method differs marginally in shape from the one learned by collocation. This is due to the fact that in the advantage updating method only actually visited belief states are used in the learning process to approximate the value function, and points in between need to be interpolated. The advantage function correctly indicates that for uncertain beliefs it is advantageous to first gain more information by executing the action listen. On the other hand, for certain beliefs directly opening the respective door is more useful in terms of reward, and opening the opposite door is considered even more disadvantageous in these cases.

Figure 1: Learned value and advantage function for the continuous-time tiger problem. The upper plots show the approximated functions learned by the offline collocation method, while for the lower plots the online advantage updating method was applied. The orange bars in the lower left plot show the proportions of the beliefs encountered during online learning. The advantage functions on the right can be used to determine the policy of the optimal agent.

Results for the grid world problem are visualized in Fig. 2, which shows the learned value and advantage function using the online advantage updating method. The figure visualizes the resulting values for certain beliefs, i.e., being at the respective fields with probability one. As expected, the learned value function assigns higher values to fields which are closer to the goal position (3, 2). Actions leading to these states have higher advantage values. For assessing results for uncertain beliefs, which are actually encountered when running the system, the figure also contains a sample run which successfully directs the agent to the goal. Results for the collocation method are found to be very similar; a respective plot can be found in Appendix C.2.

Figure 2: Learned value and advantage function for certain beliefs in the continuous-time grid world problem using the advantage updating method. Being at the goal at position (3, 2) results in a reward for the agent. The black fields in row 3 represent a wall that cannot be crossed. Colors of the fields indicate the approximated value function, while arrows show the proportions of the advantage functions among different actions. The yellow curvy path indicates a respective random run starting at field (0, 0) under the resulting policy.

For the slotted Aloha transmission problem, a random execution of the policy learned by the offline collocation method is shown in Fig. 3. The upper two plots show the true states, i.e., the number of packages and the past system state, and the agent's beliefs, which follow the prior dynamics of the system and jump when new observations are made.
In the plot at the bottom, the learned advantage function and the resulting policy for the encountered beliefs are visualized. As derived in the appendix, in the case of perfect information about the number of packages $n$, it is optimal to execute action $n - 1$, with the exception of $n = 0$, where the action does not matter. When facing uncertainty, however, an optimistic behavior, which is also reflected by the learned policy, is more reasonable: for a lower number of packages, the probability of sending a package and therefore collecting a higher reward is higher. Thus, in case of uncertainty, one should opt for a lower action than the one executed under perfect information. Results for the online advantage updating method look qualitatively equal, reflecting the same reasonable behavior. A respective plot is omitted here but can be found in Appendix C.2.

Figure 3: A sample run for the slotted Aloha transmission problem using a policy learned by the collocation method. The upper plot shows the actual number of packages available in the system, while the middle one shows the past system state, which can be either idle (i), transmission (t), or collision (c). The background of both plots indicates the marginal belief over the number of packages and the past system state, respectively. The past system states that are observed at discrete time points are indicated by a cross. The lower plot shows the policy of the agent resulting from the learned advantage function, while the per-timestep normalized advantage function is indicated by the background.

## 6 Conclusion

In this work we introduced a new model for decision making in continuous time under partial observability. We presented (i) a collocation-based offline method and (ii) an advantage updating online algorithm, which both find approximate solutions by the use of deep learning techniques. For evaluation we discussed qualitatively the solutions found for a set of toy problems that were adapted from the literature. The solutions were shown to represent reasonable optimal decisions for the given problems.

In the future we are interested in exploring ways to make the proposed model applicable to more realistic large-scale problems. First, throughout this work we made the assumption of a known model. In many applications, however, this might not be the case, and investigating how to realize dual control methods [14, 1] might be a fruitful direction. New scalable techniques for estimating parameters of latent CTMCs which could be used are discussed in [51], but learning the filtering distribution directly from data might also be an option if the model is not available [19]. An issue we faced was that for high-dimensional problems the algorithms converged only slowly to the optimal solution, as the belief space grows linearly in the number of dimensions w.r.t. the number of states of the latent process. The introduction of variational and sampling methods seems promising to project the filtering distribution to a lower-dimensional space and make the solution of high-dimensional problems feasible. This could also enable the extension to continuous state spaces, where the filtering distribution is in general infinite-dimensional and has to be projected to a finite-dimensional representation, e.g., as is done in assumed density filtering [36]. This would enable interesting applications, for example in queueing networks or, more generally, in stochastic hybrid systems.
## Acknowledgments

This work has been funded by the German Research Foundation (DFG) as part of the project B4 within the Collaborative Research Center (CRC) 1053 MAKI, and by the European Research Council (ERC) within the CONSYN project, grant agreement number 773196.

## Broader Impact

Not applicable to this manuscript.

## References

[1] B. Alt, A. Šošić, and H. Koeppl. Correlation priors for reinforcement learning. In Advances in Neural Information Processing Systems, pages 14155–14165, 2019.

[2] A. Bain and D. Crisan. Fundamentals of stochastic filtering, volume 60. Springer Science & Business Media, 2008.

[3] L. C. Baird. Reinforcement learning in continuous time: Advantage updating. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN '94), volume 4, pages 2448–2453. IEEE, 1994.

[4] D. P. Bertsekas. Dynamic programming and optimal control, volume 2. Athena Scientific, Belmont, MA.

[5] S. Bhasin, R. Kamalapurkar, M. Johnson, K. G. Vamvoudakis, F. L. Lewis, and W. E. Dixon. A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems. Automatica, 49(1):82–92, 2013.

[6] G. Bolch, S. Greiner, H. De Meer, and K. S. Trivedi. Queueing networks and Markov chains: modeling and performance evaluation with computer science applications. John Wiley & Sons, 2006.

[7] S. J. Bradtke and M. O. Duff. Reinforcement learning methods for continuous-time Markov decision problems. In Advances in Neural Information Processing Systems, pages 393–400, 1995.

[8] A. R. Cassandra. Exact and Approximate Algorithms for Partially Observable Markov Decision Processes. PhD thesis, Department of Computer Science, Brown University, Providence, RI, 1998.

[9] A. R. Cassandra, L. P. Kaelbling, and M. L. Littman. Acting optimally in partially observable stochastic domains. In AAAI, volume 94, pages 1023–1028, 1994.

[10] C. G. Cassandras and S. Lafortune. Introduction to discrete event systems. Springer Science & Business Media, 2009.

[11] C. G. Cassandras and J. Lygeros. Stochastic hybrid systems. CRC Press, 2018.

[12] P. Chaudhari, S. Karaman, D. Hsu, and E. Frazzoli. Sampling-based algorithms for continuous-time POMDPs. In 2013 American Control Conference, pages 4604–4610. IEEE, 2013.

[13] K. Doya. Reinforcement learning in continuous time and space. Neural Computation, 12(1):219–245, 2000.

[14] A. Feldbaum. Dual control theory. I. Avtomatika i Telemekhanika, 21(9):1240–1249, 1960.

[15] J. Han, A. Jentzen, and E. Weinan. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.

[16] F. B. Hanson. Applied stochastic processes and control for jump-diffusions: modeling, analysis and computation. SIAM, 2007.

[17] M. E. Harmon and L. C. Baird III. Multi-player residual advantage learning with general function approximation. Wright Laboratory, WL/AACF, Wright-Patterson Air Force Base, OH, pages 45433–7308, 1996.

[18] L. Huang, L. Pauleve, C. Zechner, M. Unger, A. S. Hansen, and H. Koeppl. Reconstructing dynamic molecular states from single-cell time series. Journal of The Royal Society Interface, 13(122):20160533, 2016.

[19] M. Igl, L. Zintgraf, T. A. Le, F. Wood, and S. Whiteson. Deep variational reinforcement learning for POMDPs. In International Conference on Machine Learning, pages 2117–2126, 2018.

[20] D. H. Jacobson. New second-order and first-order algorithms for determining optimal control: A differential dynamic programming approach. Journal of Optimization Theory and Applications, 2(6):411–440, 1968.
[21] A. Jain and D. Precup. Eligibility traces for options. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 1008–1016. International Foundation for Autonomous Agents and Multiagent Systems, 2018.

[22] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998.

[23] R. Kalman. The theory of optimal control and the calculus of variations. In Mathematical optimization techniques, pages 309–331. University of California Press, Los Angeles, CA, 1963.

[24] H. J. Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment, 2005(11):P11011, 2005.

[25] H. J. Kushner. On the differential equations satisfied by conditional probability densities of Markov processes, with applications. Journal of the Society for Industrial and Applied Mathematics, Series A: Control, 2(1):106–119, 1964.

[26] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[27] P. W. Lewis and G. S. Shedler. Simulation of nonhomogeneous Poisson processes by thinning. Naval Research Logistics Quarterly, 26(3):403–413, 1979.

[28] M. Lutter, C. Ritter, and J. Peters. Deep Lagrangian networks: Using physics as model prior for deep learning. In International Conference on Learning Representations, 2019.

[29] M. Lutter, B. Belousov, K. Listmann, D. Clever, and J. Peters. HJB optimal feedback control with deep differential value functions and action constraints. In Proceedings of the Conference on Robot Learning, pages 640–650. PMLR, 2020.

[30] S. Mahadevan. Partially observable semi-Markov decision processes: Theory and applications in engineering and cognitive science. In AAAI Fall Symposium on Planning with Partially Observable Markov Decision Processes, Orlando, FL, USA, pages 113–120, 1998.

[31] H. Mei and J. M. Eisner. The neural Hawkes process: A neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems, pages 6754–6764, 2017.

[32] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[33] J. R. Norris. Markov chains. Number 2 in Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 1998.

[34] S. Omidshafiei, A.-A. Agha-Mohammadi, C. Amato, S.-Y. Liu, J. P. How, and J. Vian. Graph-based cross entropy method for solving multi-robot decentralized POMDPs. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 5395–5402. IEEE, 2016.

[35] L. S. Pontryagin. The mathematical theory of optimal processes. John Wiley, 1962.

[36] S. Särkkä and A. Solin. Applied stochastic differential equations, volume 10. Cambridge University Press, 2019.

[37] D. Silver and J. Veness. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems, pages 2164–2172, 2010.

[38] A. Simpkins and E. Todorov. Practical numerical methods for stochastic optimal control of biological systems in continuous time and space. In 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pages 212–218. IEEE, 2009.

[39] J. Sirignano and K. Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics, 375:1339–1364, 2018.
[40] R. Srinivasan and A. K. Parlikad. Semi-Markov decision process with partial information for maintenance decisions. IEEE Transactions on Reliability, 63(4):891–898, 2014.

[41] R. F. Stengel. Optimal control and estimation. Courier Corporation, 1994.

[42] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT Press, 2018.

[43] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.

[44] C. Tallec, L. Blier, and Y. Ollivier. Making deep Q-learning methods robust to time discretization. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 6096–6104, Long Beach, California, USA, 2019. PMLR.

[45] Y. Tassa and T. Erez. Least squares solutions of the HJB equation with neural network value-function approximators. IEEE Transactions on Neural Networks, 18(4):1031–1041, 2007.

[46] Y. Tassa, T. Erez, and W. D. Smart. Receding horizon differential dynamic programming. In Advances in Neural Information Processing Systems, pages 1465–1472, 2008.

[47] E. A. Theodorou and E. Todorov. Stochastic optimal control for nonlinear Markov jump diffusion processes. In 2012 American Control Conference (ACC), pages 1633–1639. IEEE, 2012.

[48] K. G. Vamvoudakis and F. L. Lewis. Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica, 46(5):878–888, 2010.

[49] C. Wang, D. Blei, and D. Heckerman. Continuous time dynamic topic models. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, pages 579–586, 2008.

[50] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, pages 1995–2003, 2016.

[51] C. Wildner and H. Koeppl. Moment-based variational inference for Markov jump processes. In International Conference on Machine Learning, pages 6766–6775, 2019.

[52] W. M. Wonham. Some applications of stochastic differential equations to optimal nonlinear filtering. Journal of the Society for Industrial and Applied Mathematics, Series A: Control, 2(3):347–369, 1964.

[53] H. Xie, Y. Li, and J. C. Lui. Optimizing discount and reputation trade-offs in e-commerce systems: Characterization and online learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7992–7999, 2019.

[54] N. Ye, A. Somani, D. Hsu, and W. S. Lee. DESPOT: Online POMDP planning with regularization. Journal of Artificial Intelligence Research, 58:231–266, 2017.