Active Observing in Continuous-time Control

Samuel Holt, University of Cambridge, sih31@cam.ac.uk
Alihan Hüyük, University of Cambridge, ah2075@cam.ac.uk
Mihaela van der Schaar, University of Cambridge, mv472@cam.ac.uk

Abstract

The control of continuous-time environments while actively deciding when to take costly observations in time is a crucial yet unexplored problem, particularly relevant to real-world scenarios such as medicine, low-power systems, and resource management. Existing approaches either rely on continuous-time control methods that take regular, expensive observations in time, or on discrete-time control with costly observation methods, which are inapplicable to continuous-time settings due to the compounding discretization errors introduced by time discretization. In this work, we are the first to formalize the continuous-time control problem with costly observations. Our key theoretical contribution shows that observing at regular time intervals is not optimal in certain environments, and that irregular observation policies yield higher expected utility. This perspective paves the way for the development of novel methods that can take irregular observations in continuous-time control with costly observations. We empirically validate our theoretical findings in various continuous-time environments, including a cancer simulation, by constructing a simple initial method to solve this new problem, with a heuristic threshold on the variance of reward rollouts in an offline continuous-time model-based model predictive control (MPC) planner. Although determining the optimal method remains an open problem, our work offers valuable insights and understanding of this unique problem, laying the foundation for future research in this area.

1 Introduction

The problem of continuous control with costly observations is ubiquitous, with applications in medicine, biological systems, low-power systems, robotics, resource management, and surveillance [Yoshioka and Tsujimura, 2020, Brunereau et al., 2012, Mastronarde and van der Schaar, 2012]. A setting that is shared across all these domains is that a decision-maker needs to continually control (e.g., chemotherapy dosing, quantities of food, data to transmit, etc.) whilst deciding when to take a costly observation (e.g., a medical computed tomography (CT) scan, measuring the population of a species, measuring bandwidth, etc.). The decision-maker's observing policy must be timely to determine whether the action plan is effective (e.g., errors in treating stage 4 cancer can be fatal [Reinhardt et al., 2019]; further application examples are given in Appendix B).

In many of these real-world systems (e.g., medicine, low-power systems, resource management), an offline setup is beneficial as it enables decision-makers to learn control policies without incurring excessive costs or risking adverse effects. Here, an offline setup refers to learning a policy from a previously collected dataset of state-action trajectories, without interacting with the environment [Argenson and Dulac-Arnold, 2020]. The development of novel methods that can take irregular observations in continuous-time control with costly observations, within an offline setup, provides a safe and cost-effective way to improve real-world decision-making processes.

Recent work falls into three main categories. First, sensing approaches determine when to informatively observe to identify an underlying state, but are unable to continually control.
Second, planning approaches only continually control and have the restrictive assumption that the observing schedule is given, most often at a regularly fixed observation time interval. While these methods can optimally solve many challenging tasks with regular, frequent observations of the state, frequent observations are overly costly, many are unnecessary, and the optimal regular observation interval is unknown. Third, discrete monitoring approaches control and determine when to observe, but only for discrete time. These are fundamentally incompatible with continuous-time systems, which can exhibit irregular observations in time and states that evolve on different time scales. Moreover, these rely on a discretization interval being carefully selected, yet still suffer from time discretization errors that can compound, leading to unstable and poor-performing controllers [Yildiz et al., 2021]. For example, in medicine, a crucial test may be administered too late and therefore delay updating the treatment dosing, which can be fatal when treating patients for cancer [Geng et al., 2017].

Contributions: (1) We are the first to formalize the problem of continuous-time control whilst deciding when to take costly observations. Theoretically, we show that regular observing in continuous time with costly observations is not optimal for some systems and that irregularly observing can achieve a higher expected utility (Section 2). (2) Empirically, we verify this key theoretical result in a cancer simulation and standard continuous-time control environments with costly observations. We construct a simple initial method to solve this new problem, called Active Observing Control. This uses a heuristic threshold on the variance of reward rollouts in an offline continuous-time model-based model predictive control (MPC) planner (Sections 4 and 5.1). However, the optimal method is still an open problem, and we leave this for future work. We gain insight into this unique problem, uncovering how our initial method is capable of irregularly observing and identifies observation frequencies that correlate with the cancer-stage-dependent fixed observing frequencies of clinicians. Further, we demonstrate how it can avoid discretization errors in time, thus achieving a better utility as a result (Section 5.2), and confirm its robustness to its threshold hyperparameter (Section 5.3). We have released a PyTorch [Paszke et al., 2019b] implementation of the code at https://github.com/samholt/ActiveObservingInContinuous-timeControl and have a broader research group codebase at https://github.com/vanderschaarlab/ActiveObservingInContinuous-timeControl.

2 Problem Formulation

States & actions. For a system with state space $\mathcal{S} = \mathbb{R}^{d_S}$ and action space $\mathcal{A} = \mathbb{R}^{d_A}$, the state at time $t \in \mathbb{R}_+$ is denoted as $s(t) = [s_1(t), \ldots, s_{d_S}(t)] \in \mathcal{S}$ and the action at time $t \in \mathbb{R}_+$ is denoted as $a(t) = [a_1(t), \ldots, a_{d_A}(t)] \in \mathcal{A}$. We elaborate that a state trajectory $s \in \mathcal{S}^{\mathbb{R}_+}$ and an action trajectory $a \in \mathcal{A}^{\mathbb{R}_+}$ are both functions of time, where an individual state $s(t) \in \mathcal{S}$ or an individual action $a(t) \in \mathcal{A}$ are points on these trajectories. In practical applications, action values are usually bounded by an actuator's limits; hence we also restrict the action space to a Euclidean box $\mathcal{A} = [a_{\min}, a_{\max}]^{d_A}$.

Algorithm 1: Policies ρ, π interacting with the environment
1: $t_1 = 0$, $h_1 = \{(t_1, z(t_1), a(t_1))\}$
2: for $i \in \{1, 2, \ldots\}$:
3:   Schedule the next observation: $t_{i+1} = t_i + \rho(h_i)$
4:   Execute actions: $a(t) = \pi(h_i, t - t_i)$ for $t \in [t_i, t_{i+1})$
5:   Take an observation: $h_{i+1} = h_i \cup \{(t_{i+1}, z(t_{i+1}), a(t_{i+1}))\}$
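For illustration, the following is a minimal Python sketch of one episode of the interaction protocol in Algorithm 1. The environment interface (`env.observe`, `env.step_until`) and the policy callables `rho` and `pi` are hypothetical placeholders for exposition only, not part of our released implementation.

```python
def run_episode(env, rho, pi, horizon_T):
    """Roll out an observing policy `rho` and a decision policy `pi` (Algorithm 1).

    Hypothetical interfaces: env.observe(t) returns a noisy observation z(t), and
    env.step_until(a_fn, t0, t1) applies the action trajectory a(t) = a_fn(t - t0)
    over [t0, t1) and returns the reward integrated over that interval.
    """
    t = 0.0
    history = [(t, env.observe(t), None)]        # h_1 = {(t_1, z(t_1), a(t_1))}
    total_reward, n_observations = 0.0, 1
    while t < horizon_T:
        delta = rho(history)                     # schedule the next observation time
        t_next = min(t + delta, horizon_T)
        a_fn = lambda tau: pi(history, tau)      # action trajectory over [t, t_next)
        total_reward += env.step_until(a_fn, t, t_next)
        z_next = env.observe(t_next)             # take a costly observation
        history.append((t_next, z_next, a_fn(t_next - t)))
        n_observations += 1
        t = t_next
    return total_reward, n_observations
```

The utility of an episode (Equation (1) below) is then the accumulated reward minus the observation cost $c$ multiplied by the number of observations taken.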
Environment dynamics. The dynamics of the system are given by $ds(t)/dt = f(s(t), a(t))$, and the system is stationary. We consider the setting where the true state trajectory $s$ is latent and, at a given time $t$, only a noisy observation $z(t) = s(t) + \varepsilon(t) \in \mathcal{S}$ can be observed, where $\varepsilon(t) \sim \mathcal{N}(0, \sigma_\epsilon^2)$ is Gaussian noise with zero mean and standard deviation $\sigma_\epsilon$, and $\varepsilon(t) \perp \varepsilon(t')$ for all $t, t' \in \mathbb{R}_+$. We denote by a tuple $(t, z(t), a(t))$ a discrete sample taken at time $t$, and by $h = \{(t_j, z(t_j), a(t_j))\} \in \mathcal{H}$ a history of samples taken at times $\{t_j\}$, where $\mathcal{H} = \bigcup_{n=0}^{\infty} (\mathbb{R}_+ \times \mathcal{S} \times \mathcal{A})^n$ is the space of all possible histories.

Policies. Policies consist of an observing policy $\rho : \mathcal{H} \to \mathbb{R}_+$ and a decision policy $\pi : \mathcal{H} \times \mathbb{R}_+ \to \mathcal{A}$. These policies interact with the environment as described in Algorithm 1. We denote $\Delta_{j+1} = t_{j+1} - t_j$, where $j$ is a dummy variable.

Objective. Suppose we are given a reward function $r$ to maximize, and each observation has a fixed cost $c \in \mathbb{R}_+$. Then, the utility achieved by a policy up to a time horizon $T$ is given by

$$U = \underbrace{\int_0^T r(s(t), a(t), t)\, dt}_{\text{Reward } R} \;-\; \underbrace{c\,|\{t_i : t_i \in [0, T]\}|}_{\text{Cost } C} \quad (1)$$

Our objective is to find the optimal observing and decision policies $\rho^*, \pi^* = \arg\max_{\rho, \pi} \mathbb{E}[U]$ given an offline dataset of histories $\mathcal{D} = \{h^{(i)}\}$, but without knowing $f$ and $\sigma_\epsilon$ or having online access to the system described by $f$ and $\sigma_\epsilon$.

2.1 Taking regular observations in time is not optimal for some systems

This is a highly non-trivial task since, in many cases, the obvious solution of taking regular observations in time is not optimal; rather, taking irregular observations achieves a higher expected utility, a point we make formally in Proposition 2.1.

Proposition 2.1. For some systems, it is not optimal to observe regularly, that is, $\exists f, \sigma_\epsilon, r, c, h, h' : \rho^*(h) \neq \rho^*(h')$.

Proof. The full proof is in Appendix C; here we present a sketch. Consider the task of maximizing the utility $U$ given a reward function $r(s, a, t) = \delta(t - T) \cdot \mathbb{1}\{s(t) \cdot a(t) > 0\}$ and observation cost $c > 0$ for the specific system $ds/dt = s(t)$ with observation noise ($\sigma_\epsilon > 0$). Intuitively, this reward function only provides a positive reward of +1 at the end of the episode ($t = T$) if the chosen action $a(T)$ has the same sign as the unobserved state $s(T)$; otherwise, the reward is 0. Therefore, this $r$ is maximized when the sign of the correct state $s(t)$ is identified. Since this system is defined by $ds/dt = s(t)$, we know that state solutions are of the form $s(t) = s(0)e^t$. Thus, to determine the sign of $s(t)$, since $e^t$ is always strictly positive, the sign of the state is determined by the initial condition $s(0)$. Our observing policy can observe anywhere inside the interval $[0, T]$. We prove that the optimal observing policy $\rho^*$ cannot be a regular observing policy by showing that, for each regular policy $\rho^{(\text{reg},\delta)}$, there exists at least one irregular policy $\rho^{(\text{irreg},\ell)}$ with $\ell \geq 2$ that achieves higher expected utility, where $\ell$ is the number of observations taken. A higher expected utility can be achieved by an irregular observing policy that takes all observations at the end of the episode, i.e. $\rho^{(\text{irreg},\ell)}(h_i) = T - t_i$ for $i \in \{2, \ldots, \ell\}$, that is, when the signal-to-noise ratio $s(t)^2/\sigma_\epsilon^2$ is the highest, i.e. $\arg\max_t \{s(0)^2 e^{2t}/\sigma_\epsilon^2 : t \in [0, T]\}$, which occurs at $t = T$.
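To illustrate the proof sketch numerically, the following short simulation (a sketch with assumed toy settings $\sigma_\epsilon = 3$, $T = 1$, and $\ell = 3$ observations, chosen only so the gap is visible) estimates the expected reward of a regular schedule versus a schedule that places all observations at $t = T$; with an equal number of observations, the observation cost is identical, so comparing rewards suffices.

```python
import numpy as np

rng = np.random.default_rng(0)
T, sigma, n_obs, n_trials = 1.0, 3.0, 3, 200_000  # assumed toy settings

def expected_reward(obs_times):
    """Estimate E[reward] for ds/dt = s with s(0) = +/-1 w.p. 1/2, when the
    action's sign is chosen from noisy observations z(t) = s(t) + eps."""
    t = np.asarray(obs_times)
    s0 = rng.choice([-1.0, 1.0], size=n_trials)
    z = s0[:, None] * np.exp(t)[None, :] + rng.normal(0.0, sigma, (n_trials, len(t)))
    # log-likelihood ratio of s(0)=+1 vs s(0)=-1 under Gaussian observation noise
    b = 0.5 * ((z + np.exp(t)) ** 2 - (z - np.exp(t)) ** 2).sum(axis=1) / sigma**2
    action_sign = np.where(b >= 0.0, 1.0, -1.0)
    return (action_sign == s0).mean()            # reward is 1{a(T) * s(T) > 0}

regular = expected_reward(np.linspace(T / n_obs, T, n_obs))   # evenly spaced in [0, T]
irregular = expected_reward(np.full(n_obs, T))                # all observations at t = T
print(f"regular: {regular:.3f}  irregular (all at T): {irregular:.3f}")
# With the same number of observations (equal cost), the irregular schedule
# attains a higher expected reward, hence a higher expected utility.
```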
This motivates us to develop new continuous-time methods for solving this problem, where a solution should be able to flexibly adapt to take irregular observations when doing so achieves a higher expected utility. Intuitively, based on the above system, it is more informative to take an observation when the predicted state change magnitude is larger, which can occur faster for a large state velocity $\dot{s}$, indicating observations should be taken more frequently. We later show experimentally illustrative cases (Section 5.2) of how irregularly observing can lead to better performance.

3 Related work

Table 1 summarizes the key differences between Active Observing Control and related approaches to planning and monitoring in reinforcement learning (RL). Moreover, we provide an extended related work in Appendix D, which includes discussions of the benefits of offline RL, model-based RL, why model predictive control is our preferred action policy, event-based sampling, Semi-MDP methods, and Linear Quadratic Regulator (LQR) & Linear Quadratic Gaussian (LQG) methods.

Sensing approaches have been proposed for deciding when to optimally take an observation in both discrete time [Jarrett and Van Der Schaar, 2020] and continuous time [Alaa and Van Der Schaar, 2016, Barfoot et al., 2014], where their goal is to identify an underlying state. However, these approaches cannot also continuously control the system. In contrast, Active Observing Control seeks to both actively observe and control the system.

Planning approaches only focus on optimal decision policies $\pi$, and therefore observing has to be provided by a schedule, which is often assumed to be at a regular constant interval in time $\Delta_{i+1} = \bar{\Delta}$ — that is, observations are not actively planned [Yildiz et al., 2021]. Within planning approaches, there exist many discrete-time approaches [Chua et al., 2018, Mnih et al., 2013, Williams et al., 2017] and recently more continuous-time approaches [Du et al., 2020, Yildiz et al., 2021].

Table 1: Comparison with related approaches to planning and monitoring in RL. Plots for the corresponding state trajectories for each approach are illustrated in Figure 1. Our initial method, Active Observing Control, is the only method for continuous-time control whilst deciding when to observe with smoothly-varying states.

| Approach | Ref. | Domain | Environment Dynamics | State Trajectories | Policies | Policy Formulation |
| --- | --- | --- | --- | --- | --- | --- |
| Discrete Sensing | [Jarrett and Van Der Schaar, 2020] | Discrete-time | $s_i = s_{\text{constant}}$ | A | Observing-only | $i + 1 = i + \rho(h_i)$ |
| Discrete Planning | [Chua et al., 2018] | Discrete-time | $s_{i+1} \sim f(s_i, a_i)$ | B | Decision-only | $a_i = \pi(h_i)$ |
| Discrete Monitoring | [Nam et al., 2021] | Discrete-time | $s_{i+1} \sim f(s_i, a_i)$ | C | Decision & observing | $a_{\{i,\ldots,i+\rho(h_i)\}} = \pi(h_i)$ |
| Continuous Sensing | [Alaa and Van Der Schaar, 2016] | Continuous-time | $s(t) = s_{\text{constant}}$ | D | Observing-only | $t_{i+1} = t_i + \rho(h_i)$ |
| Continuous Planning | [Yildiz et al., 2021] | Continuous-time | $ds(t)/dt \sim f(s(t), a(t))$ | E | Decision-only (w/ regular observations) | $a(t \in [t_i, t_i + \bar{\Delta})) = \pi(h_i, t - t_i)$ |
| Semi-continuous Monitoring | [Huang et al., 2019] | Continuous-time | $s(t \in [t_k, t_{k+1})) = s_k$; $(s_{k+1}, t_{k+1} - t_k) \sim f(s_k, a(t \in [t_k, t_{k+1})))$ | F | Decision & observing | $a(t \in [t_i, t_i + \rho(h_i))) = \pi(h_i, t - t_i)$ |
| Active Observing Control (Ours) | — | Continuous-time | $ds(t)/dt \sim f(s(t), a(t))$ | G | Decision & observing | $a(t \in [t_i, t_i + \rho(h_i))) = \pi(h_i, t - t_i)$ |

Figure 1: Comparison of related approaches for an environment's latent state $s(t)$, where green lines represent observing times and red dots represent observations that have observation noise.

Specifically, Yildiz et al. [2021] presented a seminal online continuous-time model-based RL algorithm, leveraging a continuous-time dynamics model that can predict the next state at an arbitrary time $s(t)$.
However, all these methods are unable to plan when to take the next observation, in contrast to Active Observing Control, which is able to determine when to take the next observation whilst planning an action trajectory.

Monitoring approaches consist of both a decision policy $\pi$ and an observing policy $\rho$; however, existing approaches only consider the discrete-time and discrete-state setting with observation costs [Sharma et al., 2017, Nam et al., 2021, Aguiar et al., 2020, Bellinger et al., 2020a, Krueger et al., 2020, Bellinger et al., 2020b]. In particular, Krueger et al. [2020] showed that even in the simple setting of a discrete-state multi-armed bandit, computing the optimal time to take the next observation is intractable; therefore, they must rely on heuristics for their observing policy. Broadly, discrete-time monitoring methods use a discrete-time model and propagate the last observed state, often appending either a flag or a counter to the state-action tuple to indicate the time since an observation was taken [Aguiar et al., 2020, Bellinger et al., 2020b]. However, these methods cannot be applied to continuous-time environments, nor do they propagate the predicted current state and its associated prediction interval quantifying the uncertainty of the state estimate. Moreover, training a policy that decides at each state whether to observe it or not [Aguiar et al., 2020] requires a discretization of time. In contrast, Active Observing Control, which determines the continuous-time interval of when to observe the next state, does not require any discretization of time; hence it is a continuous-time method and avoids compounding time discretization errors. One approach exists that we term semi-continuous monitoring, where Huang et al. [2019] propose a discrete-state, constant-action policy that determines when to take an observation in continuous time of a controlled Markov Jump Process. However, this approach is limiting, as it assumes actions are constant until the next sample of the state is taken [Ni and Jang, 2022], which is clearly suboptimal, uses discrete states, and assumes a Markov Jump Process [Gillespie, 1991]. Instead, Active Observing Control is fully continuous in both states and observing times, controlled by an action trajectory $a$, giving rise to smoothly-varying states.

4 Active Observing Control

We now propose Active Observing Control (AOC), an initial method for the defined problem of continuous-time control whilst deciding when to take costly observations. AOC can plan action trajectories in continuous time and plan when to observe next in continuous time. The key idea is to use a continuous-time uncertainty-aware dynamics model $\hat{f}_\theta$ to plan 1) action trajectories and 2) the next time to take an observation, such that it is informative to do so; Figure 2 provides a block diagram. We adhere to the standard offline model-based setup [Lutter et al., 2021] of first learning our dynamics model (Section 4.1) and then using it at run-time with our planning method (Section 4.2).
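A minimal sketch of this two-step run-time loop follows. The helpers `mpc_plan` and `plan_next_obs_time` are hypothetical stand-ins for the MPC planner and the reward-uncertainty thresholding rule described in Section 4.2, and `env` is an assumed interface; this is an expository sketch, not our released implementation.

```python
def active_observing_control(env, dynamics_model, mpc_plan, plan_next_obs_time,
                             horizon_T, tau, c):
    """High-level AOC run-time loop (sketch): plan actions with MPC, then plan the
    next observation time as the point where the reward uncertainty of the planned
    trajectory would cross the threshold tau."""
    t = 0.0
    z = env.observe(t)                              # initial (costly) observation
    utility = -c
    while t < horizon_T:
        # 1) Plan an action trajectory a(.) from the latest observation.
        action_traj = mpc_plan(dynamics_model, z)
        # 2) Plan the next observation time via the uncertainty threshold tau.
        delta = plan_next_obs_time(dynamics_model, z, action_traj, tau)
        t_next = min(t + delta, horizon_T)
        utility += env.execute(action_traj, t, t_next)   # reward integrated over [t, t_next)
        z = env.observe(t_next)                          # take the next costly observation
        utility -= c
        t = t_next
    return utility
```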
Figure 2: Block diagram of Active Observing Control. An uncertainty-aware dynamics model $\hat{f}_\theta$ is learned from an offline dataset $\mathcal{D}$ of state-action trajectories. At run-time, planning consists of two steps: 1) the actions are determined by a Model Predictive Control (MPC) planner, and 2) the determined action trajectory $a$ is forward simulated to provide uncertainty $\sigma(r(t))$ on the planned path reward. We determine the continuous time $t_{i+1}$ up to which to execute the action plan $a$, such that $\sigma(r(t)) < \tau$, as in Algorithm 3. We then execute $a(t), \forall t \in [t_i, t_{i+1})$ up until the time at which to take the next observation $z(t_{i+1})$.

4.1 Learning a continuous-time uncertainty-aware dynamics model

Fundamentally, the goal of offline model-based RL is to learn an accurate dynamics model that can be used for planning from an offline dataset $\mathcal{D}$ [Lutter et al., 2021]. In particular, we require a dynamics model that is both uncertainty-aware and continuous in time; that is, it can provide a prediction uncertainty for the next state at a future time $t + \delta$, i.e. $p(z(t + \delta) \mid z(t), a(t), \delta)$. Here we use the time difference $\delta$ as an input to create a continuous-time dynamics model [Yildiz et al., 2021].

Model-based RL has shown the crucial importance, for performance, of learning an uncertainty-aware dynamics model that captures both 1) aleatoric uncertainty, due to the inherent stochasticity of the environment (e.g., observation noise and environment process noise), which is irreducible, and 2) epistemic uncertainty, due to the lack of data for a given state-action space (which should vanish when trained in the limit of infinite data) [Chua et al., 2018]. We learn an uncertainty-aware dynamics model by training an ensemble of high-capacity multi-layer perceptron (MLP) neural network models that each parameterize a multivariate Gaussian distribution, where the ensembling captures the epistemic uncertainty and each model individually captures the aleatoric uncertainty. We note that, as shown by other works, ensembles of high-capacity neural networks outperform Gaussian process dynamics models, as they have constant-time inference, scaling better to larger offline datasets, while still providing well-calibrated uncertainty estimates [Lakshminarayanan et al., 2017], and they can model more complex functions, including non-smooth dynamics [Chua et al., 2018].

Precisely, the ensemble consists of $M = 5$ neural network models that output the parameters of a multivariate Gaussian distribution with diagonal covariance, each with parameters $\theta_m$, i.e.,

$$\hat{f}_{\theta_m} = p_{\theta_m}(z(t+\delta) \mid z(t), a(t), \delta) = \mathcal{N}\big(\mu_{\theta_m}(z(t), a(t), \delta),\, \Sigma_{\theta_m}(z(t), a(t), \delta)\big),$$

where the elements of the diagonal covariance are given by $\sigma^2_{\theta_m}(z(t), a(t), \delta)$, and the ensemble has a total of $\theta = \{\theta_m\}_{m=1}^M$ parameters. Moreover, we denote all the models in the ensemble as $\hat{f}_\theta$. To create our ensemble of parameterized Gaussian distribution models, we train each model independently with uniquely initialized random weights on a unique permutation of all the samples in the offline dataset, whereby each base model converges to its own local optimum; this is more effective than training the models on subsets of the dataset, i.e., bootstrapping [Lakshminarayanan et al., 2017].
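A minimal PyTorch sketch of one such ensemble member is given below; the layer widths, SiLU activations, softplus-free log-variance head, and clamping range are illustrative assumptions rather than our exact architecture or hyperparameters (those are given in Appendix J).

```python
import torch
import torch.nn as nn

class GaussianDynamicsMember(nn.Module):
    """One ensemble member: predicts p(z(t+delta) | z(t), a(t), delta) as a
    Gaussian with diagonal covariance, taking the time gap delta as an input."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        in_dim = state_dim + action_dim + 1          # state, action, time gap delta
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, 2 * state_dim),    # mean and log-variance per dim
        )

    def forward(self, z, a, delta):
        x = torch.cat([z, a, delta], dim=-1)
        mean, log_var = self.net(x).chunk(2, dim=-1)
        log_var = torch.clamp(log_var, -10.0, 4.0)   # keep variances numerically sane
        return mean, log_var.exp()                   # (mu, sigma^2), diagonal covariance

# An ensemble of M = 5 members, each independently initialized; epistemic
# uncertainty comes from disagreement between members at prediction time.
ensemble = [GaussianDynamicsMember(state_dim=4, action_dim=1) for _ in range(5)]
```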
We minimize the negative log-likelihood for each model separately, that is,

$$\mathcal{L}(\theta_m) = \sum_{j} \big[\mu_{\theta_m}(z(t_j), a(t_j), \delta_{j+1}) - z(t_{j+1})\big]^\top \Sigma^{-1}_{\theta_m}(z(t_j), a(t_j), \delta_{j+1}) \big[\mu_{\theta_m}(z(t_j), a(t_j), \delta_{j+1}) - z(t_{j+1})\big] + \log \det \Sigma_{\theta_m}(z(t_j), a(t_j), \delta_{j+1}),$$

where $\delta_{j+1} = \Delta_{j+1} = t_{j+1} - t_j$ arises from the offline dataset $\mathcal{D}$. Here each parameterized Gaussian distribution model captures heteroskedastic (i.e., the output noise distribution is a function of the input) aleatoric uncertainty. To capture the heteroskedastic epistemic uncertainty, we combine the individual models as a uniformly weighted mixture model and combine the predictions as $p_\theta(z(t+\delta) \mid z(t), a(t), \delta) = \frac{1}{M} \sum_{m=1}^{M} p_{\theta_m}(z(t+\delta) \mid z(t), a(t), \delta)$. Therefore, we can compute the effective mean and covariance (diagonal elements) of the mixture as

$$\mu_*(z(t), a(t), \delta) = \mathbb{E}_m[\mu_{\theta_m}(z(t), a(t), \delta)],$$
$$\sigma^2_*(z(t), a(t), \delta) = \mathbb{E}_m\big[\sigma^2_{\theta_m}(z(t), a(t), \delta) + \mu^2_{\theta_m}(z(t), a(t), \delta)\big] - \mu^2_*(z(t), a(t), \delta).$$
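A sketch of the per-member Gaussian negative log-likelihood and the mixture moment-matching above follows, assuming the `GaussianDynamicsMember` module sketched earlier; the training loop, learning rate, and batch format are illustrative assumptions.

```python
import torch

def gaussian_nll(mean, var, target):
    """Negative log-likelihood of a diagonal Gaussian (constant terms dropped),
    matching the per-member objective above."""
    return (((target - mean) ** 2) / var + torch.log(var)).sum(dim=-1).mean()

def ensemble_moments(ensemble, z, a, delta):
    """Moment-match the uniformly weighted mixture of M Gaussian members:
    mu_* = E_m[mu_m],  sigma_*^2 = E_m[sigma_m^2 + mu_m^2] - mu_*^2."""
    means, variances = zip(*[member(z, a, delta) for member in ensemble])
    means, variances = torch.stack(means), torch.stack(variances)
    mu_star = means.mean(dim=0)
    var_star = (variances + means ** 2).mean(dim=0) - mu_star ** 2
    return mu_star, var_star

def train_member(member, batches, lr=1e-3, epochs=100):
    """Fit one member independently on (z_t, a_t, delta, z_next) tuples."""
    opt = torch.optim.Adam(member.parameters(), lr=lr)
    for _ in range(epochs):
        for z_t, a_t, delta, z_next in batches:
            mean, var = member(z_t, a_t, delta)
            loss = gaussian_nll(mean, var, z_next)
            opt.zero_grad(); loss.backward(); opt.step()
```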
4.2 Active Observing Control Planning

We desire to use the trained uncertainty-aware dynamics model $\hat{f}_\theta$ to 1) plan the optimal action trajectory $a$ to execute and 2) plan when best to take the next observation. Intuitively, we use the probabilistic dynamics model $\hat{f}_\theta$ to plan an optimal action trajectory $a$ and execute it until the next observing time, as determined by the heuristic of when the predicted reward distribution over the planned trajectory crosses a set threshold (Figure 2). In the following, we detail these two steps.

1) Planning optimal actions. To plan an optimal action trajectory $a$, we specify that any model predictive control (MPC) planner can be used that can optimize the action trajectory up to a specified time horizon $H \in \mathbb{R}_+$. We opt to use an MPC planner, as it optimizes an entire action trajectory $a$ up to $H$, does not require the probabilistic dynamics model $\hat{f}_\theta$ to be differentiable (it uses no gradients), and the reward function $r$ can be changed on the fly, allowing changing goals/tasks at run-time. We use our probabilistic dynamics model $\hat{f}_\theta$ with the MPC planner of Model Predictive Path Integral control (MPPI), a zeroth-order particle-based trajectory optimizer [Williams et al., 2017], due to its competitive performance [Wang et al., 2022]. To optimize the action trajectory $a$ up to a time horizon $H$, it discretizes $H$ into smaller constant action time intervals $\delta_a \in \mathbb{R}_+$, which are then optimized, where there are $K = H/\delta_a \in \mathbb{Z}_+$ time steps in $H$, i.e., $t^{(k)} = t_i + k\delta_a$, $k \in \{0, \ldots, K-1\}$. It forward simulates several parallel rollouts $G \in \mathbb{Z}_+$, where the next state estimate at each time step is simulated as $z(t^{(k+1)}) = \mu_*(z(t^{(k)}), a(t^{(k)}), \delta_a)$ recursively. This requires a state estimate $z(t_i)$ to plan from; therefore, we recompute the planned trajectory when we observe a new sample of the state. Furthermore, we provide MPC MPPI planner pseudocode and details in Appendix E.

2) When to observe. We desire to observe when it is most informative to do so. Therefore, we seek to determine the time interval $\Delta_{i+1}$ for which the action trajectory $a$ can be followed until an observation is taken. Intuitively, we create a reward distribution over continuous time following the MPC-planned action trajectory $a$, which we can follow until the uncertainty crosses a set threshold $\tau \in \mathbb{R}_+$ — a hyperparameter to be tuned. Intuitively, we prefer the reward uncertainty rather than that of the state because we can achieve the task with a very certain reward despite having uncertain states. For instance, there might be multiple ways to achieve a task, and we know that each of our candidate action trajectories guarantees this, so we can take any of them; however, we are uncertain about which one. We empirically verify this design choice in Appendix L.1.

To create the continuous-time reward distribution and state distribution, we use our learned probabilistic dynamics model $\hat{f}_\theta$ to forward propagate the last observed state according to the planned action trajectory $a$. As noted by others [Chua et al., 2018], computing a closed-form expression for the distribution of the expected state trajectory is generally intractable; therefore, it is common to approximate the uncertainty propagation [Chua et al., 2018, Girard et al., 2002, Candela et al., 2003]. We generate this through Monte Carlo sampling by forward simulating $P \in \mathbb{Z}_+$ particle rollouts of state trajectories. That is, for a given state particle, for the next step of $\delta_a$ we sample a new state particle $z_p(t^{(k+1)}) \sim \mathcal{N}\big(\mu_*(z_p(t^{(k)}), a(t^{(k)}), \delta_a),\, \sigma^2_*(z_p(t^{(k)}), a(t^{(k)}), \delta_a)\big)$ recursively along the planned action trajectory. This allows us to apply the reward function to the state particles, generating reward particles, whereby we can compute the mean and standard deviation of the reward at each time step $t^{(k)}$, i.e., $r(t^{(k)}) \sim \mathcal{N}(\mu_r(t^{(k)}), \sigma^2_r(t^{(k)}))$. Therefore, using our estimate of $\sigma(r(t^{(k)}))$ over the planned trajectory allows us to determine the maximum time interval $\Delta_{i+1}$ until the reward uncertainty becomes greater than a set threshold $\tau$. Thus, the next time to observe is given by

$$\rho(z(t_i)) = \max\Big\{\Delta \in \mathbb{R}_+ : \sqrt{\mathbb{V}_{z_p}[r(t_i + \Delta)]} < \tau\Big\}.$$

This provides an estimate of the next observation interval $\Delta_{i+1}$ that is discretized to an integer number of $\delta_a$ intervals. However, we seek the continuous-time observation interval $\Delta_{i+1}$; therefore, we further refine this estimate through a continuous binary search (root-finding algorithm) up to a tolerance of $\delta_t \in \mathbb{R}_+$. We outline the AOC planning algorithm pseudocode in Appendix F.

Run-time complexity. Intuitively, we find that if the time taken to plan using an MPC controller is feasible in a problem setting, then AOC is feasible, as the run-time complexity of AOC is $O(GK + P(K + W))$ where $W = (\log(\delta_a) - \log(\delta_t))/\log(2) \in \mathbb{Z}_+$ (Appendix G). Empirically, we chose $P = 10G$, $W = 4$, $G = 10{,}000$ for all experiments. Although AOC takes approximately 2.4× longer to plan — which includes both planning the actions and when to observe, compared to Continuous Planning methods that only plan actions (an MPC step) — it can often take fewer observations overall, leading to less time spent planning overall. AOC is still a practical method that can plan in less time than the action interval $\delta_a$, even in robotics environments, where it takes 24.9 ms to plan the actions and the next observation time (Appendix G).

An important hyperparameter is the uncertainty threshold $\tau$. We find the following procedure to tune this parameter for a benchmark sufficient. For a particular environment, after training its dynamics model, one episode is run where the action trajectory $a$ is re-planned at every $\delta_a$ time step; we compute the median of the reward uncertainty over time for each rollout action plan, and then take the mean of these over the episode to produce $\tau$. This step is akin to a calibration step, where calibration is common in medicine [Preston et al., 1988] and robotics [Nubiola and Bonev, 2013]. We also include practical guidelines to select $c$ in Appendix H.
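The observing rule above can be sketched as follows, assuming the `ensemble_moments` helper from the earlier sketch, a per-state `reward_fn`, and an already-planned action sequence at resolution $\delta_a$; the continuous-time refinement takes a caller-supplied `std_at(delta)` callable (e.g., a particle re-simulation up to an arbitrary offset), so this is an expository sketch rather than the pseudocode of Appendix F.

```python
import torch

def reward_std_profile(ensemble, reward_fn, z0, actions, delta_a, n_particles=1000):
    """Propagate particles along the planned action sequence and return the
    standard deviation of the reward after each planning step t^(k)."""
    z = z0.expand(n_particles, -1).clone()
    stds = []
    for a_k in actions:                                   # actions: K x action_dim
        a = a_k.unsqueeze(0).expand(n_particles, -1)
        dt = torch.full((n_particles, 1), delta_a)
        mu, var = ensemble_moments(ensemble, z, a, dt)
        z = mu + var.sqrt() * torch.randn_like(mu)        # sample next state particles
        stds.append(reward_fn(z).std())                   # reward uncertainty at t^(k)
    return torch.stack(stds)

def next_observation_interval(stds, std_at, tau, delta_a, delta_t=1e-3):
    """rho(z(t_i)): largest interval whose reward std stays below tau, refined to
    continuous time by binary search; `std_at(delta)` is an assumed helper."""
    below = (stds < tau).int().cumprod(dim=0)             # 1 while threshold not crossed
    k = int(below.sum().item())                           # whole delta_a steps below tau
    if k == len(stds):
        return k * delta_a                                # never crosses within the plan
    lo, hi = k * delta_a, (k + 1) * delta_a               # crossing lies inside this bin
    while hi - lo > delta_t:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if std_at(mid) < tau else (lo, mid)
    return max(lo, delta_t)                               # always advance by at least delta_t
```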
5 Experiments and Evaluation

Benchmark environments. We use four continuous-time control environments and adapt them to add a fixed cost $c$ for taking an observation; these environments were selected as they exhibit a range of dynamical system state-region regimes. First, the Cancer environment uses a simulation of a Pharmacokinetic-Pharmacodynamic (PK-PD) model of lung cancer tumor growth [Geng et al., 2017] under continuous dosing treatment of chemotherapy and radiotherapy. We note that the same underlying model has also been used by others [Lim, 2018, Bica et al., 2020, Seedat et al., 2022]. Moreover, we use the standard continuous-time control environments from the ODE-RL suite [Yildiz et al., 2021], which consists of three environments: Pendulum, Cart Pole, and Acrobot. We note that all the environments are described by a differential equation (DE) and use a DE solver to simulate the fully continuous-in-time states and actions, unlike discrete environments [Yildiz et al., 2021, Brockman et al., 2016]. Furthermore, to increase realism, we add Gaussian noise $\mathcal{N}(0, \sigma_\epsilon^2)$ to observations taken, with standard deviation $\sigma_\epsilon = 0.01$. Specifically, the ODE-RL starting states have the poles hanging down at the stable equilibrium of the DE system, and the goal is to swing up and stabilize the pole(s) upright at the unstable equilibrium [Yildiz et al., 2021]. Additionally, the reward function form is the same across the environments, i.e., the exponential of the negative distance from the current state to the desired goal state $s^*$, while penalizing the action magnitudes, and we assume we are given this simple reward function when planning (however, we also show AOC can work with a learned reward model in Appendix L.9). Furthermore, we generate an offline dataset $\mathcal{D}$ from these environments that has irregular times between state-action samples, $\Delta_{i+1} \sim \text{Exp}(\lambda)$, with a mean of $\bar{\Delta} = \delta_a$ seconds. We detail all environments, including offline dataset generation, in Appendix I.

Table 2: Benchmark observing policies

| Benchmark | Observing Policy |
| --- | --- |
| Discrete Planning | $\rho(z(t_i)) = \delta_a$ |
| Discrete Monitoring | $\rho(z(t_i)) = \delta_a \max\{k : \sigma(t_i + k\delta_a) < \tau\}$ |
| Continuous Planning | $\rho(z(t_i)) = \delta_a$ |
| Active Observing Control | $\rho(z(t_i)) = \max\{\Delta : \sigma(t_i + \Delta) < \tau\}$ |

Benchmark methods. We seek to provide more competitive benchmarks than those outlined in Table 1; therefore, we compare against the following ablations of our method, Active Observing Control. Specifically, we focus on benchmarking observing policies $\rho$ and use the same action policy $\pi$ across all the benchmark methods, that of the same MPC MPPI planner with the hyperparameters fixed. We implement two discrete-time methods by first learning a discrete-time uncertainty-aware dynamics model (an ablation of the same model and hyperparameters, without the time interval input $\delta$ for predicting the next state) on the same offline dataset $\mathcal{D}$. Second, we use this discrete-time uncertainty-aware dynamics model to create two benchmark methods: Discrete Planning, which observes at each action time interval $\delta_a$, and Discrete Monitoring, a discrete ablation of our observing policy (Algorithm 3) that uses the same reward distribution estimation and determines the discrete action sequence to execute up until the reward uncertainty crosses the threshold $\tau$, at a discrete-time resolution of $\delta_a$.
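To make the Table 2 observing policies concrete, a small sketch of the rules follows; it assumes a reward-uncertainty profile of the kind computed in the previous sketch, and the function names are illustrative rather than our benchmark implementation.

```python
def observe_regular(delta_a):
    """Discrete Planning / Continuous Planning: observe every delta_a seconds."""
    return delta_a

def observe_discrete_monitoring(stds, tau, delta_a):
    """Discrete Monitoring: follow the plan for the largest whole number of delta_a
    steps whose reward uncertainty stays below tau (at least one step)."""
    k = 0
    while k < len(stds) and stds[k] < tau:
        k += 1
    return max(k, 1) * delta_a

def observe_aoc(stds, std_at, tau, delta_a, delta_t=1e-3):
    """Active Observing Control: as above, but refined to continuous time
    (see next_observation_interval in the previous sketch)."""
    return next_observation_interval(stds, std_at, tau, delta_a, delta_t)
```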
Moreover, we also benchmark against Continuous Planning, which uses our trained continuous-time uncertainty-aware dynamics model and takes observations at regular time intervals of a multiple of $\delta_a$. Finally, we also compare with a random action policy, Random, which observes at each action interval $\delta_a$. For ease of comparison, we list these observing policies in Table 2. We further provide the dynamics model implementation details, hyperparameters, and training details in Appendix J.

Table 3: Normalized utilities U, rewards R and observations O for the benchmark methods, across each environment. AOC performs the best across all environments. Results are averaged over 1,000 random seeds, with ± indicating 95% confidence intervals. Utilities and rewards are undiscounted and normalized to be between 0 and 100, where 0 corresponds to a Random agent and 100 corresponds to the expert, that of Continuous Planning, taking state observations at every $\delta_a$.

| Policy | Cancer U | Cancer R | Cancer O | Acrobot U | Acrobot R | Acrobot O | Cartpole U | Cartpole R | Cartpole O | Pendulum U | Pendulum R | Pendulum O |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random | 0±0 | 0±0 | 13±0 | 0±0 | 0±0 | 50±0 | 0±0 | 0±0 | 50±0 | 0±0 | 0±0 | 50±0 |
| Discrete Planning | 91.7±0.368 | 91.7±0.368 | 13±0 | 87.1±1.05 | 87.1±1.05 | 50±0 | 83.6±0.56 | 83.6±0.56 | 50±0 | 87.2±0.962 | 87.2±0.962 | 50±0 |
| Discrete Monitoring | 91±0.532 | 85.8±0.522 | 5.08±0.0327 | 89.6±1.02 | 80.2±1.14 | 43.7±0.189 | 127±0.846 | 82.9±0.532 | 42.3±0.107 | 130±2.52 | 87.3±0.957 | 42.1±0.293 |
| Continuous Planning | 100±0.153 | 100±0.153 | 13±0 | 100±0.462 | 100±0.462 | 50±0 | 100±0.772 | 100±0.772 | 50±0 | 100±0.904 | 100±0.904 | 50±0 |
| Active Observing Control | 105±0.18 | 98.8±0.169 | 3.37±0.0302 | 107±0.911 | 90.8±0.878 | 39±0.177 | 151±1.54 | 99.5±0.774 | 41.1±0.196 | 177±2.18 | 98.8±0.912 | 35.6±0.239 |

Evaluation. An uncertainty-aware dynamics model is trained for each environment on an offline dataset $\mathcal{D}$ of collected trajectories consisting of $10^6$ samples. The trained dynamics model weights are then frozen when used at run-time. We record the average undiscounted utility U (Equation (1)), average undiscounted reward R, and the number of observations taken O after running one episode of the benchmark method, and repeat this for 100 random seed runs for each result, with 95% confidence intervals throughout. Moreover, we normalize R and U following the standard normalization of offline RL [Yu et al., 2020], normalized to the interval of 0 to 100, where a score of 0 corresponds to a random policy's performance and 100 to an expert, which we assume here is the Continuous Planning benchmark. Furthermore, we detail these metrics and the experimental setup further in Appendix K.

5.1 Main results

We evaluated all the benchmark methods across all our environments, with results tabulated in Table 3. Active Observing Control achieves high average utility U on all environments. Specifically, AOC can achieve near-optimal state control performance R while using fewer observations O compared to taking the regular, frequent observations of Continuous Planning. Furthermore, it outperforms Discrete Monitoring, as it can determine when, in continuous time, to observe, whereas those methods suffer from discretization errors that compound, leading to worse performance.
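The normalization described under Evaluation follows the offline-RL convention [Yu et al., 2020]; a one-line sketch, with made-up numbers in the example:

```python
def normalized_score(score, random_score, expert_score):
    """Scale a raw utility/reward so that the Random policy maps to 0 and the
    expert (Continuous Planning) maps to 100, as in offline-RL benchmarks."""
    return 100.0 * (score - random_score) / (expert_score - random_score)

# Hypothetical example: a raw episode reward of 8.1 when Random achieves 2.0 and
# Continuous Planning achieves 8.3 gives a normalized reward of ~96.8.
print(normalized_score(8.1, 2.0, 8.3))
```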
Figure 3: Comparison of Active Observing Control against Continuous Planning on the cancer environment for one episode. Rows: A) Cancer volume $v$ state trajectory, with green vertical lines representing observation times. B) Reward uncertainty of the planned trajectory $a$ after an observation is taken, where the red horizontal line indicates the threshold $\tau$ used by AOC. C) Bar chart of the frequency of observations per state region. AOC automatically determines to observe larger cancer volumes more frequently as they are more informative, since the future state change magnitude is larger (Section 2.1), which correlates with clinicians' findings [Geng et al., 2017]. Whereas with Continuous Planning the observing frequency is regular, and therefore observations can be taken when the reward is still certain, suggesting an unnecessary observation.

Figure 4: Frequency of observations per state region for Pendulum. The intuition from Section 2.1 indicates that it is more informative to sample when $\dot{\phi}$ is larger; hence, being near the top goal equilibrium point ($\dot{\phi} \approx 0$) necessitates infrequent observations.

Table 4: Normalized utilities U and rewards R for the cancer environment, using the same normalization as in Table 3. Even when Continuous Planning takes the same number of observations as determined by AOC, it still performs worse, because those observations are not well located: Continuous Planning observations are taken blindly at regular times, rather than at more informative points.

| Policy (Cancer) | U | R | O |
| --- | --- | --- | --- |
| Active Observing Control | 105±0.183 | 98.8±0.173 | 3.39±0.0306 |
| Continuous Planning with O = 3 | 102±0.234 | 95.6±0.234 | 3±0 |
| Continuous Planning with O = 4 | 103±0.226 | 97.3±0.226 | 4±0 |

5.2 Insight experiments

In the following, we gain insight into why AOC performs better than Continuous Planning, i.e., better than taking regular observations in time, and into the importance of continuous-time monitoring as opposed to discrete-time monitoring.

How does irregularly observing achieve a higher expected utility than regularly observing? To better understand why monitoring (active observing) approaches outperform planning that regularly observes, we provide a qualitative and a quantitative study. On the one hand, qualitatively, as detailed in Figure 3, we observe that for a given threshold, AOC can still solve the Cancer environment while taking fewer costly observations, and that AOC automatically determines to increase the observing frequency when the tumor volume $v$ is large and to decrease it when $v$ is smaller. Crucially, this matches what clinicians have already discovered, where they categorize the cancer volume into five discrete bins in which the observing frequency is correlated with tumor volume [Geng et al., 2017, Seedat et al., 2022]. Moreover, this provides insight into how irregular observing can achieve a higher expected utility (Section 2.1), as observations can be more informative (hence taken more frequently) when the future state change magnitude is larger, which occurs for large $v$. Furthermore, as shown in Figure 4 for the Pendulum environment, we observe that AOC observes the state of the pendulum infrequently when it has been swung up to the upright goal state and is being balanced there. This matches our intuition from Section 2.1: future state changes of $\phi$, the angle of the pendulum, are large when the pendulum is not at the equilibrium goal state, thus having $\dot{\phi} > 0$, which is maximally large in the swing-up phase of the state trajectory and necessitates frequent observing, whereas balancing near the top equilibrium point, with $\dot{\phi} \approx 0$, necessitates infrequent observing. On the other hand, quantitatively, as tabulated in Table 4, we make Continuous Planning take the same number of observations as AOC, however taking them regularly. Since AOC can take better-located observations in time, it outperforms Continuous Planning, achieving a higher reward R. In summary, these experiments demonstrate that actively observing can be superior to planning in continuous time.
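The per-state-region observation frequencies shown in Figures 3C and 4 can be computed by simply binning the logged states at observation times; a sketch with a hypothetical log of observed cancer volumes follows.

```python
import numpy as np

def observation_frequency_by_region(obs_states, bins):
    """Histogram the states at which observations were taken (e.g., cancer volume v,
    or pendulum angle phi) to show which state regions are observed most often."""
    counts, edges = np.histogram(obs_states, bins=bins)
    return counts, edges

# Hypothetical example: cancer volumes (cm^3) at the times AOC chose to observe,
# binned into five regions as clinicians do [Geng et al., 2017].
volumes_at_obs = np.array([120.0, 340.0, 860.0, 910.0, 990.0])
counts, edges = observation_frequency_by_region(volumes_at_obs, bins=np.linspace(0, 1150, 6))
print(dict(zip(np.round(edges[:-1]), counts)))
```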
Why is it crucial to actively observe with continuous-time methods, rather than discrete-time methods? We compare Active Observing Control to Discrete Monitoring, which is the closest discrete counterpart (a discrete ablation of AOC), and look at when the observing times are reached, relative to their reward uncertainty estimates, to determine whether time discretization affects observing times. As shown in Figure 5, we see that Discrete Monitoring takes observations that are delayed, and therefore achieves a lower state reward R. This allows AOC to capture fast-moving states, where missing these can be catastrophic or deadly, as in the case of cancer.

Figure 5: Reward uncertainty $\sigma(r)$ normalized by the method-specific threshold $\tau$ on the cancer environment — thus the threshold to observe is at 1.0, as indicated by the red horizontal line (Active Observing Control: R = 98.8±0.174; Discrete Monitoring: R = 85.7±0.526). We observe that AOC can detect and catch when the uncertainty crosses the threshold, independent of the time discretization. In contrast, Discrete Monitoring suffers from time discretization error and takes a delayed observation, leading to compounding errors and worse performance. Here, Discrete Monitoring misses observing a critical test by over a day.

Figure 6: Normalized utility U on the cancer environment of Active Observing Control and Continuous Planning (O = 13), plotted against the changing uncertainty threshold $\tau$. We plot AOC's threshold as used in Table 3 as the red line. AOC maintains a high utility and reward over a wide range of thresholds whilst using fewer observations as $\tau$ increases, compared to Continuous Planning, which takes expensive, frequent observations. The threshold $\tau$ is varied between the minimum feasible $\tau_{\min}$ and maximum feasible $\tau_{\max}$ values.

5.3 Sensitivity analysis of τ

Active Observing Control still outperforms Continuous Planning even at variable $\tau$ on the cancer environment (Figure 6). We note that as $\tau$ is increased, the observing intervals $\Delta_{i+1}$ increase, and hence fewer observations of the state are taken. Although we set $\tau = 6.67$ for the cancer environment, following the procedure in Section 4, we observe robustness to other choices of $\tau$, which are also suitable with decreasing reward R; in all cases, AOC still outperforms and maintains a high utility U. Moreover, we tuned $\tau_{\min}$ so that observations were taken frequently enough to be at least equal in number to Continuous Planning, and $\tau_{\max}$ to take the fewest observations, which is often minimally 1 — as we start by taking an observation — or until the entire action trajectory plan is executed and re-planning requires taking another observation.

6 Conclusion and Future work

This work lays the foundation for the significant yet unexplored real-world problem of continuous-time control whilst deciding when to take costly observations, which has profound implications across diverse fields such as medicine, resource management, and low-power systems. Our key theoretical contribution, verified empirically, is that regular observing is not optimal and that irregularly observing can achieve a higher expected utility. To demonstrate the power of active irregular observing, we provided an initial solution using a heuristic threshold based on the variance of reward rollouts in an offline continuous-time model-based model predictive control planner.
However, the quest for the optimal method remains an open problem, providing fertile ground for future research.

This work is not without limitations. The heuristic threshold, while a robust initial solution in our experiments, may not be optimal or suitable for all scenarios. Moreover, our approach relies on the assumption that the offline dataset provides sufficient coverage of the state-action space to learn a useful enough model of the system dynamics, and that each observation reveals all the states plus Gaussian noise. Also, in some practical applications, the cost of observations could vary or be unknown. Navigating these limitations illuminates exciting pathways for future research. Potential directions include: optimizing multiple observations concurrently over a planned action trajectory, jointly optimizing action trajectories and observing times, and theoretically deriving the optimal solution to reveal characteristics of optimal observing policies.

Acknowledgements. SH would like to acknowledge and thank AstraZeneca for funding. This work was additionally supported by the Office of Naval Research (ONR) and the NSF (Grant number: 1722516). Moreover, we would like to warmly thank all the anonymous reviewers, alongside research group members of the van der Schaar lab, for their valuable input, comments and suggestions as the paper was developed; all these inputs ultimately improved the paper.

References

Rui Aguiar, Nikka Mofid, and Hyunji Alex Nam. Exploring optimal control with observations at a cost. arXiv preprint arXiv:2006.15757, 2020.

Ahmed M Alaa and Mihaela Van Der Schaar. Balancing suspense and surprise: Timely decision making with endogenous information acquisition. Advances in Neural Information Processing Systems, 29, 2016.

Arthur Argenson and Gabriel Dulac-Arnold. Model-based offline planning. In International Conference on Learning Representations, 2020.

Karl J Aström. Event based control. In Analysis and Design of Nonlinear Control Systems: In Honor of Alberto Isidori, pages 127–147. Springer, 2008.

Karl Johan Åström and Bo Bernhardsson. Comparison of periodic and event based sampling for first-order stochastic systems. IFAC Proceedings Volumes, 32(2):5006–5011, 1999.

Tim D Barfoot, Chi Hay Tong, and Simo Särkkä. Batch continuous-time trajectory estimation as exactly sparse Gaussian process regression. In Robotics: Science and Systems, volume 10, pages 1–10. Citeseer, 2014.

Andrew G Barto, Richard S Sutton, and Charles W Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, (5):834–846, 1983.

Colin Bellinger, Rory Coles, Mark Crowley, and Isaac Tamblyn. Active measure reinforcement learning for observation cost minimization. arXiv preprint arXiv:2005.12697, 2020a.

Colin Bellinger, Andriy Drozdyuk, Mark Crowley, and Isaac Tamblyn. Balancing information with observation costs in deep reinforcement learning. 2020b.

Ioana Bica, Ahmed M Alaa, James Jordon, and Mihaela van der Schaar. Estimating counterfactual treatment outcomes over time through adversarially balanced representations. arXiv preprint arXiv:2002.04083, 2020.

Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

L Brunereau, F Bruyère, C Linassier, and J-L Baulieu. The role of imaging in staging and monitoring testicular cancer. Diagnostic and Interventional Imaging, 93(4):310–318, 2012.
Shelly Butler, Denise Kirschner, and Suzzane Lenhart. Optimal control of chemotherapy affecting the infectivity of hiv. Advances in mathematical population dynamics: molecules, cells, man, pages 104 120, 1997. Joaquin Quinonero Candela, Agathe Girard, Jan Larsen, and Carl Edward Rasmussen. Propagation of uncertainty in bayesian kernel models-application to multiple-step ahead forecasting. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings.(ICASSP 03)., volume 2, pages II 701. IEEE, 2003. Souradip Chakraborty, Amrit Singh Bedi, Alec Koppel, Pratap Tokekar, and Dinesh Manocha. Dealing with sparse rewards in continuous control robotics via heavy-tailed policies. ar Xiv preprint ar Xiv:2206.05652, 2022. Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. Advances in neural information processing systems, 31, 2018. Kurtland Chua, Roberto Calandra, Rowan Mc Allister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Advances in neural information processing systems, 31, 2018. Alicia Curth, Alihan Hüyük, and Mihaela van der Schaar. Adaptively identifying patient populations with treatment benfit in clinical trials. In International Conference on Machine Learning, 2023. Marc Deisenroth and Carl E Rasmussen. Pilco: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11), pages 465 472. Citeseer, 2011. J Du, J Futoma, and F Doshi-Velez. Model-based reinforcement learning for semi-Markov decision processes with neural ODEs. In Proceedings of the 34th Conference on Neural Information Processing Systems, 2020. Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning. ar Xiv preprint ar Xiv:2004.07219, 2020. Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. Advances in neural information processing systems, 34:20132 20145, 2021. Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actorcritic methods. In International conference on machine learning, pages 1587 1596. PMLR, 2018. Changran Geng, Harald Paganetti, and Clemens Grassberger. Prediction of treatment response for combined chemo-and radiation therapy for non-small cell lung cancer patients using a biomathematical model. Scientific reports, 7(1):1 12, 2017. Daniel T Gillespie. Markov processes: an introduction for physical scientists. Elsevier, 1991. Agathe Girard, Carl Edward Rasmussen, J Quinonero-Candela, R Murray-Smith, O Winther, and J Larsen. Multiple-step ahead prediction for non linear dynamic systems a gaussian process treatment with propagation of the uncertainty. Advances in neural information processing systems, 15:529 536, 2002. David Ha and Jürgen Schmidhuber. World models. ar Xiv preprint ar Xiv:1803.10122, 2018. Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control. ar Xiv preprint ar Xiv:2203.04955, 2022. David Hoeller, Farbod Farshidian, and Marco Hutter. Deep value model predictive control. In Conference on Robot Learning, pages 990 1004. PMLR, 2020. Samuel Holt, Zhaozhi Qian, and Mihaela van der Schaar. Deep generative symbolic regression. In The Eleventh International Conference on Learning Representations, 2022a. 
Samuel Holt, Alihan Hüyük, Zhaozhi Qian, Hao Sun, and Mihaela van der Schaar. Neural Laplace control for continuous-time delayed systems. In International Conference on Artificial Intelligence and Statistics, 2023. Samuel I Holt, Zhaozhi Qian, and Mihaela van der Schaar. Neural laplace: Learning diverse classes of differential equations in the laplace domain. In International Conference on Machine Learning, pages 8811 8832. PMLR, 2022b. Yunhan Huang and Quanyan Zhu. Infinite-horizon linear-quadratic-gaussian control with costly measurements. ar Xiv preprint ar Xiv:2012.14925, 2020. Yunhan Huang, Veeraruna Kavitha, and Quanyan Zhu. Continuous-time markov decision processes with controlled observations. In 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 32 39. IEEE, 2019. Alihan Hüyük, Zhaozhi Qian, and Mihaela van der Schaar. When to make and break commitments? In International Conference on Learning Representations, 2023. Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. Advances in Neural Information Processing Systems, 32, 2019. Daniel Jarrett and Mihaela Van Der Schaar. Inverse active sensing: Modeling and understanding timely decision-making. In International Conference on Machine Learning, pages 4713 4723. PMLR, 2020. Napat Karnchanachari, Miguel Iglesia Valls, David Hoeller, and Marco Hutter. Practical reinforcement learning for mpc: Learning from sparse objectives in under an hour on a real robot. In Learning for Dynamics and Control, pages 211 224. PMLR, 2020. Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. Morel: Modelbased offline reinforcement learning. Advances in neural information processing systems, 33: 21810 21823, 2020. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. Marin Kobilarov. Cross-entropy motion planning. The International Journal of Robotics Research, 31(7):855 871, 2012. David Krueger, Jan Leike, Owain Evans, and John Salvatier. Active reinforcement learning: Observing rewards at a cost. ar Xiv preprint ar Xiv:2011.06709, 2020. Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. Advances in Neural Information Processing Systems, 32, 2019. Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179 1191, 2020. Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30, 2017. Bruce A Larson. Principles of stochastic dynamic optimization in resource management: the continuous-time case. Agricultural Economics, 7(2):91 107, 1992. Suzanne Lenhart and John T Workman. Optimal control applied to biological models. CRC press, 2007. Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. ar Xiv preprint ar Xiv:1509.02971, 2015. Bryan Lim. Forecasting treatment responses over time using recurrent marginal structural networks. advances in neural information processing systems, 31, 2018. Zhichu Lin and Mihaela van der Schaar. 
Autonomic and distributed joint routing and power control for delay-sensitive applications in multi-hop wireless networks. IEEE Transactions on Wireless Communications, 10(1):102 113, 2010. Michael Lutter, Leonard Hasenclever, Arunkumar Byravan, Gabriel Dulac-Arnold, Piotr Trochim, Nicolas Heess, Josh Merel, and Yuval Tassa. Learning dynamics models for model predictive agents. ar Xiv preprint ar Xiv:2109.14311, 2021. Ruben Martinez-Cantin, Nando de Freitas, Arnaud Doucet, and José A Castellanos. Active policy learning for robot planning and exploration under uncertainty. In Robotics: Science and systems, volume 3, pages 321 328, 2007. Nicholas Mastronarde and Mihaela van der Schaar. Joint physical-layer and system-level power management for delay-sensitive wireless communications. IEEE Transactions on Mobile Computing, 12(4):694 709, 2012. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. ar Xiv preprint ar Xiv:1312.5602, 2013. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529 533, 2015. Thomas M Moerland, Joost Broekens, and Catholijn M Jonker. Model-based reinforcement learning: A survey. ar Xiv preprint ar Xiv:2006.16712, 2020. Juliette Morgan, Martin I Meltzer, Brian D Plikaytis, Andre N Sofair, Sharon Huie-White, Steven Wilcox, Lee H Harrison, Eric C Seaberg, Rana A Hajjeh, and Steven M Teutsch. Excess mortality, hospital stay, and cost due to candidemia: a case-control study using data from population-based candidemia surveillance. Infection control & hospital epidemiology, 26(6):540 547, 2005. Hyun Ji Alex Nam, Scott Fleming, and Emma Brunskill. Reinforcement learning with state observation costs in action-contingent noiselessly observable markov decision processes. Advances in Neural Information Processing Systems, 34:15650 15666, 2021. Tianwei Ni and Eric Jang. Continuous control on time. In ICLR 2022 Workshop on Generalizable Policy Learning in Physical World, 2022. Kenzo Nonami, Farid Kendoul, Satoshi Suzuki, Wei Wang, and Daisuke Nakazawa. Autonomous flying robots: unmanned aerial vehicles and micro aerial vehicles. Springer Science & Business Media, 2010. Albert Nubiola and Ilian A Bonev. Absolute calibration of an abb irb 1600 robot using a laser tracker. Robotics and Computer-Integrated Manufacturing, 29(1):236 245, 2013. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019a. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019b. Roy C Preston, David R Bacon, and Robert A Smith. Calibration of medical ultrasonic equipmentprocedures and accuracy assessment. IEEE transactions on ultrasonics, ferroelectrics, and frequency control, 35(2):110 121, 1988. Heike Reinhardt, Petra Otte, Alison G Eggleton, Markus Ruch, Stefan Wöhrl, Stefanie Ajayi, Justus Duyster, Manfred Jung, Martin J Hug, and Monika Engelhardt. 
Avoiding chemotherapy prescribing errors: analysis and innovative strategies. Cancer, 125(9):1547 1557, 2019. Jacques Richalet, André Rault, JL Testud, and J Papon. Model predictive heuristic control: Applications to industrial processes. Automatica, 14(5):413 428, 1978. Tim Salzmann, Elia Kaufmann, Marco Pavone, Davide Scaramuzza, and Markus Ryll. Neural-mpc: Deep learning model predictive control for quadrotors and agile robotic platforms. ar Xiv preprint ar Xiv:2203.07747, 2022. Nabeel Seedat, Fergus Imrie, Alexis Bellot, Zhaozhi Qian, and Mihaela van der Schaar. Continuoustime modeling of counterfactual outcomes using neural controlled differential equations. ar Xiv preprint ar Xiv:2206.08311, 2022. Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics-aware unsupervised discovery of skills. In International Conference on Learning Representations, 2019. Sahil Sharma, Aravind Srinivas, and Balaraman Ravindran. Learning to repeat: Fine grained action repetition for deep reinforcement learning. ar Xiv preprint ar Xiv:1702.06054, 2017. Hao Sun, Alihan Hüyük, and Mihaela van der Schaar. Accountable batched control with decision corpus. In Conference on Neural Information Processing Systems, 2023. Richard S Sutton. Between mdps and semi-mdps: Learning, planning, and representing knowledge at multiple temporal scales. 1998. Irena Twardowska, Sebastian Stefaniak, Herbert E Allen, and Max M Häggblom. Soil and water pollution monitoring, protection and remediation, volume 69. Springer Science & Business Media, 2007. Volodymyr Vasyutynskyy and Klaus Kabitzsch. Event-based control: Overview and generic model. In 2010 IEEE International Workshop on Factory Communication Systems Proceedings, pages 271 279. IEEE, 2010. Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, et al. Scipy 1.0: fundamental algorithms for scientific computing in python. Nature methods, 17(3):261 272, 2020. Jianhao Wang, Wenzhe Li, Haozhe Jiang, Guangxiang Zhu, Siyuan Li, and Chongjie Zhang. Offline reinforcement learning with reverse model-based imagination. Advances in Neural Information Processing Systems, 34:29420 29432, 2021. Tingwu Wang, Xuchan Bao, Ignasi Clavera, Jerrick Hoang, Yeming Wen, Eric Langlois, Shunshi Zhang, Guodong Zhang, Pieter Abbeel, and Jimmy Ba. Benchmarking model-based reinforcement learning. ar Xiv preprint ar Xiv:1907.02057, 2019. Ziyi Wang, Augustinos D Saravanos, Hassan Almubarak, Oswin So, and Evangelos A Theodorou. Sampling-based optimization for multi-agent model predictive control. ar Xiv preprint ar Xiv:2211.11878, 2022. Grady Williams, Paul Drews, Brian Goldfain, James M Rehg, and Evangelos A Theodorou. Aggressive driving with model predictive path integral control. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 1433 1440. IEEE, 2016. Grady Williams, Nolan Wagener, Brian Goldfain, Paul Drews, James M Rehg, Byron Boots, and Evangelos A Theodorou. Information theoretic mpc for model-based reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 1714 1721. IEEE, 2017. Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. ar Xiv preprint ar Xiv:1911.11361, 2019. Cagatay Yildiz, Markus Heinonen, and Harri Lähdesmäki. Continuous-time model-based reinforcement learning. In International Conference on Machine Learning, pages 12009 12018. 
Hidekazu Yoshioka and Motoh Tsujimura. Analysis and computation of an optimality equation arising in an impulse control problem with discrete and costly observations. Journal of Computational and Applied Mathematics, 366:112399, 2020.
Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Y Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization. Advances in Neural Information Processing Systems, 33:14129-14142, 2020.
Zichen Zhang, Johannes Kirschner, Junxi Zhang, Francesco Zanini, Alex Ayoub, Masood Dehghan, and Dale Schuurmans. Managing temporal resolution in continuous value estimation: A fundamental trade-off. arXiv preprint arXiv:2212.08949, 2022.
Yaofeng Desmond Zhong, Biswadip Dey, and Amit Chakraborty. Symplectic ODE-Net: Learning Hamiltonian dynamics with control. arXiv preprint arXiv:1909.12077, 2019.

Table of Contents
A Broader Impact Statement
B Further Application Examples
C Proof for Proposition 2.1
D Extended Related Work
E MPC MPPI Pseudocode and Planner Implementation Details
F Active Observing Control Planning Algorithm Pseudocode
G Active Observing Control Run-time Complexity
H Practical Guidelines to Select c
I Environment Selection and Details
I.1 ODE-RL Suite
I.2 Cancer Environment
J Benchmark Method Implementation Details
J.1 Probabilistic Dynamics Model
J.2 Observing Policies
J.3 MPPI Implementation
K Evaluation Metrics
L Additional Experiments
L.1 Ablation of Thresholding the State Uncertainty
L.2 Sparse Reward Environment
L.3 Increasing the Planning Resolution of MPC (decreasing δa)
L.4 Discrete Time Controller Trained on an Equidistant Dataset
L.5 Sensitivity of AOC to δa
L.6 Evaluation of Reward Uncertainty
L.7 Relation Between Rewards R and the Fixed Observation Cost c
L.8 Dependence on the Accuracy of the Learned Dynamics Model
L.9 Extending AOC to Work with a Learned Reward Model
L.10 AOC Also Empirically Works for Non-linear State Transformations
L.11 Further Empirical Validation in Other Real-World Scenarios

Code. All code is available at https://github.com/samholt/ActiveObservingInContinuous-timeControl.

A Broader Impact Statement
Our work on active observing in continuous-time control presents a novel perspective on decision-making in domains such as medicine, low-power systems, and resource management, with the potential to substantially reduce costs and increase efficiency. However, the adoption of these techniques also presents potential risks and ethical concerns. The dependability of these systems could raise issues if they malfunction or are compromised, potentially leading to harmful decisions in critical fields like healthcare. Hence, robustness, security, and error handling must be prioritized in implementation.
B Further Application Examples
The problem of continuously controlling continuous-time environments whilst actively deciding when to take costly observations in time is ubiquitous in real-world environments, with some of these applications listed in Table 5.

Table 5: Applications of continuous control with costly observations.
Continuous-time environment (Ref.) | Continuous control | Costly observation | High stakes
Medical cancer chemotherapy treatment [Geng et al., 2017, Brunereau et al., 2012] | Continuous chemotherapy dosing | Computed tomography scan | Errors can be fatal
Biological fish population management [Yoshioka and Tsujimura, 2020] | Continuous food and temperature control | Fish population survey | Underfeeding or overfishing can lead to extinction
Low power communication [Mastronarde and van der Schaar, 2012, Lin and van der Schaar, 2010] | Continuous channel transmission | Measure the maximum bandwidth | Communication rate errors lead to loss of data
Mobile robotics [Martinez-Cantin et al., 2007] | Continuous robot control | Measure the robot's position | Errors can lead to damage to the robot or environment
Agricultural resource management [Larson, 1992] | Continuous land allocation for livestock | Measure the health of livestock | Errors can lead to livestock death
Nursing surveillance [Morgan et al., 2005] | Continuous treatment plans | Measure the health of a patient | Errors can be fatal
Water pollution monitoring [Twardowska et al., 2007] | Continuous water treatment | Measure the water quality | Errors can lead to water contamination

C Proof for Proposition 2.1
Proof. We will consider the system defined by

ds(t)/dt = s(t),   s(0) = +1 with probability 1/2, −1 with probability 1/2,

with noisy observations z(t_j) = s(t_j) + ε(t_j), where ε(t_j) ~ N(0, 1). Note that the solution to this system can be written as s(t) = s(0) e^t, which is independent of any actions that might have been taken. Also, consider the reward function given by

r(s, a, t) = δ(t − T) · 1{s · a > 0}.

Given a history of samples h = {(t_j, z(t_j), a(t_j))}_{j=1}^{n}, a posterior over s(0) can be written as

P(s(0) = +1 | h) = P(s(0) = +1) P(h | s(0) = +1) / P(h) = (1 / (2 P(h))) ∏_{j=1}^{n} (1/√(2π)) e^{−(1/2)(z(t_j) − e^{t_j})²},
P(s(0) = −1 | h) = P(s(0) = −1) P(h | s(0) = −1) / P(h) = (1 / (2 P(h))) ∏_{j=1}^{n} (1/√(2π)) e^{−(1/2)(z(t_j) + e^{t_j})²}.

Using this posterior, we define an auxiliary variable

b(h) = log P(s(0) = +1 | h) − log P(s(0) = −1 | h) = −(1/2) Σ_{j=1}^{n} (z(t_j) − e^{t_j})² + (1/2) Σ_{j=1}^{n} (z(t_j) + e^{t_j})².

Then, the optimal decision policy can be written as

π*(h, t) = +1 if P(s(0) = +1 | h) ≥ 1/2, equivalently b(h) ≥ 0;   −1 if P(s(0) = +1 | h) < 1/2, equivalently b(h) < 0.

Next, we derive the expected utility of the optimal decision policy given a set of observing times {t_j ∈ [0, T]}_{j=1}^{n}.
Denoting by h_n = {(t_j, z(t_j), a(t_j))}_{j=1}^{n} the final history, we have

R = ∫_0^T r(s(t), a(t), t) dt
  = ∫_0^T δ(t − T) · 1{s(t) · a(t) > 0} dt
  = 1{s(T) · a(T) > 0}
  = 1{s(T) · π*(h_n, T) > 0}
  = 1{s(0) = +1 ∧ π*(h_n, T) = +1} + 1{s(0) = −1 ∧ π*(h_n, T) = −1}
  = 1{s(0) = +1 ∧ b(h_n) ≥ 0} + 1{s(0) = −1 ∧ b(h_n) < 0},

and hence

E[R | {t_j}_{j=1}^{n}] = P(s(0) = +1 ∧ b(h_n) ≥ 0 | {t_j}) + P(s(0) = −1 ∧ b(h_n) < 0 | {t_j})
  = P(s(0) = +1 | {t_j}) P(b(h_n) ≥ 0 | s(0) = +1, {t_j}) + P(s(0) = −1 | {t_j}) P(b(h_n) < 0 | s(0) = −1, {t_j})
  = 1/2 · P(b(h_n) ≥ 0 | s(0) = +1, {t_j}) + 1/2 · P(b(h_n) < 0 | s(0) = −1, {t_j}).

Observing that

P(b(h_n) ≥ 0 | s(0) = +1, {t_j})
  = P( −(1/2) Σ_{j=1}^{n} (z(t_j) − e^{t_j})² + (1/2) Σ_{j=1}^{n} (z(t_j) + e^{t_j})² ≥ 0 | s(0) = +1 )
  = P( −(1/2) Σ_{j=1}^{n} (ε(t_j) + s(t_j) − e^{t_j})² + (1/2) Σ_{j=1}^{n} (ε(t_j) + s(t_j) + e^{t_j})² ≥ 0 | s(0) = +1 )
  = P( −(1/2) Σ_{j=1}^{n} (ε(t_j))² + (1/2) Σ_{j=1}^{n} (ε(t_j) + 2e^{t_j})² ≥ 0 )
  = P( 2 Σ_{j=1}^{n} (e^{2t_j} + e^{t_j} ε(t_j)) ≥ 0 )
  = P_{ε ~ N(0, Σ_{j=1}^{n} e^{2t_j})}( ε > −Σ_{j=1}^{n} e^{2t_j} )
  = P_{ε ~ N(0, 1)}( ε > −√(Σ_{j=1}^{n} e^{2t_j}) )
  = F( √(Σ_{j=1}^{n} e^{2t_j}) ),

where F(x) = ∫_{−∞}^{x} (1/√(2π)) e^{−(1/2)u²} du is the cumulative distribution function of the standard normal distribution, and similarly that P(b(h_n) < 0 | s(0) = −1, {t_j}) = F( √(Σ_{j=1}^{n} e^{2t_j}) ), we obtain

E[U | {t_j}_{j=1}^{n}] = E[R | {t_j}_{j=1}^{n}] − cn = F( √(Σ_{j=1}^{n} e^{2t_j}) ) − cn.

Next, we introduce two classes of observing policies. The first class consists of observing policies that observe regularly in time with an observing interval of δ, such that ρ(reg,δ)(h) = δ for all h. When rolled out, the observing times for these policies can be written as t_j = jδ, hence

E_{ρ(reg,δ)}[U] = F( √(Σ_{j=1}^{⌊T/δ⌋} e^{2jδ}) ) − c⌊T/δ⌋.

The second class consists of observing policies that take ℓ-many observations all at t = T, such that

ρ(irreg,ℓ)(h = {(t_j, z(t_j), a(t_j))}_{j=1}^{n}) = T − t_n if n < ℓ, and ∞ if n ≥ ℓ.

When rolled out, the observing times for these policies can be written as t_j = T, hence

E_{ρ(irreg,ℓ)}[U] = F( √(ℓ e^{2T}) ) − cℓ.

Crucially, ρ(irreg,ℓ) with ℓ ≥ 2 is distinct from all regular observing policies ρ(reg,δ) with δ > 0, since for no regular observing policy t_1 = t_2. We prove that the optimal observing policy ρ* cannot be a regular observing policy by showing that, for each ρ(reg,δ), there exists at least one ρ(irreg,ℓ) with ℓ ≥ 2 that achieves higher expected utility. For δ = 0, E_{ρ(reg,δ)}[U] = −∞, since infinitely many costly observations are taken, whereas E_{ρ(irreg,ℓ)}[U] is bounded for any ℓ ∈ {2, 3, . . .} (hence E_{ρ(irreg,ℓ)}[U] > E_{ρ(reg,δ)}[U]). For 0 < δ ≤ T/2, we have ⌊T/δ⌋ ≥ 2 and E_{ρ(irreg,⌊T/δ⌋)}[U] > E_{ρ(reg,δ)}[U] since

E_{ρ(reg,δ)}[U] − E_{ρ(irreg,⌊T/δ⌋)}[U]
  = F( √(Σ_{j=1}^{⌊T/δ⌋} e^{2jδ}) ) − c⌊T/δ⌋ − F( √(⌊T/δ⌋ e^{2T}) ) + c⌊T/δ⌋
  = F( √(Σ_{j=1}^{⌊T/δ⌋} e^{2jδ}) ) − F( √(⌊T/δ⌋ e^{2T}) )
  < F( √(⌊T/δ⌋ e^{2T}) ) − F( √(⌊T/δ⌋ e^{2T}) ) = 0,

where the inequality is strict because jδ ≤ T for every j ≤ ⌊T/δ⌋ with δ < T, so Σ_{j=1}^{⌊T/δ⌋} e^{2jδ} < ⌊T/δ⌋ e^{2T} and F is strictly increasing. Let the cost per observation be given as

c = min{ (F(√2 e^T) − F(e^T)) / 2, (F(√2 e^T) − F(0)) / 4 } > 0.

Then, for T/2 < δ ≤ T, we have ⌊T/δ⌋ = 1 and E_{ρ(irreg,2)}[U] > E_{ρ(reg,δ)}[U] since

E_{ρ(reg,δ)}[U] − E_{ρ(irreg,2)}[U]
  = F( √(e^{2δ}) ) − c − F( √(2 e^{2T}) ) + 2c
  = F( e^{δ} ) − F( √2 e^{T} ) + c
  ≤ F( e^{T} ) − F( √2 e^{T} ) + c
  ≤ −2c + c
  = −c < 0.

Finally, for T < δ, we have ⌊T/δ⌋ = 0 and E_{ρ(irreg,2)}[U] > E_{ρ(reg,δ)}[U] since

E_{ρ(reg,δ)}[U] − E_{ρ(irreg,2)}[U]
  = F(0) − F( √(2 e^{2T}) ) + 2c
  ≤ −4c + 2c
  = −2c < 0,

which completes the proof.

D Extended Related Work
In the following, we expand on the existing related work in Section 3, and we provide additional discussions of the benefits of offline RL, model-based RL, why model predictive control is our preferred action policy, event-based sampling, semi-MDP methods, and Linear Quadratic Regulator (LQR) & Linear Quadratic Gaussian (LQG) methods.

Why offline RL. The setting of offline reinforcement learning consists of an agent that learns a dynamics model (i.e., a model of the environment dynamics) from a dataset of previously collected state-action trajectories from a specific environment, and the agent is not allowed to interact with the environment [Wu et al., 2019].
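As a purely illustrative numerical check of the closed-form utilities derived in Appendix C (it is not part of the paper's experiments), the short script below evaluates E_{ρ(reg,δ)}[U] and E_{ρ(irreg,ℓ)}[U] for an assumed horizon T = 0.5 and the per-observation cost from the proof; for these values the policy taking two observations at t = T attains a higher expected utility than any of the regular intervals evaluated.

    import math


    def F(x):
        # Standard normal cumulative distribution function.
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


    def utility_regular(delta, T, c):
        # E_{rho(reg,delta)}[U] = F(sqrt(sum_{j=1}^{floor(T/delta)} e^{2 j delta})) - c * floor(T/delta)
        n = math.floor(T / delta)
        total = sum(math.exp(2.0 * j * delta) for j in range(1, n + 1))
        return F(math.sqrt(total)) - c * n


    def utility_irregular(l, T, c):
        # E_{rho(irreg,l)}[U] = F(sqrt(l * e^{2T})) - c * l
        return F(math.sqrt(l * math.exp(2.0 * T))) - c * l


    T = 0.5  # illustrative horizon (an assumed value, not one used in the paper)
    c = min((F(math.sqrt(2) * math.exp(T)) - F(math.exp(T))) / 2,
            (F(math.sqrt(2) * math.exp(T)) - F(0.0)) / 4)  # cost per observation, as in the proof
    for delta in (0.1, 0.25, 0.5, 0.6):
        print(f"regular  delta={delta:4.2f}  E[U] = {utility_regular(delta, T, c):.4f}")
    for l in (2, 3):
        print(f"irregular l={l}        E[U] = {utility_irregular(l, T, c):.4f}")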
For instance, it is challenging to apply online RL methods to real-world problem settings, as they rely on expensive trial-and-error approaches performed either online or in a realistic simulator that is not often possible. In contrast, offline model-based RL learns a model of the environment dynamics from a previously collected dataset of state-action trajectories, that is often possible. These methods then control the environment to a desired goal state using any available planning method such as training a policy [Fujimoto et al., 2018] or using a model predictive controller (MPC) [Williams et al., 2016]. Specifically, both model-free [Kumar et al., 2019, 2020, Fujimoto and Gu, 2021] and model-based [Kidambi et al., 2020, Wang et al., 2021] approaches have been proposed for offline RL, where in general model-based methods have shown to be more sample efficient than model-free methods [Moerland et al., 2020]. Why model-based reinforcement learning? Core to Model-based RL methods is the desire to create policies for real world tasks in the offline setting where the true environment dynamics are unknown and rather learn an approximate dynamics model from an offline dataset of state-action trajectories from the environment. A key benefit of model-based reinforcement learning is that it can be significantly more sample efficient compared to model-free RL methods [Lutter et al., 2021] where model-free methods can require millions or billions of interactions with the environment to learn a good policy. Another key benefit is that learning a dynamics model of the environment enables planning with that dynamics model to optimize actions, such as in using a model predictive control planner, rather than learning a specific policy. Principally, the dynamics model is independent of the reward and therefore allows changing reward of the planning policy enabling a planning-based policy to easily adapt to changing goal states/tasks on the fly at run-time. Conversely, a policy that was solely trained for a particular task would have to be re-trained, or a conditional policy trained when adapting goals/tasks [Lutter et al., 2021]. This is realistic as real-world continuous-time control settings of the dynamics model (e.g., often the physics/biological process of an environment [Holt et al., 2022a]) is independent of the reward function. Importantly model-based RL for conventional control tasks has been shown to be more sample-efficient than model-free methods [Wang et al., 2019, Moerland et al., 2020]. Historically, model-free methods have demonstrated expert performance on challenging tasks [Mnih et al., 2015, Lillicrap et al., 2015], and was later shown that proper tuning of model-based methods have much higher sample efficiency Ha and Schmidhuber [2018], Janner et al. [2019] and can achieve expert performance with probabilistic dynamics models on continuous control tasks [Chua et al., 2018]. Model predictive control (MPC). Core to model predictive control (MPC) is having a good dynamics model, historically using simple, first principle derived (often linear), known dynamics models [Richalet et al., 1978, Salzmann et al., 2022]. Recently, MPC Model Predictive Path Integral (MPPI) [Williams et al., 2017] is a zeroth-order particle-based (action) trajectory optimizer [Williams et al., 2017] method that can use a general, non-differentiable nonlinear dynamics model with a defined complex reward function (or cost function). Particularly, Williams et al. 
[2017] demonstrated that it was suitable to be used with a learned neural network dynamics model which was used for the task of driving a vehicle around a dirt track in aggressive, actuator saturating maneuvers. In general, MPC is often computationally infeasible for optimizing actions up to long time horizons, therefore are often used to optimize actions up to a fixed time horizon, which is re-computed iteratively when needed. A key benefit of planning, and MPC, is that it can incorporate state constraint feasibility into its plans as well as easily changing the reward function for a changing goal/task at run-time. We specifically use this MPC method in this paper, as it is state-of-the-art [Wang et al., 2022] for performing MPC with a learned dynamics model, improving upon the widely used cross-entropy method (CEM) MPC planner [Kobilarov, 2012]. Hybrid MPC: Combining a powerful MPC planner with a learned policy is another exciting area developing [Argenson and Dulac-Arnold, 2020]. These existing hybrid works, work on only discretetime environment dynamics models. These include MBOP [Argenson and Dulac-Arnold, 2020], TD-MPC [Hansen et al., 2022] and DADS [Sharma et al., 2019]. We highlight that these hybrid methods are unable to determine when to observe in continuous-time and only determine discrete actions to execute. Sensing approaches have been proposed of when to optimally take an observation in both discrete time [Jarrett and Van Der Schaar, 2020] and continuous time [Alaa and Van Der Schaar, 2016, Barfoot et al., 2014] where their goal is to identify an underlying state. However, these approaches cannot also continuously control the system. In contrast, Active Observing Control seeks to both actively observe and control the system. Besides deciding when to take an observation, some work on sensing focuses on when to stop taking observations [e.g. Hüyük et al., 2023] or what kind of observations to take [e.g. Curth et al., 2023]. Unlike these works, we focus exclusively on the timing of observations. Planning approaches only focus on optimal decision policies π, and therefore observing has to be provided by a schedule, which is often assumed to be at a regular constant interval in time i+1 = i that is, observations are not actively planned [Yildiz et al., 2021]. Within planning approaches, there exist many discrete-time approaches [Chua et al., 2018, Mnih et al., 2013, Williams et al., 2017, Sun et al., 2023] and recently more continuous-time approaches [Du et al., 2020, Yildiz et al., 2021, Holt et al., 2023] where these use a continuous-time dynamics model, e.g. [Chen et al., 2018, Holt et al., 2022b]. Specifically, Yildiz et al. [2021] presented a seminal online continuous-time model-based RL algorithm, leveraging a continuous-time dynamics model that can predict the next state at an arbitrary time s(t). However, all these methods are unable to plan when to take the next observation, in contrast to Active Observing Control which is able to determine when to take the next observation whilst planning an action trajectory. Monitoring approaches consist of both a decision policy π and an observing policy ρ; however, existing approaches only consider the discrete-time and discrete-state setting with observation costs [Sharma et al., 2017, Nam et al., 2021, Aguiar et al., 2020, Bellinger et al., 2020a, Krueger et al., 2020, Bellinger et al., 2020b]. In particular, Krueger et al. 
[2020] proposed that even in a simple setting of a discrete-state multi-armed bandit, computing the optimal time to take the next observation is intractable therefore, they must rely on heuristics for their observing policy1. Broadly, discrete-time monitoring methods use a discrete-time model and propagate the last observed state, often appending either a flag or a counter to the state-action tuple to indicate the time since an observation was taken [Aguiar et al., 2020, Bellinger et al., 2020b]. However, these methods cannot be applied to continuous-time environments or propagate the predicted current state and its associated prediction interval of the uncertainty associated with the state estimate. Moreover, training a policy [Aguiar et al., 2020] that decides at each state whether to observe it or not, requires a discretization of time. Whereas Active Observing Control, which determines the continuous-time interval of when to observe the next state, does not require any discretization of time, and hence it is a continuous-time method and avoids compounding time discretization errors. One approach exists that we term semi-continuous monitoring, where Huang et al. [2019] proposes a discrete-state, constant action policy, that determines when to take an observation in continuous time of a controlled Markov Jump Process. However, this approach is limiting, as it assumes actions are constant until the next sample of the state is taken [Ni and Jang, 2022] which is clearly suboptimal, uses discrete states, and assumes a Markov Jump Process [Gillespie, 1991]. Instead, Active Observing Control is fully continuous in both states and observing times controlled by an action trajectory a, giving rise to smoothly-varying states. Event-based sampling a similar related work area in the control community [Åström and Bernhardsson, 1999, Aström, 2008] , which addresses the similar problem of controlling a system, of only taking a full state observation when an external input event occurs, and then providing a control action. To create this event, it assumes part of the state is continuously observed, and often an event is defined when this part of the state (or a function of it) crosses a fixed threshold (e.g., a physical sensor crossing a threshold) [Vasyutynskyy and Kabitzsch, 2010]. This finds multiple uses in problems such as electronic analog control systems for audio and mobile telephone systems [Åström and Bernhardsson, 1999], and battery capacity optimization in wireless devices [Vasyutynskyy and Kabitzsch, 2010]. However, this is different from our proposed problem of continuous-time control whilst deciding when to take costly observations as (1) Event-based sampling assumes part of the state is continuously observed. Often in our environments, it is not feasible to observe part of the state continuously (e.g., imaging a cancer tumor volume or performing a medical test) or it is prohibitively expensive to continually observe at a high-frequency part of the state (similar to Continuous Planning approaches). (2) The event input (the time to take the next observation) is given as a control input to the agent. However, it is precisely the more difficult problem we tackle of coming up with an observing policy that decides when to take the next observation. (3) The event is often defined by a human in advance and is a function of the current partial state. 
General environments may not have partial state spaces that are predictive of when to observe next, such as a robotic navigation task in two dimensions and only continuously observing one dimension. 1Krueger et al. [2020], focus on the simpler problem of multi-armed bandits (MAB), where there is a cost to observe the reward. In the MAB setting, it is possible to formulate it as an RL PO-MDP with one state. The underlying static state (the fixed reward distributions of each bandit) is unknown and must be determined by taking successive costly observations (paying a cost to observe a reward). Therefore, Krueger et al. [2020] s statement that the optimal algorithm for their simple MAB setting is intractable applies to the problem of active observing in continuous-time control. Semi-MDP methods is another similar related field, where it extends the MDP problem to include options that define temporally abstracted action sequences [Sutton, 1998]. However, some distinct differences prevent using a semi-MDP to address our continuous-time control whilst deciding when to take costly observations problem. These differences include (1) Semi-MDP is still an underlying discrete MDP with discrete state transitions, whereas our problem formulation focuses on continuoustime systems that can handle continuous actions and states. (2) Semi-MDP formulation does not involve an observation cost, whereas our problem formulation does. LQR/LQG methods. Similar related methods from control exist, as Zhang et al. [2022] propose a discrete planning method that works only for linear quadratic regulator (LQR) systems and optimizes the observing frequency for a given total budget of observations. Moreover, Huang and Zhu [2020] addresses an infinite horizon discrete-time linear quadratic Gaussian (LQG) control problem with observation costs. This is notably different from the formulation proposed for the problem of continuous-time control with observation costs as (1) Huang and Zhu [2020] only applies to linear systems that are discrete in time and have an infinite time horizon; our formulation applies to nonlinear systems that are continuous in time and have a fixed time horizon. (2) Huang and Zhu [2020] assume their full system dynamics model is known; ours makes no such assumptions and only assumes access to an offline collected dataset of (possibly irregular in time) state-action trajectories to learn a dynamics model, which is more applicable to real-world environments. E MPC MPPI Pseudocode and Planner Implementation Details To plan an optimal action trajectory a, we specify that any model predictive controller (MPC) planner can be used that can optimize the action trajectory up to a specified time horizon H R+. We opt to use an MPC planner, as it optimizes an entire action trajectory a up to H, does not require the probabilistic dynamics model ˆfθ to be differentiable (it uses no gradients), and the reward function r can be changed on the fly allowing changing goals/tasks at run-time. Moreover, model-based RL with MPC planners has achieved comparable performance to model-free RL methods [Chua et al., 2018, Lutter et al., 2021]. We use our probabilistic dynamics model ˆfθ, with the MPC planner of Model Predictive Path Integral Control (MPPI), a zeroth-order particle-based trajectory optimizer [Williams et al., 2017], due to its competitive performance [Wang et al., 2022]. 
To optimize the action trajectory a up to a time horizon H, it discretizes H into smaller constant action time intervals δa ∈ R+, so that there are K = H/δa ∈ Z+ time steps in H, i.e., t(k) = ti + kδa, k ∈ {0, . . . , K − 1}. It forward simulates a number of parallel rollouts G ∈ Z+, where the next state estimate at each time step is simulated recursively as z(t(k+1)) = µθ(z(t(k)), a(t(k)), δa). This requires a state estimate z(ti) to plan from; therefore, we recompute the planned trajectory whenever we observe a new sample of the state. The planner optimizes the action trajectory by

a*(t(·)) = argmax_{a(t(·))} E_z [ Σ_{k=0}^{K−1} r(z(t(k)), a(t(k)), t(k)) ].

MPPI is also a Monte Carlo based sampler, and thus increasing the number of rollouts improves the action trajectory optimization; however, it scales the run-time complexity as O(GK). The standard MPPI algorithm [Williams et al., 2017] is used with our probabilistic dynamics model f̂θ. We articulate the MPPI pseudocode for the action trajectory optimizer in Algorithm 2.

Algorithm 2 MPPI-Trajectory-Optimization
Input: State observation z(ti), pre-trained probabilistic dynamics model f̂θ, reward function r, time horizon H, action time interval δa, number of parallel rollouts G, noise covariance Σ, hyperparameter λ, action max amax, action min amin.
  K ← H/δa
  R(G) ← 0(G)                          # holds the G trajectory returns
  A(G,K) ← 0(G,K)                      # holds the G action trajectories of length K
  A'(G,K) ← 0(G,K)                     # holds the G noise-perturbed action trajectories of length K
  ε(G,K) ← 0(G,K)                      # holds the generated scaled action noise
  for g = 0, . . . , G − 1:            # sample G trajectories over the horizon K
    z(t(0)) ← z(ti)
    for k = 0, . . . , K − 1:
      ε(g,k) ∼ N(0, Σ)                 # sample action noise
      A'(g,k) ← A(g,k) + ε(g,k)        # perturb action by noise
      A'(g,k) ← min(max(A'(g,k), −1), +1)   # clip the normalized perturbed action to bound actions to their limits
      ε(g,k) ← A'(g,k) − A(g,k)        # update noise after bounding, so we do not penalize clipped noise
    for k = 0, . . . , K − 1:
      z(t(k+1)) ∼ µθ(z(t(k)), A'(g,k) · amax, δa)   # sample next state from the pre-trained probabilistic dynamics model f̂θ
      R(g) ← R(g) + r(z(t(k+1)), A(g,k)) − λ A(g,k)ᵀ Σ⁻¹ ε(g,k)   # accumulate the current state reward
  κ ← min_g R(g)
  a(t(k)) ← ( A(k) + [ Σ_{g=0}^{G−1} exp((1/λ)(R(g) − κ)) ε(g,k) ] / [ Σ_{g=0}^{G−1} exp((1/λ)(R(g) − κ)) ] ) · amax, for k ∈ [0, K − 1]   # return-weighted trajectory update
Return: the optimized action trajectory a(t(·))

F Active Observing Control Planning Algorithm Pseudocode
The Active Observing Control planning algorithm pseudocode is outlined in Algorithm 3. In particular, δt is the continuous search (root-finding) tolerance that is used when binary searching for the continuous-time duration that the computed action trajectory can be followed for, i.e., the duration after which the standard deviation of the reward crosses the threshold τ. We note that all numerical root-finding algorithms generally involve a search tolerance or a similar stopping criterion. The δt tolerance ensures that the binary search stops its evaluations once a solution time is found that is close enough to the true value. Another way to view the search tolerance δt is as the numerical precision up to which the returned time is correct.

Algorithm 3 Active Observing Control Policy
Input: State observation z(ti), pre-trained probabilistic dynamics model f̂θ, reward uncertainty threshold τ, reward function r, time horizon H, search tolerance δt, action time interval δa.
  a ← MPC(f̂θ, z(ti), r, H)            # plan action trajectory
  for k = 0, . . . , K − 1:
    t(k) ← ti + kδa
    for p = 1, . . . , P:
      zp(t(k+1)) ∼ N( µθ(z(t(k)), a(t(k)), δa), σ²θ(z(t(k)), a(t(k)), δa) )
    if V_p[ r(zp(t(k+1)), a(t(k+1))) ] > τ:
      t(lower) ← t(k), t(upper) ← t(k+1)
      while t(upper) − t(lower) > δt:
        t(mid) ← (t(upper) + t(lower)) / 2
        for p = 1, . . . , P:
          zp(t(mid)) ∼ N( µθ(z(t(k)), a(t(k)), t(mid) − t(k)), σ²θ(z(t(k)), a(t(k)), t(mid) − t(k)) )
        if V_p[ r(zp(t(mid)), a(t(mid))) ] > τ:
          t(upper) ← t(mid)
        else:
          t(lower) ← t(mid)
      Return a(t) ∀ t ∈ [ti, t(mid))
Return a(t) ∀ t ∈ [ti, t(K))

G Active Observing Control Run-time Complexity
Active Observing Control's run-time complexity for Algorithm 3 is O(GK + P(K + W)), where W = ⌈(log(δa) − log(δt)) / log(2)⌉ ∈ Z+, as W is determined by a continuous binary line search on an interval of length δa down to a resolution of δt.

Proof. By definition, binary search halves the search interval δa every iteration; therefore
(1/2)^W δa < δt
(1/2)^W < δt / δa
−W log(2) < log(δt) − log(δa)
W > (log(δa) − log(δt)) / log(2)
W = ⌈(log(δa) − log(δt)) / log(2)⌉.

Empirically we chose P = 10G, W = 4, G = 10,000, K = 20 for all experiments; therefore the dominating scaling parameter is G, the number of particle rollouts. We observe in Table 6 that AOC takes approximately 2.4× longer to plan, which includes both planning the actions and when to observe, compared to Continuous Planning, which only plans actions (an MPC step). In the case of the Cancer environment, it can take fewer observations and therefore spend less time planning overall. Moreover, AOC is still practical in environments that require fast decision-making, e.g., robotics, as AOC can still produce a plan in less time than the planning action interval δa for the Cartpole environment, as shown in Table 7. Therefore, AOC is a practical method that can be utilized across different environments (planning cancer treatment plans with high accuracy, whilst being fast enough to control continuous robots). Furthermore, we note that MPC has demonstrated scaling to high-dimensional environments (Chua et al. [2018] with G = 20, Lutter et al. [2021] with G = 500).

Table 6: Normalized utilities U, rewards R, observations O, average time taken to plan T (Average), and total time spent planning in an episode T (Total) for the Cancer environment, using the same normalization as in Table 3. Even though Active Observing Control takes 2.4× longer to plan both the actions and the observing time T (Average) compared to Continuous Planning, which only plans the actions, AOC achieves less total planning time T (Total) (hence less total compute) because it takes fewer observations in total; here Continuous Planning takes 1.5× longer total planning time T (Total).
Cancer
Policy | U | R | O | T (Average) (ms) | T (Total) (ms)
Active Observing Control | 105 | 98.8 | 3.4 | 31.1 | 105
Continuous Planning | 100 | 100 | 13 | 12.8 | 166
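To complement Algorithm 3 and the run-time analysis above, the following is a minimal, self-contained Python sketch of the observing rule (it is not the released implementation): particles are sampled from a placeholder probabilistic dynamics model, the first planning step at which the variance of the predicted reward exceeds τ is located, and a binary search within that step, down to the tolerance δt, returns the continuous time at which to take the next observation. The dynamics and reward arguments are hypothetical stand-ins for the learned model f̂θ and the reward function r.

    import numpy as np


    def plan_next_observation_time(z0, actions, dynamics, reward, tau, dt_a, dt_tol, P=100, seed=0):
        # Sketch of the observing rule in Algorithm 3.
        # z0: current state estimate; actions[k]: action held on [t_i + k*dt_a, t_i + (k+1)*dt_a).
        # dynamics(z, a, dt) -> (mean, std) arrays for the state dt seconds later (stand-in for f_theta).
        # reward(z, a) -> scalar reward.
        # Returns the duration (measured from t_i) for which the planned action trajectory
        # can be followed before the reward-rollout variance crosses tau.
        rng = np.random.default_rng(seed)
        K = len(actions)

        def reward_variance(z_from, a_held, a_eval, dt):
            mean, std = dynamics(z_from, a_held, dt)
            particles = mean + std * rng.standard_normal((P, len(mean)))
            return np.var([reward(p, a_eval) for p in particles])

        z = np.asarray(z0, dtype=float)
        for k in range(K - 1):
            if reward_variance(z, actions[k], actions[k + 1], dt_a) > tau:
                lo, hi = k * dt_a, (k + 1) * dt_a          # bracket the crossing time
                while hi - lo > dt_tol:                     # continuous binary search
                    mid = 0.5 * (lo + hi)
                    if reward_variance(z, actions[k], actions[k], mid - k * dt_a) > tau:
                        hi = mid
                    else:
                        lo = mid
                return 0.5 * (lo + hi)                      # observe again after this duration
            z, _ = dynamics(z, actions[k], dt_a)            # advance the mean state estimate
        return K * dt_a                                     # uncertainty never crossed tau

As noted in Appendix J.2, the experiments additionally clip the returned interval to lie between δa and the planning horizon H.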
Table 7: Normalized utilities U, rewards R, observations O, average time taken to plan T (Average), and total time spent planning in an episode T (Total) for the Cartpole environment, using the same normalization as in Table 3. Even though Active Observing Control takes 1.7× longer to plan both the actions and the observing time T (Average) compared to Continuous Planning, which only plans the actions, both policies can be run in real time, as the action interval is δa = 100 ms for the Cartpole environment, making AOC a practical method.
Cartpole
Policy | U | R | O | T (Average) (ms) | T (Total) (ms)
Active Observing Control | 140 | 100 | 43.2 | 24.9 | 1073
Continuous Planning | 100 | 100 | 50 | 14.0 | 700

H Practical Guidelines to Select c
The observation cost c should be decided based on the real-world application at hand and should include any human preferences in that application, if applicable. In real-world systems (with resource constraints), this cost might correspond to actual monetary cost, computational expense, or energy consumption. Further, if this cost involves a human, for example a patient receiving medical treatment, they may have preferences for the treatment, its health impact (e.g., chemotherapy side effects), and/or the number, frequency, and timing of treatments, all of which could be included in the cost c. Also, there is often a trade-off between control performance and the number of observations taken, and one way to manage this trade-off is to tune c.

I Environment Selection and Details
In the following we discuss our reasoning for selecting the continuous-time environments. Principally, a continuous-time environment is defined by an underlying differential equation (DE) system (e.g., a physical or biological system), which allows the environment state trajectories to be sampled and simulated at any continuous time, unlike discrete environments [Yildiz et al., 2021, Brockman et al., 2016]. We used an ordinary differential equation solver [Virtanen et al., 2020] to simulate all environments, using an Euler solver at a time resolution of δsim as indicated by each environment's parameters. We selected the standard continuous-time control environments from the ODE-RL suite [Yildiz et al., 2021], which can be downloaded freely from https://github.com/cagatayyildiz/oderl and consists of three well-known environments: Pendulum, Cartpole, and Acrobot. Furthermore, as detailed in the following, we implemented a Cancer environment that uses a simulation of a Pharmacokinetic-Pharmacodynamic (PK-PD) model of lung cancer tumor growth [Geng et al., 2017] under continuous dosing treatment of chemotherapy and radiotherapy, where the same underlying model has also been used by others [Lim, 2018, Bica et al., 2020, Seedat et al., 2022]. We further adapt all the environments to have a fixed cost when taking a sample of the state. To learn a continuous-time dynamics model, it is standard to assume that the dataset of state-action trajectories is sampled irregularly in time [Yildiz et al., 2021, Chen et al., 2018]. First let us justify the reasoning behind our choice of observing irregularly-in-time state-action trajectories from these environments and why existing offline datasets are not suitable.

Why we cannot use an offline dataset of agent trajectories from a discrete-time environment. There exist standard discrete-time offline datasets [Fu et al., 2020] of state-action trajectories of agents interacting with discrete-time environments, where the state-action trajectories are sampled at regular time intervals, ∆_{i+1} = ∆. Let us hypothetically imagine that there exists such a regularly-sampled (∆_{i+1} = ∆) state-action trajectory offline dataset. Is it possible to sample the trajectories irregularly? Two approaches come to mind.
1) One could imagine using some form of interpolation (e.g., splines or similar) to interpolate to irregular time steps between states and actions; however, doing so would lead to errors in the sampled state-action trajectories in comparison to the true irregularly-sampled state-action trajectories at those non-uniform time points. These errors could compound once a dynamics model is trained on them; therefore, we highlight that this approach is unsuitable. 2) Another approach would be to start with a regularly-sampled state-action trajectory and then sub-sample state-action times from that regular grid of collected times. However, again this approach is unsuitable, as environments are often captured at run-time with a particular regular observation time interval ∆, and observing only discrete multiples of it (i.e., n∆, n ∈ Z+) would lead to gaps between observations, where the mean of the observation intervals would be larger than the environment's nominal run-time observation interval (∆_sub-sample > ∆_original trajectory). Therefore, we note that this becomes a different problem, as there is less information in the state-action trajectories with large observation interval gaps.

Offline dataset generation D. Rather, to mitigate both of the issues above (1, 2), we prefer to collect an offline dataset ourselves by observing irregularly-in-time state-action trajectories, where the time interval between the state-action time points is sampled from an exponential distribution, ∆_{i+1} ∼ Exp(·), with a mean of δa seconds, and state-action values are randomly sampled, collecting a total of 1e6 samples [Yildiz et al., 2021]. Furthermore, to increase realism, we add Gaussian noise to the observations taken, N(0, σ²_ε), with standard deviation σ = 0.01. In all environments the actions are continuous and bounded to a defined range [amin, amax]. Here we assume a given state s(t) is composed of the positions q and their respective velocities q̇, i.e., s(t) = {q(t), q̇(t)}. Furthermore, each environment uses a reward function given by the exponential of the negative distance from the current state to the goal state q*, whilst also penalizing the magnitude of the action, and we assume that we are given this reward function when planning, as we often know the desired goal state q* and our current state q. Therefore, the reward function for the environments has the following form:

r({q(t), q̇(t)}, a(t)) = exp( −||q(t) − q*||²_2 − b||q̇(t)||²_2 − v||a(t)||²_2 ),

where b and v are environment-specific constants [Yildiz et al., 2021]. Specifically, when we use our MPC planner we observe that it plans better without the exponential operator; therefore we remove it and use the following reward function throughout: r({q(t), q̇(t)}, a(t)) = −||q(t) − q*||²_2 − b||q̇(t)||²_2 − v||a(t)||²_2. Yildiz et al. [2021] set the environment parameters b, v to penalize large values and enforce exploration from trivial states, and we use their same values, which are tabulated in Table 8. For all environments we assume a fixed cost of c = 50 when observing a sample of the state.

Table 8: Environment specification parameters.
Base Environment | b | v | amax | s_Init | q*
Pendulum | 1e-2 | 1e-2 | [2] | [0.1, 0.1] | [0, L]
Cartpole | 1e-2 | 1e-2 | [3] | [0.05, 0.05, 0.05, 0.05] | [0, 0, L]
Acrobot | 1e-4 | 1e-2 | [4, 4] | [0.1, 0.1, 0.1, 0.1] | [0, 2L]
Cancer | 1e-3 | 1e-2 | [5, 2] | [1138, 0] | [0, 0]

I.1 ODE-RL Suite
The starting state, for all these tasks, is hanging down, and the goal is to swing up and stabilize the pole(s) upright [Yildiz et al., 2021] in each environment.
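As a concrete illustration of the offline dataset collection described above (observation intervals drawn from an exponential distribution with mean δa, randomly sampled bounded actions, and Gaussian observation noise with standard deviation 0.01), the sketch below shows one possible implementation; env_step is a hypothetical helper that integrates the environment ODE forward for a given duration and is not part of the released code.

    import numpy as np


    def collect_offline_dataset(env_step, s0, a_min, a_max, dt_mean,
                                noise_std=0.01, n_samples=1_000_000, seed=0):
        # Collect an irregularly-sampled offline dataset of (observation, action, interval) tuples.
        # env_step(s, a, dt) -> next state after holding action a for dt seconds
        # (a hypothetical stand-in for an Euler integration of the environment ODE).
        rng = np.random.default_rng(seed)
        s = np.asarray(s0, dtype=float)
        dataset = []
        for _ in range(n_samples):
            dt = rng.exponential(dt_mean)                        # irregular interval, mean delta_a
            a = rng.uniform(a_min, a_max)                        # random action within [a_min, a_max]
            s = np.asarray(env_step(s, a, dt), dtype=float)
            z = s + noise_std * rng.standard_normal(s.shape)     # noisy observation of the state
            dataset.append((z, np.atleast_1d(a), dt))
        return dataset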
Specifically, we set δa = 0.1 seconds, and simulate each ODE system at a time resolution of δsim = 0.01 seconds, up until the episode length of T = 5 seconds. The goal states for these ODE-RL suite environments are when the pole(s), each of length L are fully upright, such that their x, y co-ordinates of the tip of the pole reach the goal state. That is where q is: [0, L] for the Pendulum environment, [0, 0, L] for the Cartpole environment where the additional 0 is zero for the cart s x location and in Acrobot is [0, 2L] as there are two poles connected to each other. Furthermore, upon restarting the environment the initial state s is sampled from the uniform distribution of s0 U[ s Init, s Init] [Yildiz et al., 2021], then the θ states are added with set angle such that the pole(s) are pointing downwards (i.e., Cartpole θ = θInit + π). In the following we describe each of the ODE-RL environments introducing the environment with a screenshot figure. I.1.1 Cartpole (swing up) Environment Figure 7: Screen shots of the Cartpole environment. The task is to swing up a pole attached to a cart that can move horizontally along a rail. In the following we see: (a) the starting downward state with an additional small amount of perturbation, (b) the optimal trajectory solution found by a policy that scores R 100% including our Active Observing Control policy and (c) the final goal state that has been reached, that is, to swing up the pole and stabilize it upwards which is a challenging control task. We note that the control actuator is bounded and limited, and the force is such that the Cartpole cannot be directly swung up rather it must gain momentum through a swing and then stabilize this swing to not overshoot when stabilizing the pole upwards in the goal position, as indicated when the tip of the pole reaches the centre of the red target square. Furthermore, we note this environment is an underactuated system. We can see in Figure 7, an illustration of the starting state Figure 7 (a) with a small perturbed random initial start. Here a pole is attached to an un-actuated joint to a cart that moves along a frictionless track [Barto et al., 1983]. The pendulum starts in the downward position Figure 7 (a) and the goal is to swing the pendulum upwards and then balance the pole upright by applying forces to the left or right horizontal direction of the cart. This environment has the state of [x, x, θ, θ] and a corresponding observation of [x, x, cos(θ), sin(θ), θ], where θ ( π, π) is measured from the upward vertical of the pole. We note that this environment is an underactuated system, as it has two degrees of freedom [x, θ], however only the carts position is actuated, leaving θ indirectly controlled. Additionally, for this Cartpole environment only, created a more competitive Random policy than randomly executing actions, that of applying no actions to keep the pole in the same starting position. This achieves a significantly higher reward than a pure Random policy, as the Cartpole environment is x R unbounded and a Random policy leads to random drift that significantly moves away from the goal position during the episode length, i.e., |x| 0. I.1.2 Pendulum Environment We can see in Figure 8, an illustration of the starting state Figure 8 (a) with a small perturbed random initial start. Here a pole (pendulum) is attached to a fixed point at one end with the other end being free [Barto et al., 1983, Yildiz et al., 2021]. 
The pendulum starts in the downward position, Figure 8 (a), and the goal is to swing the pendulum upwards and then balance the pole upright by applying torques about the fixed point, as indicated in Figure 8 with a visualization showing the torque direction and magnitude based on the size of the arrow. This environment has the state [θ, θ̇] and a corresponding observation of [sin(θ), cos(θ), θ̇].

Figure 8: Screenshots of the Pendulum environment. The task is to swing up the pole (pendulum). In the following we see: (a) the starting downward state with an additional small amount of perturbation, (b) the optimal trajectory solution found by a policy that scores R ≈ 100%, including our Active Observing Control policy, and (c) the final goal state that has been reached, that is, to swing up the pole and stabilize it upwards. We note that the control actuator is bounded and limited, and the force is such that the Pendulum cannot be directly swung up; rather, it must gain momentum through a swing and then stabilize this swing so as not to overshoot when stabilizing the pole upwards in the goal position.

Figure 9: Screenshots of the Acrobot environment. The task is to swing up the 2-link pendulum. In the following we see: (a) the starting downward state with an additional small amount of perturbation, (b) the optimal trajectory solution found by a policy that scores R ≈ 100%, including our Active Observing Control policy, and (c) the final goal state that has been reached, that is, to swing up the 2-link pendulum and stabilize it upwards. We note that the control actuator is bounded and limited, and the force is such that the 2-link pendulum cannot be directly swung up; rather, it must gain momentum through a 2-link swing and then stabilize this swing so as not to overshoot when stabilizing the 2-link pendulum upwards in the goal position.

I.1.3 Acrobot Environment
We can see in Figure 9 an illustration of the starting state, Figure 9 (a), with a small perturbed random initial start. It is a 2-link pendulum with the individual joints actuated [Brockman et al., 2016]. The 2-link pole starts in the downward position, Figure 9 (a), and the goal is to swing the 2-link pendulum upwards and then balance the pole(s) upright by applying torques about their fixed points. This environment has the state [θ1, θ̇1, θ2, θ̇2] and a corresponding observation of [sin(θ1), cos(θ1), θ̇1, sin(θ2), cos(θ2), θ̇2]. Here the Acrobot environment is fully actuated, as no method has been able to solve the underactuated balancing problem [Yildiz et al., 2021, Zhong et al., 2019].

I.2 Cancer Environment
We implemented a Cancer environment that uses a simulation of a Pharmacokinetic-Pharmacodynamic (PK-PD) model (a bio-mathematical model that represents dose-response relationships) of lung cancer tumor growth [Geng et al., 2017] under continuous dosing treatment of chemotherapy and radiotherapy, where the same underlying model has also been used by others [Lim, 2018, Bica et al., 2020, Seedat et al., 2022]. Here, the tumor volume V(t) at time t after diagnosis is governed by

dV(t)/dt = ( ρ log(K / V(t)) [tumor growth] − βc C(t) [chemotherapy] − (αr d(t) + βr d(t)²) [radiotherapy] ) V(t),
dC(t)/dt = −C(t)/2 + c(t),

where the state space is [V(t) ∈ R+, C(t) ∈ R+], with V(t) being the cancer volume in cm³ and C(t) being the chemotherapy concentration in the patient, and the action space inputs are [c(t) ∈ [0, 5], d(t) ∈ [0, 2]], the chemotherapy input dose c(t) and radiotherapy dose d(t), which are given by the policy.
Specifically, ρ, K, αr, βr, βc are effect parameters [Geng et al., 2017], and are defined as in Table 9. Table 9: PK-PD model parameters for the Cancer environment. Model Variable Parameter Value Tumor growth Growth parameter ρ 1.45 10 2 Carrying capacity K 30.0 Radiotherapy Radio cell kill (α) αr 0.0398 Radio cell kill (β) βr Set s.t. α/β = 10 Chemotherapy Chemo cell kill βc 0.028 As detailed above, the chemotherapy drug concentration state C(t) follows an exponential decay relationship with a half-life of one day, and c(t) represents a dose of Vinblastine up to the concentration of 5.0mg/m3 per time in a day. Whereas the radiotherapy concentration d(t) represents 2.0Gy fractions of radiotherapy, where Gy is the Gray ionizing radiation dose. Particularly, we set δa = .4 days, and simulate each ODE system at a time resolution of δsim = 0.04 days, up to an episode length of T = 25 days. The goal state here is to reduce the cancer tumor volume to zero, alongside zero chemotherapy concentration in the patient, i.e. q = [0, 0]. Moreover, upon restarting the environment the initial state s is sampled from the uniform distribution of s0 U[[1120, 0], [1138, 0]]. Here we assume the patient starts with stage four cancer, the largest stage, with a tumor diameter of approximately 13cm diameter, assuming that the tumor is spherical. Furthermore, we use a larger growth parameter ρ than the nominal one reported by Geng et al. [2017] to model aggressive cancer tumors which is of high interest. J Benchmark Method Implementation Details All benchmark policies consist of an observing policy and an action policy. To ensure competitive benchmarks, each of the following benchmarks are specific ablations of our method, Active Observing Control. Therefore, for a given trained probabilistic dynamics model, the same action policy planner was used across all the benchmark methods, that of the same MPC MPPI planner with the hyperparameters fixed to the below values. J.1 Probabilistic Dynamics Model To train our continuous-time and discrete-time probabilistic dynamics model we use a deep ensemble of fully connected multi-layer perceptions [Lakshminarayanan et al., 2017], adapted for multidimensional inputs [Chua et al., 2018]. Specifically, for each individual model in the ensemble, we use a 3-layer multilayer perceptron (MLP) of 256 units, with tanh activation functions. We also use the negative log-likelihood loss to train each model in the ensemble separately training each model for the same number of epochs with a different random seed, where the ensemble has M = 5 total models. We use Xavier weight initialization and output the log variance, following the setup by Chua et al. [2018]. All dynamics models are implemented in Py Torch [Paszke et al., 2019a], and trained with an Adam optimizer [Kingma and Ba, 2017] with a learning rate of 1e-4. For each environment we train two dynamics models on the offline dataset, that of a continuous-time probabilistic model using the time delta input δ, and a discrete-time probabilistic model which is the exact same model architecture without the additional time delta input δ. All dynamics models predict the next state difference, z(t + δ), and we construct the state as z(t + δ) = z(t) + z(t + δ) following the standard suggestion of [Deisenroth and Rasmussen, 2011]. 
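To make the dynamics model architecture described above concrete, the following is a minimal PyTorch sketch of a single ensemble member (the full five-member ensemble, Xavier initialization, training loop, and standardization step of the released implementation are omitted): an MLP with three hidden layers of 256 tanh units that takes the state, action, and time delta δ and outputs a mean and log-variance over the next-state difference, with the next state reconstructed as z(t + δ) = z(t) + ∆z(t + δ).

    import torch
    import torch.nn as nn


    class DynamicsMember(nn.Module):
        # One member of the deep ensemble: predicts a Gaussian over the next-state
        # difference given (state, action, time delta). A sketch of the architecture
        # described in Appendix J.1; delta is a tensor with a trailing dimension of 1.
        def __init__(self, state_dim, action_dim):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim + action_dim + 1, 256), nn.Tanh(),
                nn.Linear(256, 256), nn.Tanh(),
                nn.Linear(256, 256), nn.Tanh(),
            )
            self.mean_head = nn.Linear(256, state_dim)
            self.log_var_head = nn.Linear(256, state_dim)

        def forward(self, z, a, delta):
            h = self.net(torch.cat([z, a, delta], dim=-1))
            return self.mean_head(h), self.log_var_head(h)


    def gaussian_nll(mean, log_var, target):
        # Negative log-likelihood loss (up to a constant) used to train each member.
        return 0.5 * (log_var + (target - mean) ** 2 / log_var.exp()).sum(-1).mean()


    def predict_next_state(model, z, a, delta):
        # z(t + delta) = z(t) + predicted state difference; also return the predictive std.
        d_mean, d_log_var = model(z, a, delta)
        return z + d_mean, d_log_var.exp().sqrt()

The discrete-time variant used by the Discrete Planning and Discrete Monitoring baselines is, as described in Appendix J.2, the same architecture without the δ input.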
For training, using the whole collected dataset we pre-process this by a standardization step, to make each dimension of the samples have zero mean and unit variance (by taking away the mean for each dimension and then dividing by the standard deviation for each dimension) we also use this step during run-time for each dynamics model. Furthermore, we train all the baseline models on all the samples collected in the offline dataset (all samples are training data) Lakshminarayanan et al. [2017]. Specifically, we train all the dynamics models on a given offline dataset by training each model until convergence, each for at least 12 hours. J.2 Observing Policies We implement two discrete-time methods. First by learning a discrete-time uncertainty-aware dynamics model (an ablation of the exact same model and hyperparameters, without the time interval input δ to predict the next state for) on the same offline dataset D. Second, we use this discrete-time uncertainty-aware dynamics model to create two benchmark methods, that of Discrete Planning that samples the state at each action time interval δa and Discrete Monitoring a discrete ablation of our observing policy (Algorithm 3) that uses the same reward distribution estimation and determines the discrete action sequence to execute up until the reward uncertainty crosses the threshold τ at a discrete-time resolution of δa. Moreover, we also benchmark against Continuous Planning that uses our trained continuous-time uncertainty-aware dynamics model and takes a sample of the state at regular time intervals of a multiple of δa. Finally, we also compare with a random action policy, Random that samples the state at each action interval δa. For each policy method, we tune the hyperparameter τ is one exists following the procedure outlined in Section 4.2. We then keep this fixed and constant throughout all run-time experiments for each policy method for each environment unless explicitly noted that we modify it, as in Section 5.3. For all policies we limit the minimum i+1 to be δa (as actions take time to execute and interact with the environment, therefore we want to avoid observing the same state again immediately, for fast-moving states), and the maximum H the MPC planned action horizon. J.3 MPPI Implementation We use the MPPI algorithm, with pseudocode and is further described in Appendix E. Specifically, as is recommended by Lutter et al. [2021] we optimized the MPPI hyperparameters through a grid search with the continuous-time probabilistic dynamics model for a single environment setting, that of the Cartpole environment, and fix these for planning with all the learned dynamics models throughout. Particularly, our final optimized hyperparameter combination is N = 20, M = 1, 000, λ = 0.01, σ = 1.0. Where Σ the MPPI action noise is defined as: [σ2] if d A = 1 σ2 0.5σ2 if d A = 2 (56) Where the Cartpole and Pendulum environments have d A = 1 and Acrobot and Cancer environments has d A = 2. These hyperparameters were found by searching over a grid of possible values, which is detailed in Table 10. Table 10: MPPI hyperparameter grid search sweep values. 
Hyperparameter Grid values searched over K {1, 2, 4, 8, 16, 20, 40, 50, 60, 70, 80, 90, 100, 128, 256, 512, 1024, 2048} G {1, 2, 4, 8, 16, 20, 40, 50, 60, 70, 80, 90, 100, 128, 256, 500, 1000, 2000, 4000, 8000} λ {0.00001, 0.0001, 0.001, 0.01, 0.1, 0.5, 0.8, 1.0, 1.5, 2.0, 10.0, 100.0, 1000.0} σ {0.00001, 0.0001, 0.001, 0.01, 0.1, 0.5, 0.8, 1.0, 1.5, 2.0, 10.0, 100.0, 1000.0} K Evaluation Metrics An uncertainty-aware dynamics model is trained for each environment on an offline dataset D of collected trajectories consisting of 1e6 samples. The trained dynamics model weights are then frozen when used at run-time. We record the average undiscounted utility U (Equation (1)), average undiscounted reward R and the number of observations taken O, after running one episode of the benchmark method and repeat this for 100 random seed runs for each result with their 95% confidence intervals throughout. We use a fixed observation cost of c = 50 throughout. Moreover, we normalize R and U following the standard normalization of offline-RL [Yu et al., 2020] normalized to the interval of 0 to 100, where a score of 0 corresponds to a random policy performance, and 100 to an expert which we assume here is the Continuous Planning benchmark. Furthermore, we also track the metric of total planning time taken to plan the next action and time to schedule a sample and perform all experiments using a single Intel Core i9-12900K CPU @ 3.20GHz, 64GB RAM with a Nvidia RTX3090 GPU 24GB. L Additional Experiments L.1 Ablation of Thresholding the State Uncertainty Intuitively we threshold the variance of the value function rather than the variance of the state as it is possible to achieve a task with a very certain value despite having uncertain states. For example, when there exist multiple ways to achieve a task and we know that multiple action trajectories guarantee this where we can take any; however, we are uncertain about which one. We performed an additional ablation experiment where we threshold the variance of state instead and compare the utility against Active Observing Control that thresholds the variance of the predicted reward instead. Table 11: Ablation experiment of using a threshold on the variance of state instead of the predicted reward, for the Cancer environment. Policy Utility U Active Observing Control (Reward Uncertainty Threshold) 106 0.789 Active Observing Control (State Uncertainty Threshold) 103 1.6 For each policy, we tuned the threshold following the tuning setup, in Section 4.2. Here we evaluated the utility metric over 10 random seeds for each policy on the Cancer environment. As tabulated in Table 11, it is possible to threshold on the state variance for an acceptable performing policy, however, it is preferable and more intuitive to threshold the variance of the predicted reward uncertainty. Furthermore, we performed an additional ablation experiment where we threshold the variance of the state instead and compare this to the utility of Active Observing Control that thresholds the variance of the predicted reward instead, on the Cartpole environment, as tabulated in Table 12. We again tuned each observing policy s threshold following the tuning setup in Section 4.2. Here we evaluated the utility metric over 100 random seeds. Empirically we observe that is indeed possible to threshold the state variance for an acceptable performing observing policy. However, it is again preferable and more intuitive to threshold the variance of the predicted reward uncertainty. 
We also highlight the Cartpole environment is notably more complicated than the Cancer environment, as it is an underactuated system, as it has two degrees of freedom [x, θ], however only the carts Table 12: Ablation experiment of using a threshold on the variance of state instead of the predicted reward, for the Cartpole environment. Policy Utility U Active Observing Control (Reward Uncertainty Threshold) 144 5.51 Active Observing Control (State Uncertainty Threshold) 98.4 2.64 position is actuated, leaving θ indirectly controlled. We note that the control actuator is bounded and limited, and the force is such that the Cartpole cannot be directly swung up rather it must gain momentum through a swing and then stabilize this swing to not overshoot when stabilizing the pole upwards in the goal position, as indicated when the tip of the pole reaches the center of the red target square. L.2 Sparse Reward Environment To investigate how well Active Observing Control generalizes to other environments that have possibly sparse reward functions, we performed an additional sparse reward experiment. We implemented a sparse Pendulum environment [Chakraborty et al., 2022] which only provides a reward when the Pendulum state angle θ is within θCut Off degrees from the goal state upright (θ = 0). That is using a sparse reward function defined by r (s, a, t) = r(s, a, t) 1{|θ| < θCut Off}. This allows us to vary the degree of sparsity by decreasing θCut Off to create a sparser reward for the Pendulum environment. This is tabulated in Table 13, with each result averaged over 100 random seeds, and we normalize scores to 100 for Continuous Planning with θCut Off = 180 . Table 13: Sparse reward Pendulum environment. θCut Off Policy Utility U Reward R Observations O θCut Off = 180 Random 0 0 0 0 13 0 θCut Off = 180 Continuous Planning 100 9.48 100 9.48 50 0 θCut Off = 60 Continuous Planning 18.8 17.7 18.8 17.7 50 0 θCut Off = 5 Continuous Planning -7.81 0.00125 -7.81 0.00125 50 0 θCut Off = 180 Active Observing Control 109 13.8 98.5 10.2 48 0.954 θCut Off = 60 Active Observing Control 86.7 13.5 12 16.5 35.4 4.08 θCut Off = 5 Active Observing Control 223 2.32 -7.81 0.00146 4.8 0.452 We note that Active Observing Control relies on MPC being feasible in the environment and that it is possible to solve the environment with the chosen MPC decision policy π with regular observing at each time step. In our experimental setup, this translates that the baseline of Continuous Planning should be able to solve the environment to achieve an optimal reward (i.e., R = 100). The above results demonstrate the well-known [Karnchanachari et al., 2020] limitation of MPC in general, that a vanilla MPC decision policy π struggles in environments with sparse rewards. We highlight that this is not unique to MPC, and standard RL methods in general significantly struggle with sparse rewards in continuous-state and continuous-action tasks [Chakraborty et al., 2022] without additional assumptions or data (reward shaping or using expert demonstrations). This arises due to the limited receding planning horizon time H, however, there exists works to combine trajectory optimization (MPC) with learning a reward estimation [Karnchanachari et al., 2020, Hoeller et al., 2020]. This is an interesting direction for future work, though we believe this is out of scope for this paper. 
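The sparse reward used in the experiment above is a simple masking of the dense Pendulum reward; a minimal sketch is given below, where theta is measured in radians from the upright goal and dense_r is a stand-in for the dense reward value r(s, a, t).

    import numpy as np


    def sparse_pendulum_reward(dense_r, theta, theta_cutoff_deg):
        # r'(s, a, t) = r(s, a, t) * 1{|theta| < theta_cutoff}
        return dense_r * float(abs(theta) < np.deg2rad(theta_cutoff_deg))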
L.3 Increasing the Planning Resolution of MPC (decreasing δa)

We note that planning with a higher-resolution MPC planner (smaller δa) for all the baselines will make the baselines perform on par with Active Observing Control for that δa. However, our motivation is to compare the observing policies ρ of each baseline; therefore, in all the implemented baselines we use the same decision policy π, i.e., the same MPC algorithm with exactly the same hyperparameters. We performed an additional re-run of the baselines on the Cancer environment, with varying δa ∈ {0.1, 0.05, 0.02, 0.01} and each result averaged over 100 random seeds, as tabulated in Table 14. We kept the MPC planning horizon time H fixed to the same value as in our original experiments, and let the number of planning time steps K vary accordingly, i.e., K = H/δa.

Table 14: Increasing the planning resolution of MPC (decreasing δa) for the Cancer environment.

Discretization interval δa    Policy                      Utility U       Reward R        Observations O
0.1                           Random                      0 ± 0           0 ± 0           13 ± 0
0.1                           Discrete Planning           92.2 ± 1.11     92.2 ± 1.11     13 ± 0
0.1                           Discrete Monitoring         91.1 ± 1.72     85.9 ± 1.7      4.94 ± 0.109
0.1                           Continuous Planning         100 ± 0.597     100 ± 0.597     13 ± 0
0.1                           Active Observing Control    105 ± 0.75      98.4 ± 0.707    3.42 ± 0.102
0.05                          Random                      0 ± 0           0 ± 0           13 ± 0
0.05                          Discrete Planning           96 ± 1.54       95.6 ± 1.68     13 ± 0
0.05                          Discrete Monitoring         96.3 ± 1.39     91.6 ± 1.47     7.18 ± 0.153
0.05                          Continuous Planning         100 ± 1.36      100 ± 1.48      13 ± 0
0.05                          Active Observing Control    104 ± 1.29      98.1 ± 1.33     4.95 ± 0.118
0.02                          Random                      0 ± 0           0 ± 0           13 ± 0
0.02                          Discrete Planning           95 ± 1.25       92.8 ± 1.81     13 ± 0
0.02                          Discrete Monitoring         95.9 ± 1.17     88.2 ± 1.7      6.5 ± 0.196
0.02                          Continuous Planning         100 ± 1.17      100 ± 1.7       13 ± 0
0.02                          Active Observing Control    98.6 ± 0.967    88.8 ± 1.43     2.77 ± 0.0839
0.01                          Random                      0 ± 0           0 ± 0           13 ± 0
0.01                          Discrete Planning           102 ± 0.806     105 ± 1.87      13 ± 0
0.01                          Discrete Monitoring         102 ± 0.844     98 ± 1.93       5.89 ± 0.0741
0.01                          Continuous Planning         100 ± 1.03      100 ± 2.39      13 ± 0
0.01                          Active Observing Control    103 ± 0.937     97.8 ± 2.06     4.79 ± 0.117

Here, for each δa, we normalize utility to be between 0 and 100, where 100 corresponds to Continuous Planning at that δa and 0 to a random policy. We note that δa = 0.1 was used in the main experiments, as this is the default value for the ODE-RL environments [Yildiz et al., 2021]. Moreover, in light of the empirical results (Table 14), we can also ask what the performance looks like if we instead normalize utility only to Continuous Planning with δa = 0.1 (the best possible reward) for each baseline whilst varying δa, as tabulated in Table 15. Interestingly, we observe all baselines degrading in performance; however, AOC still achieves the highest utility amongst the baselines for a given δa. Intuitively, the MPC MPPI algorithm produces less optimal action trajectory plans a as the number of discrete actions K in the action sequence increases for a fixed number of parallel rollouts G. We note that this is inherent to using an MPC planner [Williams et al., 2017], and could be mitigated by increasing G as δa decreases; however, that is out of scope for this work.

L.4 Discrete Time Controller Trained on an Equidistant Dataset

The Continuous Planning baseline (which uses a continuous-time dynamics model) can perform better than the Discrete Planning baseline (which uses a discrete-time dynamics model), as it has a more accurate dynamics model when both are trained on the irregularly sampled offline dataset D = {(z(tᵢ), a(tᵢ), ∆ᵢ)}, i = 1, …, N, where the times between consecutive state-action samples are irregular, ∆ᵢ₊₁ ∼ Exp(·), with a mean of δa seconds.
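As a concrete illustration of the irregular sampling scheme just described and the equidistant alternative tested next, the sketch below generates sample times with exponentially distributed inter-sample gaps of mean δa, alongside a regular grid with gap δa; this is an assumption-level sketch, not the dataset-generation code from our release.

```python
import numpy as np

def irregular_sample_times(horizon, delta_a, rng):
    """Sample times with exponentially distributed gaps, mean gap = delta_a."""
    times, t = [0.0], 0.0
    while True:
        t += rng.exponential(scale=delta_a)  # E[gap] = delta_a seconds
        if t > horizon:
            break
        times.append(t)
    return np.array(times)

def equidistant_sample_times(horizon, delta_a):
    """Regularly spaced sample times with a fixed gap of delta_a."""
    return np.arange(0.0, horizon + 1e-9, delta_a)

rng = np.random.default_rng(0)
print(irregular_sample_times(horizon=1.0, delta_a=0.1, rng=rng))
print(equidistant_sample_times(horizon=1.0, delta_a=0.1))
```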
To test what would happen if the discrete-time dynamics model (used by the Discrete Planning and Discrete Monitoring baselines) were instead trained on a regularly (equidistantly) sampled offline dataset D, i.e., ∆ᵢ₊₁ = δa, we performed an additional experiment in the Cancer environment, with each result averaged over 100 random seeds, as tabulated in Table 16. Empirically, this agrees with our intuition that the discrete-time baselines do indeed perform better when trained on a regularly sampled offline dataset. However, the discrete-time baselines still underperform Active Observing Control, due to their inherent time discretization error (Figure 6).

Table 15: Increasing the planning resolution of MPC (decreasing δa) for the Cancer environment. Here we normalize utility only to Continuous Planning with δa = 0.1 (the best possible reward) for each baseline whilst varying δa.

Discretization interval δa    Policy                      Utility U       Reward R        Observations O
0.1                           Random                      0 ± 0           0 ± 0           13 ± 0
0.1                           Discrete Planning           92.2 ± 1.11     92.2 ± 1.11     13 ± 0
0.1                           Discrete Monitoring         91.1 ± 1.72     85.9 ± 1.7      4.94 ± 0.109
0.1                           Continuous Planning         100 ± 0.597     100 ± 0.597     13 ± 0
0.1                           Active Observing Control    105 ± 0.75      98.4 ± 0.707    3.42 ± 0.102
0.05                          Random                      0 ± 0           0 ± 0           13 ± 0
0.05                          Discrete Planning           85.6 ± 1.44     85.6 ± 1.44     13 ± 0
0.05                          Discrete Monitoring         85.9 ± 1.3      82.1 ± 1.26     7.18 ± 0.153
0.05                          Continuous Planning         89.3 ± 1.27     89.3 ± 1.27     13 ± 0
0.05                          Active Observing Control    92.9 ± 1.2      87.7 ± 1.14     4.95 ± 0.118
0.02                          Random                      0 ± 0           0 ± 0           13 ± 0
0.02                          Discrete Planning           71.8 ± 1.3      71.8 ± 1.3      13 ± 0
0.02                          Discrete Monitoring         72.8 ± 1.22     68.6 ± 1.23     6.5 ± 0.196
0.02                          Continuous Planning         77.1 ± 1.22     77.1 ± 1.22     13 ± 0
0.02                          Active Observing Control    75.6 ± 1.01     69 ± 1.03       2.77 ± 0.0839
0.01                          Random                      0 ± 0           0 ± 0           13 ± 0
0.01                          Discrete Planning           63.2 ± 1.03     63.2 ± 1.03     13 ± 0
0.01                          Discrete Monitoring         63.8 ± 1.07     59.2 ± 1.06     5.89 ± 0.0741
0.01                          Continuous Planning         60.3 ± 1.32     60.3 ± 1.32     13 ± 0
0.01                          Active Observing Control    64.4 ± 1.19     59.1 ± 1.13     4.79 ± 0.117

Table 16: Results for the Cancer environment where the discrete-time dynamics model is trained on a regularly (equidistantly) sampled offline dataset D.

∆ᵢ₊₁ Distribution                       Policy                      Utility U       Reward R        Observations O
Irregularly sampled, ∆ᵢ₊₁ ∼ Exp(·)      Random                      0 ± 0           0 ± 0           13 ± 0
Regularly sampled, ∆ᵢ₊₁ = δa            Discrete Planning           95.5 ± 0.667    95.5 ± 0.667    13 ± 0
Regularly sampled, ∆ᵢ₊₁ = δa            Discrete Monitoring         96.1 ± 1.83     90.1 ± 1.83     3.75 ± 0.127
Irregularly sampled, ∆ᵢ₊₁ ∼ Exp(·)      Continuous Planning         100 ± 0.594     100 ± 0.594     13 ± 0
Irregularly sampled, ∆ᵢ₊₁ ∼ Exp(·)      Active Observing Control    105 ± 0.755     98.5 ± 0.713    3.4 ± 0.102

L.5 Sensitivity of AOC to δa

δa is often chosen for the user, as it is the average time interval between two state-action points in a trajectory in the collected offline dataset D. Here, MPC depends on the given δa; however, the observing policy is orthogonal to δa. As outlined in Appendix L.3, reducing δa whilst keeping the planning horizon H and the number of parallel rollouts G the same reduces the quality of the MPC MPPI action trajectory, leading to control performance that achieves a lower reward and hence a lower utility. We performed an additional experiment to verify this on the Cancer environment, varying δa ∈ {0.1, 0.05, 0.02, 0.01}, with each result averaged over 100 random seeds, as tabulated in Table 17. Here we normalize utility to Continuous Planning with δa = 0.1 (the best possible reward) for each baseline whilst varying δa.
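For clarity, here is a minimal sketch of the score normalization used throughout these tables (0 for the Random policy, 100 for the chosen reference policy, e.g., Continuous Planning at δa = 0.1), following the offline-RL convention of Yu et al. [2020]; the raw scores below are illustrative stand-ins, and the exact implementation in our codebase may differ.

```python
def normalize_score(raw, raw_random, raw_expert):
    """Map a raw utility/reward so that the Random policy scores 0 and the
    reference (expert) policy scores 100."""
    return 100.0 * (raw - raw_random) / (raw_expert - raw_random)

# Illustrative usage: a policy scoring between Random and the expert.
print(normalize_score(raw=7.5, raw_random=2.0, raw_expert=12.0))  # 55.0
```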
Table 17: Results for the Cancer environment where we vary δa for AOC.

Discretization interval δa    Policy                      Utility U       Reward R        Observations O
0.1                           Random                      0 ± 0           0 ± 0           13 ± 0
0.1                           Continuous Planning         100 ± 0.597     100 ± 0.597     13 ± 0
0.1                           Active Observing Control    105 ± 0.75      98.4 ± 0.707    3.42 ± 0.102
0.05                          Active Observing Control    92.9 ± 1.2      87.7 ± 1.14     4.95 ± 0.118
0.02                          Active Observing Control    75.6 ± 1.01     69 ± 1.03       2.77 ± 0.0839
0.01                          Active Observing Control    64.4 ± 1.19     59.1 ± 1.13     4.79 ± 0.117

L.6 Evaluation of Reward Uncertainty

The reliability of the reward uncertainty depends on the reliability of the probabilistic dynamics model f̂θ in providing a good predictive uncertainty for a future state. We highlight that the predictive uncertainty of the dynamics model depends only on its training on the offline dataset of trajectories D, which is performed once. At run-time the dynamics model is used only for inference; therefore its predictive uncertainty over the state-action space remains the same, independent of the number of observations of the state taken in a given evaluation episode. To illustrate this point, we performed an additional experiment on the Cancer environment where, at run-time, we vary the number of observations (samples of the state) taken in an episode and evaluate the negative log-likelihood (NLL) [Lakshminarayanan et al., 2017] of the predicted reward uncertainty against the ground-truth reward, over 100 random seeds, as tabulated in Table 18.

Table 18: Evaluation of the reward uncertainty of AOC on the Cancer environment.

State Observations    NLL
1                     8.261 ± 1.124
10                    8.138 ± 1.060
100                   8.190 ± 1.052
1,000                 8.241 ± 0.973
10,000                7.933 ± 0.288
100,000               7.988 ± 0.093

L.7 Relation Between Rewards R and the Fixed Observation Cost c

An exact theoretical relation between the rewards R and the fixed observation cost c is intractable to provide in the general case; we encounter this intractability even for the simple analytic system used to prove Proposition 2.1. Starting from Equation (1), U = ∫₀ᵀ r(s(t), a(t), t) dt − c |{tᵢ : tᵢ ∈ [0, T]}|, the reward is R = ∫₀ᵀ r(s(t), a(t), t) dt and the cost is C = c |{tᵢ : tᵢ ∈ [0, T]}|. Even for a simple regular observing policy (Appendix C), the expected utility E_{ρ(reg,δ)}[U] is a function of δ ∈ (0, T), as well as of the reward function r(s(t), a(t), t) and the observation cost c. If we keep r, π, and c fixed and attempt to find ρ* = argmax_ρ E[U] by varying only δ, that is by solving d/dδ E[U] = 0, calculating a closed-form expression for d/dδ E[U] is intractable. Instead, we can empirically simulate different values of δ and see which grid-sweep value of δ maximizes E[U], such that approximately δ* = argmax_δ E[U]; this value can then be used to calculate the reward R = ∫₀ᵀ r(s(t), a(t), t) dt. We do this below on the Cancer environment, where we use a regular observing policy, that of Continuous Planning, and vary δ over a grid sweep such that different run settings of the policy take different numbers of observations, as tabulated in Table 19. We start by setting c = 1 and normalize only the reward R to 100 for the most frequent observation setting possible (O = 13), with each result calculated over 100 random seeds.

Table 19: Cancer environment where we use a regular observing policy, that of Continuous Planning, and vary δ over a grid sweep such that different run settings of the policy take different numbers of observations.

Observations O    Reward R        Utility U
13 ± 0            100 ± 0.616     87
7 ± 0             99.3 ± 0.656    92.3
4 ± 0             96 ± 0.821      92
2 ± 0             90.7 ± 1.01     88.7
1 ± 0             89.2 ± 1.1      88.2

For a given cost c, i.e., c = 1 in this case, we see that there exists an approximate number of observations that maximizes the utility; here, for c = 1, this is O = 7.
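The grid sweep described above reduces, for a regular observing policy with a fixed observation cost c, to evaluating U = R − c·O for each candidate observation budget and keeping the maximizer. A minimal sketch follows, using the Table 19 reward values as illustrative inputs rather than re-running the simulator.

```python
# Grid sweep over regular observing frequencies: pick the observation count O
# that maximizes the expected utility U = R - c * O for a fixed observation cost c.
candidates = {  # observations O -> empirical mean reward R (Table 19)
    13: 100.0,
    7: 99.3,
    4: 96.0,
    2: 90.7,
    1: 89.2,
}

def best_observation_count(candidates, c):
    utilities = {O: R - c * O for O, R in candidates.items()}
    O_star = max(utilities, key=utilities.get)
    return O_star, utilities[O_star]

for c in [0, 1, 2, 3]:
    O_star, U_star = best_observation_count(candidates, c)
    print(f"c = {c}: O* = {O_star}, U* = {U_star:.1f}")
```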
This allows us to vary the observation cost c numerically and, for each observation cost, determine the number of observations (or regular observing frequency) that maximizes the utility U, O* = argmax_O E[U], and report the corresponding reward R. This is tabulated in Table 20.

Table 20: Cancer environment where we vary the observation cost c numerically and, for each observation cost, determine the number of observations (or regular observing frequency) that maximizes the utility U, O* = argmax_O E[U], and report the corresponding reward R.

Fixed Observation Cost c    Optimal Utility U    Optimal number of Observations O    Reward R
0                           100                  13                                  100
1                           92.3                 7                                   99.3 ± 0.656
2                           88                   4                                   96 ± 0.821
3                           86.2                 1                                   89.2 ± 1.1

Empirically, we observe that the fixed observation cost c is negatively correlated with the reward R when using a policy that maximizes the utility U (Equation (1)).

L.8 Dependence on the Accuracy of the Learned Dynamics Model

To further understand the dependence on the accuracy of the learned dynamics model, we performed an additional ablation experiment, where we benchmarked all methods with a less accurate dynamics model by training all the dynamics models with fewer samples (10% of the total number of samples used for the dynamics models presented in the main paper), i.e., trained on 100,000 samples. This is tabulated in Table 21. We observe that AOC still achieves a high average utility U in all environments, outperforming the competing Continuous Planning and Discrete Monitoring methods. Specifically, this further empirically verifies our key theoretical contribution that regular observing is not optimal and that irregular observing can achieve a higher expected utility.

Table 21: Ablation with a less accurate dynamics model, training all the dynamics models with fewer samples (10% of the total number of samples used for the dynamics models presented in the main paper), i.e., trained on 100,000 samples. We observe that AOC still achieves a high average utility U in all environments, outperforming the competing Continuous Planning and Discrete Monitoring methods. Normalized utilities U, rewards R, and observations O for the benchmark methods, across each environment. Results are averaged over 1,000 random seeds, with ± indicating 95% confidence intervals. Utilities and rewards are undiscounted and normalized to be between 0 and 100, where 0 corresponds to a Random agent and 100 corresponds to the expert, that of Continuous Planning, taking state observations at every δa.

Cancer
Policy                      Utility U        Reward R         Observations O
Random                      0 ± 0            0 ± 0            13 ± 0
Discrete Planning           99.3 ± 0.701     99.3 ± 0.701     13 ± 0
Discrete Monitoring         74.4 ± 2.76      69.1 ± 2.73      4.51 ± 0.161
Continuous Planning         100 ± 0.754      100 ± 0.754      13 ± 0
Active Observing Control    104 ± 0.767      98.3 ± 0.735     3.62 ± 0.0968

Acrobot
Policy                      Utility U        Reward R         Observations O
Random                      0 ± 0            0 ± 0            50 ± 0
Discrete Planning           67.2 ± 6.46      67.2 ± 6.46      50 ± 0
Discrete Monitoring         142 ± 6.4        40 ± 6.37        5.48 ± 0.111
Continuous Planning         100 ± 6.74       100 ± 6.74       50 ± 0
Active Observing Control    157 ± 6.93       54.5 ± 6.97      5.29 ± 0.0948

Cartpole
Policy                      Utility U        Reward R         Observations O
Random                      0 ± 0            0 ± 0            50 ± 0
Discrete Planning           117 ± 9.05       117 ± 9.05       50 ± 0
Discrete Monitoring         709 ± 36.9       -97.1 ± 36.9     6 ± 0
Continuous Planning         100 ± 2.51       100 ± 2.51       50 ± 0
Active Observing Control    825 ± 2.35       0.427 ± 2.35     5 ± 0

Pendulum
Policy                      Utility U        Reward R         Observations O
Random                      0 ± 0            0 ± 0            50 ± 0
Discrete Planning           51 ± 3.13        51 ± 3.13        50 ± 0
Discrete Monitoring         249 ± 1.18       13.6 ± 1.94      6.3 ± 0.24
Continuous Planning         100 ± 2.87       100 ± 2.87       50 ± 0
Active Observing Control    277 ± 1.28       39.4 ± 1.25      5.99 ± 0.0527

L.9 Extending AOC to Work with a Learned Reward Model

To investigate whether AOC can also be used with a learned reward model, we performed an additional experiment in which we trained an MLP reward model (a 4-layer MLP with 128 units and Tanh activations) on the offline dataset and used it in all the benchmarks; this is tabulated in Table 22.
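Below is a minimal PyTorch sketch of a reward model matching the architecture stated above (a 4-layer MLP with 128 hidden units and Tanh activations); the input/output conventions and the training loop details are assumptions for illustration, not our released training code.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """4-layer MLP with 128 hidden units and Tanh activations that maps a
    (state, action) pair to a scalar reward estimate."""

    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)


# Illustrative supervised fit on (state, action, reward) tuples, shown here
# with random stand-in data in place of the offline dataset D.
state_dim, action_dim = 4, 1
model = RewardModel(state_dim, action_dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
s = torch.randn(256, state_dim)
a = torch.randn(256, action_dim)
r = torch.randn(256)
for _ in range(10):
    loss = nn.functional.mse_loss(model(s, a), r)
    opt.zero_grad()
    loss.backward()
    opt.step()
```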
We observe that AOC still achieves a high average utility U on the Cancer environment, outperforming the competing Continuous Planning and Discrete Monitoring methods. This empirically verifies that our proposed initial approach can still perform well with a learned reward model, and that the theoretical contribution still holds.

Table 22: Results using a learned MLP reward model (a 4-layer MLP with 128 units and Tanh activations) trained on the offline dataset and used in all the benchmarks. We observe that AOC still achieves a high average utility U on the Cancer environment, outperforming the competing Continuous Planning and Discrete Monitoring methods. Normalized utilities U, rewards R, and observations O for the benchmark methods on the Cancer environment. Results are averaged over 1,000 random seeds, with ± indicating 95% confidence intervals. Utilities and rewards are undiscounted and normalized to be between 0 and 100, where 0 corresponds to a Random agent and 100 corresponds to the expert, that of Continuous Planning, taking state observations at every δa.

Policy                      Utility U        Reward R         Observations O
Random                      0 ± 0            0 ± 0            13 ± 0
Discrete Planning           91.4 ± 0.383     91.4 ± 0.383     13 ± 0
Discrete Monitoring         95.1 ± 0.4       91.6 ± 0.392     7.63 ± 0.0631
Continuous Planning         100 ± 0.151      100 ± 0.151      13 ± 0
Active Observing Control    105 ± 0.178      98.9 ± 0.168     3.38 ± 0.0303

L.10 AOC Also Empirically Works for Non-linear State Transformations

We performed an additional experiment to investigate AOC's performance in environments whose observations stem from non-linear state transformations. We tailored the existing Cancer environment to render observations via the non-linear state transformation function z(t) = 0.1(s(t) + ϵ(t))² + (s(t) + ϵ(t)). The results, computed across 1,000 random seeds, are outlined in Table 23. We observe that AOC still achieves a high average utility U on this modified environment, outperforming the competing Continuous Planning and Discrete Monitoring methods. Specifically, this further empirically verifies our key theoretical contribution that regular observing is not optimal and that irregular observing can achieve a higher expected utility.
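To make the observation model concrete, below is a minimal sketch of the non-linear state transformation used in this experiment, applied element-wise to a noisy state; the noise scale is an illustrative assumption, as the experiment's exact noise process is not restated here.

```python
import numpy as np

def nonlinear_observation(s, rng, noise_std=0.01):
    """Render an observation z(t) = 0.1*(s + eps)**2 + (s + eps),
    where eps is additive observation noise (scale is illustrative)."""
    eps = rng.normal(scale=noise_std, size=s.shape)
    noisy = s + eps
    return 0.1 * noisy**2 + noisy

rng = np.random.default_rng(0)
s = np.array([1.0, 2.0, 0.5, 0.0])  # example state vector
print(nonlinear_observation(s, rng))
```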
Table 23: We adapted the existing Cancer environment to render observations via the non-linear state transformation function z(t) = 0.1(s(t) + ϵ(t))² + (s(t) + ϵ(t)). We observe that AOC still achieves a high average utility U, outperforming the competing Continuous Planning and Discrete Monitoring methods. Normalized utilities U, rewards R, and observations O for the benchmark methods on the Cancer environment. Results are averaged over 1,000 random seeds, with ± indicating 95% confidence intervals. Utilities and rewards are undiscounted and normalized to be between 0 and 100, where 0 corresponds to a Random agent and 100 corresponds to the expert, that of Continuous Planning, taking state observations at every δa.

Policy                      Utility U        Reward R         Observations O
Random                      0 ± 0            0 ± 0            13 ± 0
Discrete Planning           91.7 ± 0.397     91.7 ± 0.397     13 ± 0
Discrete Monitoring         90.5 ± 0.559     85.4 ± 0.548     5.25 ± 0.0329
Continuous Planning         100 ± 0.21       100 ± 0.21       13 ± 0
Active Observing Control    104 ± 0.221      98.9 ± 0.208     4.68 ± 0.0315

L.11 Further Empirical Validation in Other Real-World Scenarios

To verify the theoretical contribution in additional real-world scenarios, we show empirically that AOC can successfully operate in further real-world (medical and engineering) environments: an accurate Glucose environment (controlling the injected insulin for a diabetic patient [Lenhart and Workman, 2007]), an HIV (Human Immunodeficiency Virus) environment (controlling the chemotherapy dose that affects the infectivity of HIV in a patient [Butler et al., 1997]), and a Quadrotor environment (controlling the actuators of an unmanned aerial vehicle [Nonami et al., 2010]). Each environment represents a unique challenge and has direct implications for medical and engineering applications, reflecting the broad applicability of our method. These are:

Glucose environment. Controlling the injected insulin for a diabetic patient to regulate their blood glucose level; here observations are costly as a blood test must be performed to measure the glucose level [Lenhart and Workman, 2007].

HIV environment. Controlling the chemotherapy dose that affects the infectivity of HIV in a patient, where observations are costly as a blood test must be performed to measure CD4+ T cell levels [Butler et al., 1997].

Quadrotor environment. Controlling the actuators of an unmanned aerial vehicle, where observations can be costly due to performing an expensive (in power and compute) localization measurement [Nonami et al., 2010].

We observe that AOC still achieves a high average utility U in all of these environments, outperforming the competing Continuous Planning and Discrete Monitoring methods, reinforcing our theoretical claims and extending our empirical validation, as tabulated in Table 24.

Table 24: Normalized utilities U, rewards R, and observations O for the benchmark methods, across each environment. AOC performs best across all environments. Results are averaged over 1,000 random seeds, with ± indicating 95% confidence intervals. Utilities and rewards are undiscounted and normalized to be between 0 and 100, where 0 corresponds to a Random agent and 100 corresponds to the expert, that of Continuous Planning, taking state observations at every δa.

Glucose
Policy                      Utility U            Reward R          Observations O
Random                      0 ± 0                0 ± 0             50 ± 0
Discrete Planning           96 ± 0.485           96 ± 0.485        50 ± 0
Discrete Monitoring         120 ± 0.489          92.9 ± 0.493      15.7 ± 0.0466
Continuous Planning         100 ± 0.39           100 ± 0.39        50 ± 0
Active Observing Control    126 ± 0.41           99.9 ± 0.39       17.5 ± 0.0336

HIV
Policy                      Utility U            Reward R          Observations O
Random                      0 ± 0                0 ± 0             50 ± 0
Discrete Planning           152 ± 0.121          152 ± 0.121       50 ± 0
Discrete Monitoring         2.77e+03 ± 0.455     126 ± 0.455       7 ± 0
Continuous Planning         100 ± 0.431          100 ± 0.431       50 ± 0
Active Observing Control    2.83e+03 ± 1.56      107 ± 0.878       5.75 ± 0.0269

Quadrotor
Policy                      Utility U            Reward R          Observations O
Random                      0 ± 0                0 ± 0             50 ± 0
Discrete Planning           101 ± 0.00882        101 ± 0.00882     50 ± 0
Discrete Monitoring         1.79e+03 ± 0.199     101 ± 0.0142      5.99 ± 0.00518
Continuous Planning         100 ± 0.015          100 ± 0.015       50 ± 0
Active Observing Control    1.83e+03 ± 0.0789    97.9 ± 0.0258     5 ± 0.00196