# Deep Recurrent Optimal Stopping

Niranjan Damera Venkata, Digital and Transformation Organization, HP Inc., Chennai, India. niranjan.damera.venkata@hp.com
Chiranjib Bhattacharyya, Dept. of CSA and RBCCPS, Indian Institute of Science, Bangalore, India. chiru@iisc.ac.in

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Deep neural networks (DNNs) have recently emerged as a powerful paradigm for solving Markovian optimal stopping problems. However, a ready extension of DNN-based methods to non-Markovian settings requires significant state and parameter space expansion, manifesting the curse of dimensionality. Further, efficient state-space transformations permitting Markovian approximations, such as those afforded by recurrent neural networks (RNNs), are either structurally infeasible or are confounded by the curse of non-Markovianity. Considering these issues, we introduce, for the first time, an optimal stopping policy gradient algorithm (OSPG) that can leverage RNNs effectively in non-Markovian settings by implicitly optimizing value functions without recursion, mitigating the curse of non-Markovianity. The OSPG algorithm is derived from an inference procedure on a novel Bayesian network representation of discrete-time non-Markovian optimal stopping trajectories and, as a consequence, yields an offline policy gradient algorithm that eliminates expensive Monte Carlo policy rollouts.

1 Introduction

In the classic optimal stopping setting, an agent monitors the trajectory of a stochastic process, with the opportunity to either continue observing or stop and claim a reward (or suffer a cost), which is generally a function of process history. Once the agent decides to stop, no further actions are possible, and no further reward can be claimed (or cost suffered). The agent's goal is to produce stopping decisions based on the process observations that maximize the expected reward (or minimize the expected cost). For example, whether or not to exercise a stock option at any time is an optimal stopping problem where the goal is to maximize expected exercise value, as is the decision of when to replace a degrading machine. Optimal stopping methods have a wide array of applications in finance [5, 2], operations research [17, 24, 10], disorder/change-point detection [36, 32], and early classification of time series [13, 1], among others.

In this paper, we are concerned with the discrete-time, finite-horizon, model-free setting, where the process evolves over discrete time-steps and a decision to stop must occur by a given finite horizon. Further, the dynamics of process evolution are unknown. This setting is in contrast with classic approaches [8, 28, 35, 14, 36], which assume that the dynamics of process evolution are known and characterized with simplified and usually Markovian stochastic models. Note that continuous-time optimal stopping problems, including the pricing of American options, may often be reduced to the discrete-time setting by suitable discretization [18].

Solving optimal stopping problems in non-Markovian settings is challenging due to the following curses:

The curse of dimensionality [3, 4, 33]: A non-Markovian process may be made Markovian by extending the state space with process history. However, such an extension significantly increases the problem's dimensionality, complicating function approximation. Estimating high-dimensional value functions directly rules out conventional approaches that approximate value functions with a linear combination of basis functions [40, 23, 39, 43].
The curse of non-Markovianity [41, 6]: One may mitigate the curse of dimensionality by approximating a non-Markovian process with a Markovian process on a transformed state space. For example, RNNs can learn to summarize relevant process history into a Markovian hidden-state process efficiently. However, state-of-the-art Markovian optimal stopping methods [19, 20, 2, 16] rely on the recursive computation of value functions (see Section 2), which propagates approximation bias in non-Markovian settings, a phenomenon known as the curse of non-Markovianity [6].

Existing deep neural network approaches: Deep neural network approaches to optimal stopping [19, 20, 2, 16] have emerged as a powerful way of handling high-dimensional Markovian optimal stopping problems. Existing model-free methods compute value function approximations relying on a recursive temporal relationship between value functions at successive time-steps in Markovian settings, embodied by the Wald-Bellman equation [36] or a closely related equivalent (see Section 2). Backward induction methods [19, 20, 2, 16] start from the last time-step and recurse backward sequentially through time. Non-Markovian settings force backward induction methods to augment their state space with process history [2], significantly exacerbating the curse of dimensionality, since elegant approaches such as RNNs, which share parameters across time-steps, are not an option for backward induction methods: if the value function of time-step j+1 is already optimized, it cannot change when we fit the value function at time-step j. Further, the inability to share parameters across time-steps means the parameter space grows linearly with the number of time-steps processed, thereby reducing sample efficiency. A notable exception is the method of Herrera et al. [16], which uses an RNN. However, the RNN weights are random and not learnable, so that backward induction can proceed. Only the final hidden-layer weights are fit, and these are not shareable across time-steps. Fitted Q-iteration (FQI) methods [34, 16] start with bootstrapped value functions at all time-steps and recursively update these for temporal consistency with the Wald-Bellman equation. While they can use RNN-based function approximators, they suffer from the curse of non-Markovianity.

We note that a body of work uses deep neural networks to solve continuous-time optimal stopping problems by formulating them as free-boundary partial differential equations (PDEs). These include methods such as the deep Galerkin method [37] and the backward stochastic differential equation method [7]. We consider such methods model-based since they start with a specific PDE to be solved that assumes prior knowledge of process evolution.

Outline of contributions: The above context points to (1) using a suitable characterization of state space (as afforded by RNNs) to mitigate the curse of dimensionality and (2) direct policy estimation methods, such as policy gradients, to mitigate the curse of non-Markovianity. Our approach brings together, for the first time, probabilistic graphical models, policy gradient methods [42], and RNNs to design effective stopping policies for non-Markovian settings. Policy gradient methods are notably missing from the optimal stopping literature.
We make the following contributions:

We present a reward augmented trajectory model (RATM) (see Section 3), a novel Bayes net parameterization of non-Markovian state-action-reward trajectories encountered in optimal stopping, where conditional probability distributions (CPDs) encode stochastic policy actions and reward possibilities. In particular, this formulation allows the stochastic policy to share parameters across time-steps, allowing RNN-based CPD approximations, which mitigate the curse of dimensionality issues discussed earlier. Also, the problem of finding an optimal stopping policy reduces to a direct policy optimization procedure based on the E-M algorithm (Theorem 3.1 and Corollary 3.1.1) that leverages inference over the Bayes net. This mitigates the curse of non-Markovianity associated with recursive value function estimation. Further, modeling optimal stopping with graphical models opens new avenues. As noted by Levine [22] in the context of reinforcement learning, "The extensibility and compositionality of graphical models can likely be leveraged to produce more sophisticated reinforcement learning methods, and the framework of probabilistic inference can offer a powerful toolkit for deriving effective and convergent learning algorithms for the corresponding models"; for instance, augmentation with additional latent variables may be explored to model process disorders such as change-points.

Inference on the RATM yields a policy gradient algorithm when we take an incremental view of the E-M algorithm (see Section 4, Theorem 4.1 and Proposition 4.1). We call this algorithm the optimal stopping policy gradient (OSPG) method, the first policy gradient algorithm, to the best of our knowledge, for computing optimal stopping policies suitable for non-Markovian settings. A key benefit of OSPG is that, unlike classic policy gradients [42], it is an offline algorithm and does not require Monte Carlo policy rollouts. Crucially, we show (Corollary 4.1.1) that our procedure learns a policy that implicitly optimizes value functions at every time-step without requiring recursive value function computation, thereby mitigating the curse of non-Markovianity.

We introduce a new loss function based on OSPG that is well suited to learning RNN-based stopping policies with mini-batch stochastic gradient descent (Algorithm 1). This makes OSPG (unlike E-M) practical for large datasets while leveraging powerful off-the-shelf optimizers such as Adam. The OSPG loss considers an entire process trajectory and does not decompose across time-steps, allowing policy decisions across a trajectory to be evaluated jointly rather than independently.

2 Discrete-time finite-horizon optimal stopping

We represent the stochastic process monitored by the agent with $d$-dimensional, fully observable environment state random variables $\{S_j\}$. Process history up to and including time-step $j$ is represented by the $d \times (j+1)$ random matrix $\mathbf{S}_j := [S_0, S_1, \ldots, S_j]$. Associated with the state history is a reward process $\{R_j\}$ with $R_j = g_j(\mathbf{S}_j)$, where $g_j : \mathbb{R}^{d \times (j+1)} \mapsto \mathbb{R}_+$ is a reward function that maps observation history to a reward for stopping at time-step $j$. Uppercase letters denote random variables, while lowercase letters represent their realizations. Bold font is used to indicate histories; thus $\mathbf{s}_j$ is an observed partial trajectory from $\mathbf{S}_j$. In case we need to refer to a specific trajectory, we use an extra index $i$; for example, $r_{ij}$ refers to the observed reward for stopping at time-step $j$ when monitoring the $i$th process trajectory.
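To make the notation concrete, the following minimal NumPy sketch (our own illustration, not code from the paper) stores a batch of trajectories as an array of shape (N, H+1, d) and evaluates a hypothetical history-dependent reward function $g_j$ (here, the running maximum of the first coordinate) to form the reward matrix $[r_{ij}]$:

```python
import numpy as np

def reward_matrix(S, g):
    """Evaluate r_ij = g_j(s_i0, ..., s_ij) for every trajectory i and step j.

    S: array of shape (N, H+1, d) holding N observed trajectories.
    g: callable taking the history S[:, :j+1, :] and returning rewards of shape (N,).
    """
    N, H1, _ = S.shape
    R = np.empty((N, H1))
    for j in range(H1):
        R[:, j] = g(S[:, : j + 1, :])          # reward for stopping at step j
    return R

# hypothetical reward: running maximum of the first state coordinate
g_running_max = lambda hist: hist[:, :, 0].max(axis=1)

S = np.random.default_rng(0).normal(size=(4, 10, 3))   # N=4, H=9, d=3
R = reward_matrix(S, g_running_max)                     # shape (4, 10)
```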
Definition 2.1 (stopping policy). A stopping policy is defined as a sequence of functions $\phi = (\phi_1, \phi_2, \ldots, \phi_j, \ldots)$, where $\phi_j : \mathbb{R}^{d \times (j+1)} \mapsto \{0, 1\}$ maps the history of observations at time-step $j$ to a decision of whether to stop ($\phi_j(\mathbf{s}_j) = 1$) or not ($\phi_j(\mathbf{s}_j) = 0$).

Definition 2.2 (stochastic stopping policy). A stochastic stopping policy is defined as a sequence of functions $\varphi = (\varphi_1, \varphi_2, \ldots, \varphi_j, \ldots)$, where $\varphi_j : \mathbb{R}^{d \times (j+1)} \mapsto [0, 1]$ maps the history of observations at time-step $j$ to a probability of stopping at time $j$.

Note that a stochastic policy may be converted to a deterministic stopping policy, to make stopping decisions, by augmenting the state space with i.i.d. virtual observations of a random variable $U$ uniformly distributed in $[0, 1]$. The equivalent deterministic policy is $\phi_j(\mathbf{s}_j) := \mathbb{I}(\varphi_j(\mathbf{s}_j) \geq u_j)$. This paper considers the finite-horizon setting where $j \leq H \in \mathbb{N}$, so a decision to stop must be made at or before a finite horizon $H$. Thus we define $\phi_H(\mathbf{S}_H) := 1$. The decision to stop is based on a stochastic process, making the stopping time a random variable.

Definition 2.3 (policy stopping time [31]). Stopping occurs when the policy first triggers a stop action. This occurs at the random time $\tau := \min\{0 \leq j \leq H : \phi_j(\mathbf{S}_j) = 1\}$, called the stopping time¹.

With this background, we can formally define the optimal stopping problem:

Definition 2.4 (optimal stopping problem [36]). Solve for the optimal stopping time $\tau^*$ (if one exists) for which $\mathbb{E}[R_{\tau^*}] = \sup_{\tau} \mathbb{E}[R_\tau]$, where $R_j = g_j(\mathbf{S}_j)$ is a reward process, $\tau$ is a stopping-time random variable, and $\{S_j\}$ is a stochastic process.

Following Shiryaev and Peskir [36, 28], in the finite-horizon, discrete-time setting, the existence of a finite optimal stopping time is guaranteed when $\mathbb{E}\left[\sup_{j \leq H} |R_j|\right] < \infty$. The optimal stopping time can then be determined by the method of backward induction, which recursively constructs the Snell envelope² comprising random variables $\{U_j\}$:

$$U_H := R_H, \qquad U_j = \max\left\{R_j, \; \mathbb{E}[U_{j+1} \mid \mathbf{S}_j]\right\}, \quad j = H-1, \ldots, 0 \qquad \text{(Snell envelope)}$$

The optimal stopping time is given by $\tau^* = \min\{0 \leq j \leq H : U_j = R_j\}$. From a practical perspective, the expectations in the recursion are hard to estimate since they depend on process history, manifesting the curse of dimensionality.

¹ This is equivalent (see [15]) to the classic definition [36] of a stopping time w.r.t. a filtration $\{\mathcal{F}_j\}$ as a random variable such that $\{\tau = j\} \in \mathcal{F}_j \; \forall j$. Since $\{\mathcal{F}_j\}$ is the minimal filtration generated by $\{S_j\}$ in our setting, we can determine whether $\tau = j$ by considering $\mathbf{s}_j$. We use lowercase $\tau$ for consistency with the literature.
² The Snell envelope is also the smallest super-martingale that dominates/envelopes the reward process [36].

[Figure 1: Bayesian networks representing (a) state-history ($\mathbf{S}_j$) and action ($A_j$) trajectories and (b) the corresponding reward augmented trajectory model for optimal stopping.]

The assumption of Markovian state evolution, i.e. $P(S_{j+1} \mid \mathbf{S}_j) = P(S_{j+1} \mid S_j)$, and of a reward process expressible as a function $G_j$ of the current state, i.e. $R_j = G_j(S_j)$, directly leads to the much more tractable Wald-Bellman equation [28, 35, 36]:

$$V_H(s) := G_H(s), \qquad V_j(s) = \max\left\{G_j(s), \; T_j V_{j+1}(s)\right\}, \quad j < H \qquad \text{(Wald-Bellman equation)}$$

The value functions $V_j(s)$ give the maximum expected reward obtainable from time $j$ having observed state $s$. Thus $V_j(s) = \sup_{\tau \geq j} \mathbb{E}_s[R_\tau]$. The operator $T_j$ is defined by $T_j F(s) := \mathbb{E}_{j,s}[F(S_{j+1})]$ for any function $F$ of the state. The Snell envelope reduces to $U_j = V_j(S_j)$ in this case.
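For intuition, the following is a minimal sketch (our own illustration, assuming a scalar Markovian state and a polynomial basis; not the authors' code) of this regression-based Wald-Bellman recursion, in the spirit of the least-squares approaches of [23, 39, 40]:

```python
import numpy as np

def wald_bellman_backward(S, R, degree=2):
    """Estimate value functions by the Wald-Bellman recursion on sampled trajectories.

    S: (N, H+1) Markovian scalar states; R: (N, H+1) rewards G_j(S_j).
    The conditional expectation T_j V_{j+1} is approximated by a polynomial
    least-squares regression of V_{j+1}(S_{j+1}) on features of S_j.
    Returns per-trajectory estimates of V_0(S_0).
    """
    V = R[:, -1].copy()                               # V_H = G_H
    for j in range(R.shape[1] - 2, -1, -1):           # j = H-1, ..., 0
        X = np.vander(S[:, j], degree + 1)            # polynomial features of S_j
        beta, *_ = np.linalg.lstsq(X, V, rcond=None)
        K = X @ beta                                  # continuation value estimate T_j V_{j+1}(S_j)
        V = np.maximum(R[:, j], K)                    # V_j = max{G_j, T_j V_{j+1}}
    return V
```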
We define $K_j(s) := T_j V_{j+1}(s)$ as the continuation value function, which is the maximum expected reward one can obtain by choosing to continue at step $j$ having observed state $s$. The optimal stopping policy is given by:

$$\phi^*_H(S_H) := 1, \qquad \phi^*_j(S_j) = \mathbb{I}\left(G_j(S_j) \geq K_j(S_j)\right), \quad j = 0, 1, \ldots, H-1 \tag{1}$$

The value functions are, therefore, related to the optimal policy as follows:

$$V_j(S_j) = \phi^*_j(S_j) G_j(S_j) + (1 - \phi^*_j(S_j)) K_j(S_j) \tag{2}$$

Thus, if we can estimate the continuation value functions $K_j$, we can trigger optimal stopping decisions using equation (1). Note that this requires knowledge of the reward function $G_j$ at inference time.

Backward induction approaches typically first estimate $K_{H-1}$ to approximate $K_{H-1}(S_{H-1}) \approx G_H(S_H)$ and then use either the Wald-Bellman equation [40, 19] or equations (1) and (2) [23, 20, 16] to produce approximation targets $V_{H-1}(S_{H-1})$, which are in turn used as targets to fit $K_{H-2}(S_{H-2}) \approx V_{H-1}(S_{H-1})$; the process continues backwards recursively. As a variation on this theme, Becker et al. [2] first set $K_{H-1}(S_{H-1}) = G_H(S_H)$ and then estimate $\phi^*_{H-1}(s)$ in equation (2) by maximizing the right-hand side of equation (2) with $j = H-1$ and $\phi_{H-1}$ replacing $\phi^*_{H-1}$. Equation (2) is then used to compute $V_{H-1}(S_{H-1})$, which is in turn used to set $K_{H-2}(S_{H-2}) = V_{H-1}(S_{H-1})$, and the process proceeds recursively. Fitted Q-iteration methods [39, 40, 43] use bootstrapped continuation value functions $K_j$ to produce $V_j(S_j)$ at all time-steps using the Wald-Bellman equation. Then, updated continuation value functions $K_j$ are fit simultaneously across all time-steps to approximate $V_{j+1}(S_{j+1})$, and the process continues recursively until convergence.

If Markovian assumptions do not hold, the states $S_j$ need to be replaced with histories $\mathbf{S}_j$, exploding the state space and significantly complicating function approximation (the curse of dimensionality). The recursive procedures discussed above propagate bias across time-steps when using Markovian approximations, since bias in value estimation at one time-step infects the preceding time-step (the curse of non-Markovianity).

3 A Bayesian network view of optimal stopping

This section presents a new Bayesian network view of optimal stopping. While we draw inspiration from a corresponding view of general reinforcement learning (RL) [22], the modeling choices needed to capture the structure of optimal stopping make the approaches distinct.

State-action trajectory model: Figure 1(a) shows a Bayes net model of state-action trajectories in optimal stopping. An action $A_j$ at any time-step $j$ depends on the state history $\mathbf{S}_j$ up to that time-step, permitting general non-Markovian stopping policies. Further, actions are restricted to 0 (continue) or 1 (stop), with the stop action terminating the action trajectory. Therefore, a key feature of optimal stopping problems is that, unlike general reinforcement learning, actions do not impact the probability distribution of the following states over the control horizon, conditioned on state history. We do not model the stopping time $\tau$ explicitly. Instead, it is a consequence of stochastic policy actions over the horizon. The action trajectory stops at a random stopping time $\tau = \min\{0 \leq j \leq H : A_j = 1\}$. Therefore, we have the special structure $A_\tau = 1$ and $A_j = 0, \; j < \tau$. Note that Figure 1(a) represents a family of stopping trajectories $\mathbf{A}_\tau$, each member stopping at a specific value of $\tau$. We denote a specific stopping trajectory by $\mathbf{A}_j$. Consider stochastic policies given by the functions $\varphi_j(\mathbf{S}_j) := P(A_j = 1 \mid \mathbf{S}_j)$. We define $\psi_j(\mathbf{S}_j) := P(\tau = j \mid \mathbf{S}_j)$ to represent the stopping-time distribution.
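Operationally, a stochastic policy induces action trajectories with exactly this structure by thresholding against i.i.d. uniform draws (Definition 2.2) and stopping at the first triggered stop action (Definition 2.3). A minimal sketch of this conversion (our own illustration, with hypothetical probabilities; not code from the paper):

```python
import numpy as np

def rollout_stopping_time(phi, rng):
    """Turn per-step stopping probabilities phi_j(s_j), j = 0..H-1, into a
    stopping decision: draw u_j ~ U[0,1] and stop at the first j with
    phi_j(s_j) >= u_j; appending phi_H := 1 forces a stop at the horizon."""
    phi = np.append(np.asarray(phi, dtype=float), 1.0)   # enforce phi_H = 1
    u = rng.uniform(size=phi.shape[0])
    stop = phi >= u                                       # equivalent deterministic policy
    return int(np.argmax(stop))                           # first index where stop is True

rng = np.random.default_rng(0)
tau = rollout_stopping_time([0.1, 0.3, 0.05, 0.6], rng)   # hypothetical probabilities, H = 4
```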
We have the following key relationship between action trajectories and the stopping-time random variable, encapsulated in the following lemma, adapted from [14] and proved in Appendix A.

Lemma 3.1 (trajectory reparameterization lemma).

$$P(\tau = j \mid \mathbf{S}_H) = \psi_j(\mathbf{S}_j) = P(\mathbf{A}_j \mid \mathbf{S}_j) = \begin{cases} \varphi_0(\mathbf{S}_0) & \text{if } j = 0 \\ \varphi_j(\mathbf{S}_j) \prod_{n=0}^{j-1} (1 - \varphi_n(\mathbf{S}_n)) & \text{if } 0 < j < H \\ \prod_{n=0}^{H-1} (1 - \varphi_n(\mathbf{S}_n)) & \text{if } j = H \end{cases}$$

Reward augmented trajectory model: Figure 1(b) shows an augmentation of the basic state-action trajectory model that brings the notion of stopping rewards into the model. The variables $R_j$ represent the reward process, which are functions of state history: recall $R_j = g_j(\mathbf{S}_j)$. We introduce $Z_\tau$ as a family of binary random variables representing whether a reward is obtained ($Z_\tau = 1$) or not ($Z_\tau = 0$) when we stop at various values of $\tau$. Thus stopping at $j$ no longer guarantees reward $R_j$. The idea is to use the random variables $Z_\tau$ to transform absolute rewards, which are not readily interpreted as probabilities, into probabilities (in this case, probabilities of obtaining a reward at any time-step) that may be represented in the Bayes net. We parameterize the conditional probability distribution of $Z_\tau$ with Bernoulli distributions defined as follows:

$$P(Z_\tau = 1 \mid \mathbf{R}_H, \mathbf{A}_\tau) = \bar{R}_\tau := \bar{r}(\tau, \mathbf{R}_H), \qquad P(Z_\tau = 0 \mid \mathbf{R}_H, \mathbf{A}_\tau) = 1 - \bar{R}_\tau \tag{3}$$

where $\bar{r} : \{0, 1, \ldots, H\} \times (\mathbb{R}_+)^{H+1} \mapsto [0, 1]$ is a function that transforms real-valued stopping rewards into probabilities encoding the relative possibility of obtaining rewards over a trajectory by stopping at a particular time-step of the trajectory. We may write the joint distribution of a trajectory stopping at time $\tau$ as:

$$P(\mathbf{A}_\tau, Z_\tau, \mathbf{R}_H, \mathbf{S}_H) = \underbrace{P(S_0) P(R_0 \mid \mathbf{S}_0) \prod_{j=1}^{H} P(S_j \mid \mathbf{S}_{j-1}) P(R_j \mid \mathbf{S}_j)}_{P(\mathbf{S}_H) P(\mathbf{R}_H \mid \mathbf{S}_H)} \; \underbrace{\prod_{n=0}^{\tau} P(A_n \mid \mathbf{S}_n)}_{P(\mathbf{A}_\tau \mid \mathbf{S}_\tau)} \; P(Z_\tau \mid \mathbf{A}_\tau, \mathbf{R}_H) = P(\mathbf{S}_H) P(\mathbf{R}_H \mid \mathbf{S}_H) P(\tau \mid \mathbf{S}_H) P(Z_\tau \mid \mathbf{A}_\tau, \mathbf{R}_H) \tag{4}$$

We have used the reparameterization lemma 3.1 to replace the probability distribution over actions with the corresponding induced probability distribution over stopping times. Conditioning on $\mathbf{R}_H$ and $\mathbf{S}_H$, we have:

$$P(\mathbf{A}_\tau, Z_\tau \mid \mathbf{R}_H, \mathbf{S}_H) = P(\tau \mid \mathbf{S}_H) P(Z_\tau \mid \mathbf{A}_\tau, \mathbf{R}_H) \tag{5}$$

Optimal policy estimation: We are particularly interested in the conditional probability of getting a reward when we stop. We can obtain this by using Bayes net inference to obtain equation (5) and then marginalizing over the family of stopping trajectories $\mathbf{A}_\tau$. Since there is a one-to-one correspondence between $\mathbf{A}_\tau$ and the stopping time $\tau$, we just need to sum over all possible stopping times. Thus, if we define $Z := Z_0 \oplus Z_1 \oplus \cdots \oplus Z_H$, where $\oplus$ is the XOR operator³, we have, from equation (5):

$$P(Z = 1 \mid \mathbf{R}_H, \mathbf{S}_H) = \sum_{j=0}^{H} P(\tau = j \mid \mathbf{S}_H) P(Z_j = 1 \mid \mathbf{A}_j, \mathbf{R}_H) = \sum_{j=0}^{H} \psi_j(\mathbf{S}_j) \bar{R}_j \tag{6}$$

Ideally, we would want our policy to maximize this conditional probability. To this end, we consider parameterized policy functions $\varphi^\theta_j$ and induced stopping-time distributions $\psi^\theta_j$ and define the conditional likelihood function over a sample trajectory as:

$$l(\theta \mid \mathbf{s}_H, \mathbf{r}_H) = P(Z = 1 \mid \mathbf{s}_H, \mathbf{r}_H; \theta) = \sum_{j=0}^{H} \psi^\theta_j(\mathbf{s}_j) \, \bar{r}_j \tag{7}$$

³ We use the XOR operator since exactly one of the $\{Z_j\}$ can equal one: the reward for a trajectory can only be claimed by a stop action at a single time-step.

However, maximizing this likelihood over samples while allowing optimal stopping decisions on any trajectory ignores the relative magnitude of rewards between different trajectories. Thus, we seek to weight the likelihood of each trajectory. Therefore, considering a dataset $\mathcal{D}$ of environment trajectories and using index $i$ for the $i$th trajectory in the sample, we maximize the weighted likelihood:

$$\prod_{i=1}^{N} l(\theta \mid \mathbf{s}_{i,H}, \mathbf{r}_{i,H})^{w_i} \quad \text{with} \quad \sum_{i=1}^{N} w_i = 1 \tag{8}$$

where $w_i$ is the relative weight for the $i$th sample.
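To make equations (6)-(7) concrete, here is a minimal NumPy sketch (our own illustration, not the authors' code) that builds $\psi_j$ from per-step stopping probabilities via Lemma 3.1 and evaluates the per-trajectory likelihood, with rewards normalized within the trajectory so that $\bar{r}_j \in [0, 1]$; a Monte Carlo simulation of the induced action trajectories confirms the reparameterization:

```python
import numpy as np

def stopping_time_distribution(phi):
    """psi_j via Lemma 3.1; phi holds phi_0(s_0), ..., phi_{H-1}(s_{H-1})."""
    phi = np.asarray(phi, dtype=float)
    cont = np.concatenate([[1.0], np.cumprod(1.0 - phi)])   # prod_{n<j}(1 - phi_n)
    return np.concatenate([phi, [1.0]]) * cont               # phi_H := 1 gives psi_H

phi = np.array([0.2, 0.5, 0.1, 0.3])                         # hypothetical policy outputs, H = 4
psi = stopping_time_distribution(phi)                        # sums to 1 by construction

r = np.array([1.0, 3.0, 2.0, 0.5, 4.0])                      # hypothetical rewards r_0..r_H
likelihood = float(psi @ (r / r.sum()))                      # l(theta | s_H, r_H), eq. (7)

# Monte Carlo check of Lemma 3.1: sample A_j ~ Bernoulli(phi_j) and stop at the first 1
rng = np.random.default_rng(0)
draws = rng.uniform(size=(200_000, len(phi))) < phi
taus = np.where(draws.any(axis=1), draws.argmax(axis=1), len(phi))
empirical = np.bincount(taus, minlength=len(phi) + 1) / len(taus)
assert np.allclose(empirical, psi, atol=1e-2)
```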
Taking logarithms, we have the following objective to maximize:

$$J_{WML}(\theta) = \sum_{i=1}^{N} w_i \log \left( \sum_{j=0}^{H} \psi^\theta_j(\mathbf{s}_{ij}) \, \bar{r}_{ij} \right) \tag{9}$$

We may use the following Expectation-Maximization (E-M) procedure to maximize this objective:

Theorem 3.1 (E-M procedure for optimal stopping policy). The following iterative procedure converges to a local maximum of $J_{WML}(\theta)$, with monotonic increase in the objective over iterations $(k)$, starting from an initial value $\theta = \theta^{(0)}$:

$$q^{(k)}_{ij} = \frac{\psi^{\theta^{(k)}}_j(\mathbf{s}_{ij}) \, \bar{r}_{ij}}{\sum_{j'=0}^{H} \psi^{\theta^{(k)}}_{j'}(\mathbf{s}_{ij'}) \, \bar{r}_{ij'}}, \quad 0 \leq j \leq H, \; 1 \leq i \leq N \qquad \text{(E-step)}$$

$$J^{(k)}_M(\theta) = \sum_{i=1}^{N} \sum_{j=0}^{H} w^{(k)}_i q^{(k)}_{ij} \log \psi^\theta_j(\mathbf{s}_{ij}), \qquad \theta^{(k+1)} = \arg\max_\theta J^{(k)}_M(\theta) \qquad \text{(M-step)}$$

where $w^{(k)}_i = w_i \; \forall k$. The proof follows from an application of Jensen's inequality and is given in Appendix A. Dayan and Hinton pioneered the use of E-M methods for RL policy search [12].

Specification of $\{\bar{r}_{ij}\}$ and $w$: $\bar{r}_{ij}$ is a specification of the relative intra-trajectory probability of obtaining a reward when stopping at time-step $j$ of the $i$th trajectory. A natural choice (and the one we adopt in this paper) is to set $\bar{r}_{ij} := \frac{r_{ij}}{\sum_{j'=0}^{H} r_{ij'}}$, where $r_{ij}$ is a realization of $R_j = g_j(\mathbf{S}_j)$ corresponding to the $i$th trajectory. This formulation encourages stopping at time-steps of the trajectory in proportion to the achievable reward. Alternatively, formulations such as $\bar{r}_{ij} := \frac{r_{ij}}{\max_k r_{ik}}$ or $\bar{r}_{ij} := \frac{\exp(r_{ij})}{\sum_{k=0}^{H} \exp(r_{ik})}$ could be used to express different stopping preferences within a trajectory. The weights $w$, on the other hand, express relative inter-trajectory importance. One plausible choice would be to set $w_i = \frac{\sum_{j=0}^{H} r_{ij}}{\sum_{i'=1}^{N} \sum_{j=0}^{H} r_{i'j}}$, the relative reward available for extraction from a trajectory. Similarly, one could weight trajectories by the maximum reward achievable: $w_i = \frac{\max_j r_{ij}}{\sum_{i'} \max_j r_{i'j}}$.

We may also consider reweighting schemes that vary the weights over rounds of the E-M algorithm. In the RL literature, reward-weighted regression [30] takes such an approach. Reward-weighted regression typically conducts rollouts of the current policy and weights each resulting sample trajectory with the relative reward achieved by the current policy on that trajectory. In our setting, however, the trajectory reward under the current policy is not observed, since we do not perform policy rollouts. Instead, we weight each trajectory offline by its relative expected reward under the current policy. Specifically, we compute weights at each round according to a W-step:

$$w^{(k)}_i = \frac{\sum_{j=0}^{H} \psi^{\theta^{(k)}}_j(\mathbf{s}_{ij}) \, r_{ij}}{\sum_{i'=1}^{N} \sum_{j=0}^{H} \psi^{\theta^{(k)}}_j(\mathbf{s}_{i'j}) \, r_{i'j}}, \quad 1 \leq i \leq N \qquad \text{(W-step)}$$

This weighting is not arbitrary, as it leads to a surprising equivalence with a policy gradient method, established in the next section. Further, we have the following corollary (proof in Appendix A):

Corollary 3.1.1 (Reweighted E-M). Let $\tilde{r}_i = \frac{\sum_{j=0}^{H} r_{ij}}{\sum_{i'=1}^{N} \sum_{j=0}^{H} r_{i'j}}$. Using the reweighting scheme of the W-step in the E-M procedure of Theorem 3.1 achieves a local maximum of:

$$J_{WML}(\tilde{w}, \theta) = \sum_{i=1}^{N} \tilde{w}_i \log \left( \sum_{j=0}^{H} \psi^\theta_j(\mathbf{s}_{ij}) \, \bar{r}_{ij} \right) - \sum_{i=1}^{N} \tilde{w}_i \log \frac{\tilde{w}_i}{\tilde{r}_i}$$

with monotonic increase in the objective, starting from an initial value $\theta = \theta^{(0)}$. The second term is the KL-divergence between the distributions $\tilde{w}$ and $\tilde{r}$.

4 Optimal stopping policy gradients (OSPG)

In this section, we show that an incremental version of the E-M algorithm described in the previous section, with a single gradient step replacing a full M-step, is equivalent to a policy gradient method that maximizes expected reward. We call this the optimal stopping policy gradient (OSPG) method. Consider a direct maximization of the optimal stopping objective $J_{OS}(\theta) = \mathbb{E}[R_\tau]$ from Definition 2.4, where the expectation is over state-action trajectories.
We can use our Bayes net model of state-action trajectories without reward augmentation, shown in Figure 1(a), to obtain:

$$P(\mathbf{S}_H, \mathbf{A}_\tau \mid \theta) = P(S_0) \prod_{j=1}^{H} P(S_j \mid \mathbf{S}_{j-1}) \prod_{n=0}^{\tau} P(A_n \mid \mathbf{s}_n, \theta) = P(\mathbf{S}_H) P(\tau \mid \mathbf{S}_H, \theta) \tag{11}$$

where we have used the trajectory reparameterization lemma to reparameterize in terms of stopping times. Therefore, we may write:

$$J_{OS}(\theta) = \mathbb{E}_{\mathbf{s}_H \sim P(\mathbf{s}_H)} \, \mathbb{E}_{\tau \sim P(\tau \mid \mathbf{s}_H, \theta)} [R_\tau] = \mathbb{E}_{\mathbf{s}_H \sim P(\mathbf{s}_H)} \left[ \sum_{j=0}^{H} \psi^\theta_j(\mathbf{s}_j) \, r_j \right]$$

We leverage the methods used to establish the policy gradient theorem [42, 38] to derive a convenient form for the gradient of the optimal stopping objective w.r.t. the policy parameters $\theta$.

Theorem 4.1 (Optimal stopping policy gradient theorem). For any differentiable stopping policy $\varphi^\theta$, the gradient of the objective $J_{OS}(\theta) = \mathbb{E}[R_\tau]$ w.r.t. the policy parameters $\theta$, where $\{R_j\}$ is a reward process and $\tau$ is a stopping-time random variable, is given by:

$$\nabla_\theta J_{OS}(\theta) = \mathbb{E}_{\mathbf{s}_H \sim P(\mathbf{s}_H)} \left[ \sum_{j=0}^{H} r_j \, \psi^\theta_j(\mathbf{s}_j) \, \nabla_\theta \log \psi^\theta_j(\mathbf{s}_j) \right]$$

where the $\psi^\theta_j(\mathbf{s}_j)$ are functions of $\varphi^\theta$ obtained via Lemma 3.1. The proof is an application of the log-derivative trick [42] and is given in Appendix A.

Note that optimal stopping policy gradients are estimated using offline state trajectories only. Unlike the general RL case, we never need to do policy rollouts to sample trajectories. Instead, we explicitly calculate the probability distribution over all possible stopping actions (or stopping times) and use it to average over all possible stop-continue decisions. Also, an update to the policy parameters can be made without fresh data collection, since we can correctly (due to explicit Bayes net modeling) model the impact of the updated policy on the loss using the already sampled environment trajectories. Appendix B describes specialization to settings with costs instead of rewards.

We now establish a surprising equivalence between the reweighted E-M algorithm of Corollary 3.1.1 and the policy gradient method derived in this section when an incremental E-M solution approach [27] is used. Although the objective functions $J_{WML}(\tilde{w}, \theta)$ and $J_{OS}(\theta)$ are different in general, the same procedure maximizes both.

Proposition 4.1 (Incremental E-M as a policy gradient method). If a partial M-step is used in the reweighted E-M approach of Corollary 3.1.1, comprising a single gradient step, then the resulting incremental E-M procedure is equivalent to the optimal stopping policy gradient method of Theorem 4.1, each converging to a local maximum of both $J_{WML}(\tilde{w}, \theta)$ and $J_{OS}(\theta)$.

The proof is given in Appendix A. Thus, we may view the incremental E-M approach that optimizes a weighted maximum likelihood objective as a generalization of the policy gradient method that seeks to optimize the expected reward objective. The two are equivalent for a particular weighting scheme and Bayes net specification (as in Corollary 3.1.1). Further, the OSPG gradient has the following insightful form, obtained by applying Lemma 3.1 to the M-step objective of Theorem 3.1.

Corollary 4.1.1 (Value form of the OSPG gradient). Let $v_{ij} := \psi^\theta_j(\mathbf{s}_{ij}) r_{ij}$ and $k_{ij} := \sum_{n=j+1}^{H} v_{in}$. We may rewrite the policy gradient of Theorem 4.1 as:

$$\nabla_\theta J_{OS}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=0}^{H-1} \left[ \frac{v_{ij}(1 - \varphi^\theta_j(\mathbf{s}_{ij})) - k_{ij}\,\varphi^\theta_j(\mathbf{s}_{ij})}{\varphi^\theta_j(\mathbf{s}_{ij})(1 - \varphi^\theta_j(\mathbf{s}_{ij}))} \right] \nabla_\theta \varphi^\theta_j(\mathbf{s}_{ij}) \tag{13}$$

The proof is given in Appendix A. Here $v_{ij}$ is an empirical estimate of the immediate reward, while $k_{ij}$ is an empirical estimate of the continuation value under the current policy. The term in brackets is the derivative of the objective w.r.t. the policy output, evaluated at $\mathbf{s}_{ij}$. It is positive, and hence calls for increasing the stopping probability, if $\frac{v_{ij}}{k_{ij}} > \frac{\varphi^\theta_j(\mathbf{s}_{ij})}{1 - \varphi^\theta_j(\mathbf{s}_{ij})}$, i.e.
when the ratio of immediate to continuation value exceeds the current odds of stopping. Of course, the policy is updated considering expected immediate rewards and continuation values at all time-steps. The updated policy produces new estimates of immediate reward and continuation value, from which a new OSPG gradient is computed, and the process proceeds until convergence to an optimal policy. Thus, this process implicitly uses value functions at all time-steps to update the policy, which in turn implicitly further improves the value functions. Since there is no recursion across time-steps, OSPG is able to mitigate the curse of non-Markovianity.

Temporal loss functions for learning RNN stopping policies: We may directly adopt the M-step objective of Theorem 3.1 as a loss function for learning an RNN-based policy with mini-batch stochastic gradient descent (SGD), effective in non-Markovian settings. Algorithm 1 shows the pseudocode for computing our OSPG loss. This loss is ideally suited to learning stopping policies with RNNs since it takes as input stopping probabilities $\varphi^\theta_j(\mathbf{s}_{ij})$ at every time-step, each a function of the process observations up to that point. An RNN performs this mapping naturally since it maps input history to a hidden state at each time-step with recurrent computation. So, for example, return_sequences=True, coupled with a TimeDistributed Dense layer in Keras, returns a complete trajectory of RNN predictions for input into the loss. Note also that this is a temporal loss that considers the entire process trajectory and does not decompose across time-steps.

Algorithm 1: Pseudocode for mini-batch computation of our temporal OSPG loss
Input: $R := [r_{ij}]$, $\Phi := [\varphi^\theta_j(\mathbf{s}_{ij})]$  {$1 \leq i \leq N_b$, $0 \leq j \leq H$, $N_b$ is the batch size}
$\Psi_{:,0} = \Phi_{:,0}$, $\Psi_{:,H} = \exp\left(\sum_{j=0}^{H-1} \log(1 - \Phi_{:,j})\right)$  {numerically stable computation of $\psi^\theta_j(\mathbf{s}_{ij})$}
$\Psi_{:,1:H-1} = \exp\left(\log(1 - \Phi_{:,0:H-2})\,U + \log \Phi_{:,1:H-1}\right)$  {$U$ is unit upper triangular, implementing CUMSUM}
$V = \text{stop-gradient}(R \odot \Psi)$  {held constant, so no gradient; $\odot$ denotes the Schur product}
Output: $J = \frac{1}{N_b} \mathbf{1}^T \left[ V \odot \log \Psi \right] \mathbf{1}$

5 Experiments

We compare our OSPG algorithm using deep neural networks (DNN-OSPG) and recurrent neural networks (RNN-OSPG) against the following model-free, discrete-time optimal stopping approaches.

Backward induction: Deep Optimal Stopping (DOS) [2], including an extension of the state space with process history (DOS-ES) [2], and Randomized Recurrent Least Squares Monte Carlo (RRLSM) [16].

Fitted Q-iteration: While fitted Q-iteration (FQI) approaches in the optimal stopping literature, which rely on linear approximation with chosen basis functions [39, 40], break down in high-dimensional settings, they may be readily extended to leverage deep neural network (DNN-FQI) and recurrent neural network (RNN-FQI) function approximators. Appendix C provides details of our corresponding DNN-FQI and RNN-FQI baselines.

We conduct experiments with both Markovian and non-Markovian datasets. Markovian data include the pricing of high-dimensional Bermudan max-call options and American geometric-call options (relegated to Appendix D due to space). Non-Markovian data include stopping a fractional Brownian motion and the early stopping of a sequential multi-class classifier. Except for the American option pricing experiment, all DNN function approximators use two hidden layers of 20 hidden units each, while all RNN function approximators use an RNN/GRU with a single hidden layer of 20 units.
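For concreteness, the snippet below is a minimal TensorFlow/Keras sketch (our own rendering, not the authors' implementation) of how such an RNN stopping policy and the temporal OSPG loss of Algorithm 1 fit together; the small eps guards on the logarithms and the single-feature input are assumptions made for the sketch.

```python
import tensorflow as tf

def ospg_loss(rewards, phi, eps=1e-7):
    """Sketch of the temporal OSPG loss of Algorithm 1 (negated for minimization).

    rewards: (Nb, H+1) tensor of stopping rewards r_ij.
    phi:     (Nb, H+1) tensor of policy outputs; phi[:, H] is ignored since phi_H := 1.
    """
    log_cont = tf.math.log(1.0 - phi[:, :-1] + eps)                  # log(1 - phi_j), j < H
    cum_cont = tf.math.cumsum(log_cont, axis=1, exclusive=True)      # sum_{n<j} log(1 - phi_n)
    log_psi = tf.concat(
        [tf.math.log(phi[:, :-1] + eps) + cum_cont,                  # log psi_j, j < H
         tf.reduce_sum(log_cont, axis=1, keepdims=True)],            # log psi_H
        axis=1)
    v = tf.stop_gradient(rewards * tf.exp(log_psi))                  # V = R (.) Psi, held constant
    return -tf.reduce_mean(tf.reduce_sum(v * log_psi, axis=1))       # -J, averaged over the batch

# A hypothetical RNN policy producing phi_j(s_ij) at every time-step (d = 1 feature assumed):
policy = tf.keras.Sequential([
    tf.keras.layers.GRU(20, return_sequences=True, input_shape=(None, 1)),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1, activation="sigmoid")),
])
```

The policy outputs have shape (batch, time, 1) and would be squeezed to (batch, time) before being passed as phi. One natural extension is to subtract a per-trajectory baseline from the rewards inside the stop_gradient term; baseline subtraction for OSPG is treated in Appendix E.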
Backward induction methods (DOS, RRLSM) require independent networks at each time-step, while the other approaches share network parameters across time-steps. Details of hyper-parameter settings, model sizes, and compute times are provided in Appendix D.

Exercise of a Bermudan max-call option: In this Markovian setting, we seek optimal strategies to maximize the return of a Bermudan max-call option, a scenario described in [2]. Bermudan options may be exercised only at fixed times, unlike American options, which may be exercised at any time prior to option expiration. The payoff of these options depends on the maximum of $d$ underlying assets with multi-dimensional Black-Scholes dynamics and on the strike price $K$. The dynamics and payoff (reward) are given by:

$$S^m_t = s^m_0 \exp\left( [r - \delta_m - \sigma_m^2/2]\, t + \sigma_m W^m_t \right), \qquad R_t = e^{-rt} \left( \max_{1 \leq m \leq d} S^m_t - K \right)^{+} \tag{14}$$

for $m = 1, 2, \ldots, d$. Here $r \in \mathbb{R}$ is the risk-free rate, $s^m_0 \in (0, \infty)$ is the initial price, $\delta_m \in [0, \infty)$ is the dividend yield, $\sigma_m \in (0, \infty)$ is the volatility, and $W$ is a $d$-dimensional Brownian motion with zero instantaneous correlation between its $d$ components. The reward for exercise at time $t$ is given by $R_t$. We discretize $t$ into $H + 1$ possible exercise opportunities, using times $t_j = jT/H$ for $j = 0, 1, \ldots, H$, where $T$ is the option expiration in years. This procedure translates the problem into our discrete-time, finite-horizon optimal stopping setting: $\sup_{0 \leq \tau \leq H} \mathbb{E}[R_\tau]$. Mirroring Becker et al. [2], we set $K = 100$, $r = 0.05$, $\sigma_m = 0.2$, $\delta_m = 0.1$, $T = 3$, $H = 9$. We generate a dataset of 40,000 sample trajectories of the underlying assets, generating different datasets for various values of $d$ and $s_0 = s^m_0 \; \forall m$. We train models on ten random 50% train-test splits, holding out 20% of the training data as a validation set.

Means and standard deviations of the exercise values achieved by the various methods are reported in Table 1. DNN-OSPG consistently performs best, while the other Markovian DNN methods, DOS and DNN-FQI, also perform well in this setting. Among the non-Markovian RNN-based methods, RRLSM fares the worst due to increased noise from prior time-steps. RNN-OSPG and RNN-FQI remain competitive since they can learn to ignore information from prior time-steps.

Table 1: Bermudan max-call option pricing: average return (standard deviation)

| d | s0 | DOS | DNN-FQI | DNN-OSPG | RNN-FQI | RNN-OSPG | RRLSM |
|---|---|---|---|---|---|---|---|
| 20 | 90 | 37.08 (0.09) | 36.98 (0.29) | 37.20 (0.08) | 36.98 (0.32) | 37.21 (0.07) | 35.99 (0.05) |
| 20 | 100 | 50.86 (0.08) | 50.92 (0.05) | 51.02 (0.08) | 50.76 (0.22) | 50.99 (0.09) | 49.68 (0.06) |
| 20 | 110 | 64.73 (0.11) | 64.75 (0.22) | 64.91 (0.10) | 64.67 (0.17) | 64.84 (0.10) | 63.42 (0.06) |
| 50 | 90 | 53.30 (0.08) | 53.38 (0.12) | 53.46 (0.08) | 53.43 (0.12) | 53.36 (0.11) | 51.87 (0.09) |
| 50 | 100 | 68.87 (0.12) | 68.91 (0.17) | 69.06 (0.11) | 68.98 (0.11) | 68.93 (0.07) | 67.37 (0.09) |
| 50 | 110 | 84.50 (0.13) | 84.45 (0.28) | 84.65 (0.11) | 84.62 (0.19) | 84.52 (0.16) | 82.85 (0.10) |
| 100 | 90 | 65.83 (0.10) | 65.94 (0.13) | 66.02 (0.14) | 65.78 (0.15) | 65.80 (0.16) | 64.56 (0.08) |
| 100 | 100 | 82.79 (0.16) | 82.88 (0.17) | 82.96 (0.13) | 82.82 (0.14) | 82.74 (0.17) | 81.41 (0.08) |
| 100 | 110 | 99.73 (0.16) | 99.86 (0.12) | 99.93 (0.18) | 99.74 (0.20) | 99.70 (0.14) | 98.28 (0.11) |
| 200 | 90 | 78.23 (0.15) | 78.40 (0.16) | 78.43 (0.12) | 78.11 (0.16) | 78.20 (0.09) | 77.28 (0.07) |
| 200 | 100 | 96.57 (0.16) | 96.68 (0.13) | 96.79 (0.08) | 96.21 (0.28) | 96.44 (0.13) | 95.53 (0.08) |
| 200 | 110 | 114.87 (0.10) | 115.03 (0.25) | 115.10 (0.07) | 114.55 (0.27) | 114.74 (0.23) | 113.78 (0.08) |
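As an illustration of the data-generation step for this experiment (a minimal sketch under the independent-asset Black-Scholes assumptions stated above, not the authors' code), the snippet below simulates asset paths per equation (14) and returns the discounted max-call rewards at the $H+1$ exercise dates:

```python
import numpy as np

def simulate_max_call(n_paths, d, s0, K=100.0, r=0.05, delta=0.1,
                      sigma=0.2, T=3.0, H=9, seed=0):
    """Simulate Bermudan max-call trajectories and rewards (eq. (14))."""
    rng = np.random.default_rng(seed)
    dt = T / H
    t = np.arange(H + 1) * dt                                   # exercise dates t_j
    dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, H, d))
    W = np.concatenate([np.zeros((n_paths, 1, d)), np.cumsum(dW, axis=1)], axis=1)
    drift = (r - delta - 0.5 * sigma ** 2) * t[None, :, None]
    S = s0 * np.exp(drift + sigma * W)                          # asset prices, (N, H+1, d)
    payoff = np.maximum(S.max(axis=2) - K, 0.0)                 # max-call payoff
    R = np.exp(-r * t)[None, :] * payoff                        # discounted reward R_{t_j}
    return S, R
```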
Stopping a non-Markovian fractional Brownian motion: We mimic the scenario considered in [2]. A fractional Brownian motion [25], $(W^h_t)_{t \geq 0}$, is a centered Gaussian process with covariance structure determined by its Hurst parameter $h \in (0, 1]$ as $\mathbb{E}[W^h_t W^h_s] = 0.5(t^{2h} + s^{2h} - |t - s|^{2h})$. The increments of $W^h$ are positively correlated for $h \in (0.5, 1]$, negatively correlated for $h \in (0, 0.5)$, and uncorrelated (reducing to a standard Brownian motion) for $h = 0.5$. We discretize $t$ in the range $[0, 1]$ into time-steps $t_j = j/H$ for $j = 0, 1, \ldots, H$, and seek to solve the optimal stopping problem $\sup_{0 \leq \tau \leq H} \mathbb{E}[W^h_\tau]$. Thus, the reward function is the identity mapping in this case. For each distinct setting of the Hurst parameter $h$, we generate 40,000 trajectories and train models on ten random 50% train-test splits, holding out 20% of the training data as a validation set.

Figure 2 plots the average returns achieved by competing methods for various values of $h$. RNN-OSPG dominates competing methods in this non-Markovian setting across all values of $h$. Also note the strong outperformance of DNN-OSPG, which fits an optimal policy directly, over the other Markovian methods (DOS and DNN-FQI), which propagate approximation errors due to recursive estimation of value functions. DOS-ES seeks to adapt to non-Markovian settings by extending its state space with process history. This extension, in turn, explodes the parameter space and training times (see Appendix D, Table 6) while reducing sample efficiency. RRLSM avoids this issue by using an RNN but, in turn, trades away model flexibility (since the RNN layer is not trainable). The performance of DOS-ES and RRLSM is quite close in this scenario. Note that even though an RNN is used, the performance of RNN-FQI is significantly worse than that of RNN-OSPG for positive values of h.

[Figure 2: Fractional Brownian motion — mean reward at stopping time vs. Hurst parameter h for DOS, DNN-FQI, DNN-OSPG, DOS-ES, RRLSM, RNN-FQI, and RNN-OSPG.]

Early-stopping a sequential multi-class classifier: This is a non-Markovian setting where we are interested in early stopping a sequential multi-class time-series classifier. The longer one delays classification, the easier the task becomes, since every new time-step reveals information about the class of the series. When deciding on the class of a given series, there is therefore a tradeoff between the number of observations made and classification accuracy. Early stopping of a classifier is an optimal stopping problem with costs (instead of rewards), $\inf_\tau \mathbb{E}[C_\tau]$, using a cost process $\{C_j\}$ with $C_j = [1 - \max_k P(Y = k \mid \mathbf{S}_j)] + \alpha \frac{j}{H}$. Here, $Y$ represents the actual class of the time series, with $\mathbf{S}_j$ representing the process history of the time series until time $j$. The first term represents the probability of a classification error at time $j$, while the second term imposes a cost for delaying the classification decision, controlled by $\alpha \in \mathbb{R}_+$. In practice, the true probabilities $P(Y = k \mid \mathbf{S}_j)$ are unknown and must be approximated by training a classifier to obtain a cost model. This approach is related, but not equivalent, to the early classification literature [26, 13, 9, 1], where the goal is to learn the classifier and the stopping decision jointly.
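As a small illustration (our own sketch, with $\alpha = 0.1$ as an arbitrary illustrative value, not the paper's setting), the cost process above can be assembled from a classifier's per-step class posteriors as follows:

```python
import numpy as np

def classification_stopping_costs(probs, alpha=0.1):
    """Compute C_j = [1 - max_k P(Y=k | S_j)] + alpha * j / H for every step.

    probs: (N, H+1, K) per-step class posteriors produced by a sequential classifier.
    Returns an (N, H+1) matrix of stopping costs.
    """
    N, H1, _ = probs.shape
    delay = alpha * np.arange(H1) / (H1 - 1)            # alpha * j / H
    return (1.0 - probs.max(axis=2)) + delay[None, :]
```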
Since our intent in this paper is to solve the optimal stopping problem, not to optimize the cost/reward models, we train a simple baseline RNN classifier that learns parameterized approximations $P(Y = k \mid \mathbf{S}_j, \zeta) \approx P(Y = k \mid \mathbf{S}_j) \; \forall j, k$, with parameters $\zeta$. A limitation of this approach is that we are constrained by the quality of the trained classifier. Our classifier is an RNN with a single 20-unit hidden layer connected to a final time-distributed dense layer with softmax activation, which produces a probability distribution over classes at each time-step. We train this model to minimize negative log-likelihood (NLL) to produce a calibrated classifier. There are two possibilities for the cost model. Either we accept the probabilities estimated by the classifier as the true probabilities, yielding $C^{est}_j = [1 - \max_k P(Y = k \mid \mathbf{S}_j, \zeta)] + \alpha \frac{j}{H}$, or we use the empirical misclassification error, yielding $C^{emp}_j = \mathbb{I}(Y \neq \arg\max_k P(Y = k \mid \mathbf{S}_j, \zeta)) + \alpha \frac{j}{H}$. The latter is the more robust choice, and is the evaluation metric used, since the stopping decisions will then not be blind to classifier calibration errors. However, this setting exposes a significant limitation of value-function methods (backward induction and FQI), since they require the cost model to be available at inference time. Thus, they can only use $C^{est}_j$ and not $C^{emp}_j$, since during inference the empirical errors of the classifier are not revealed until a decision to stop has been made.

We select 17 multi-class time-series classification datasets from the UCR time-series repository [11] (see Appendix D for details of the datasets and the selection process). We use ten random 70%/30% (train/test) data splits to train all models, holding out 30% of the training data as a validation set. Figure 3(b) reports the average stopping costs of the policies learned by the various methods on the test set. RNN-OSPG outperforms the backward-induction and FQI methods on 15 of the 17 datasets.

[Figure 3: Stopping a multi-class classifier — average cost at stopping time for RNN-FQI, RRLSM, and RNN-OSPG on UCR datasets including Chlorine Concentration, Electric Devices, Fifty Words, Insect Wingbeat Sound, Medical Images, Melbourne Pedestrian, Mixed Shapes Regular Train, Non-Invasive Fetal ECG Thorax2, Star Light Curves, UWave Gesture Library X, and Word Synonyms.]

6 Conclusions

We introduce a policy gradient algorithm well suited to learning RNN-based optimal stopping policies that are effective in non-Markovian settings. Our optimal stopping policy gradient (OSPG) algorithm leverages a new Bayesian network formulation of non-Markovian state-action-reward trajectories specific to optimal stopping. This formulation leads to an offline policy gradient algorithm in which Bayes net inference eliminates inefficient Monte Carlo policy rollouts. OSPG outperforms competing methods in non-Markovian settings, where existing methods must either deal with the curse of dimensionality (from extended state spaces) or the curse of non-Markovianity (from recursive Markovian approximation). A limitation of OSPG is that, since it uses a temporal loss defined over complete trajectories, it may not be suitable for scenarios with only a few extremely long trajectories or when trajectories are partially observed.

Acknowledgments

We acknowledge Jerry Liu, M. Anthony Lewis, Matt Ellis, Scott Hallworth, and Jennifer Hill at HP Inc. for supporting and sponsoring this work. Finally, we thank all the reviewers for their insightful comments that have helped improve the paper.

References

[1] Youssef Achenchabe, Alexis Bondu, Antoine Cornuéjols, and Asma Dachraoui. Early classification of time series. Machine Learning, 110:1481-1504, 2021.
[2] Sebastian Becker, Patrick Cheridito, and Arnulf Jentzen. Deep optimal stopping. J. Mach. Learn. Res., 20:74:1-74:25, 2019.
[3] R. Bellman and Rand Corporation. Dynamic Programming. Rand Corporation Research Study.
Princeton University Press, 1959.
[4] D. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
[5] Bruno Bouchard and Xavier Warin. Monte-Carlo valuation of American options: facts and new algorithms to improve existing methods. Numerical Methods in Finance, pages 215-255, 2012.
[6] Siddharth Chandak, Pratik Shah, Vivek S. Borkar, and Parth Dodhia. Reinforcement learning in non-Markovian environments. CoRR, abs/2211.01595, 2023.
[7] Yangang Chen and Justin W. L. Wan. Deep neural network framework based on backward stochastic differential equations for pricing and hedging American options in high dimensions. Quantitative Finance, 21(1):45-67, 2021.
[8] Y. S. Chow, H. Robbins, and D. Siegmund. Great Expectations: The Theory of Optimal Stopping. Houghton Mifflin, 1971.
[9] Asma Dachraoui, Alexis Bondu, and Antoine Cornuéjols. Early classification of time series as a non myopic sequential decision making problem. In Machine Learning and Knowledge Discovery in Databases, pages 433-447. Springer International Publishing, 2015.
[10] Niranjan Damera Venkata and Chiranjib Bhattacharyya. When to intervene: Learning optimal intervention policies for critical events. In Advances in Neural Information Processing Systems, volume 35, pages 30114-30126, 2022.
[11] Hoang Anh Dau, Eamonn Keogh, Kaveh Kamgar, Chin-Chia Michael Yeh, Yan Zhu, Shaghayegh Gharghabi, Chotirat Ann Ratanamahatana, Yanping, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, Gustavo Batista, and Hexagon-ML. The UCR time series classification archive, October 2018. https://www.cs.ucr.edu/~eamonn/time_series_data_2018/.
[12] Peter Dayan and Geoffrey E. Hinton. Using expectation-maximization for reinforcement learning. Neural Computation, 9:271-278, 1997.
[13] Don Dennis, Chirag Pabbaraju, Harsha Vardhan Simhadri, and Prateek Jain. Multiple instance learning for efficient sequential data classification on resource-constrained devices. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pages 10976-10987, 2018.
[14] T. S. Ferguson. Mathematical Statistics: A Decision Theoretic Approach. Probability and Mathematical Statistics. Academic Press, 1967.
[15] Tom Fischer. Stopping times are hitting times: a natural representation. Statistics & Probability Letters, 2011.
[16] Calypso Herrera, Florian Krach, Pierre Ruyssen, and Josef Teichmann. Optimal stopping via randomized neural networks. CoRR, abs/2104.13669, 2021.
[17] Daniel Jiang and Warren Powell. An approximate dynamic programming algorithm for monotone value functions. Operations Research, 63, 2014.
[18] P. E. Kloeden and E. Platen. Numerical Solution of Stochastic Differential Equations. Applications of Mathematics: Stochastic Modelling and Applied Probability. Springer, 1992.
[19] Michael Kohler, Adam Krzyżak, and Nebojsa Todorovic. Pricing of high-dimensional American options by neural networks. Mathematical Finance, 2010.
[20] Bernard Lapeyre and Jérôme Lelong. Neural network regression for Bermudan option pricing. Monte Carlo Methods and Applications, 2021.
[21] Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only converges to minimizers.
In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir, editors, 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 1246-1257, Columbia University, New York, New York, USA, 23-26 June 2016. PMLR.
[22] Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. CoRR, abs/1805.00909, 2018.
[23] Francis A. Longstaff and Eduardo S. Schwartz. Valuing American options by simulation: A simple least-squares approach. Review of Financial Studies, 2001.
[24] V. Makis and Xiaoyue Jiang. Optimal replacement under partial observations. Mathematics of Operations Research, 28:382-394, 2003.
[25] Benoit B. Mandelbrot and John W. Van Ness. Fractional Brownian motions, fractional noises and applications. SIAM Review, 10(4):422-437, 1968.
[26] C. Martinez, G. Perrin, E. Ramasso, and M. Rombaut. A deep reinforcement learning approach for early classification of time series. In 2018 26th European Signal Processing Conference (EUSIPCO), pages 2030-2034, 2018.
[27] Radford M. Neal and Geoffrey E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants, pages 355-368. Springer Netherlands, Dordrecht, 1998.
[28] Goran Peskir and Albert Shiryaev. Optimal Stopping and Free-Boundary Problems. Lectures in Mathematics ETH Zürich. Springer, Dordrecht, 2006.
[29] Jan Peters and Stefan Schaal. Policy gradient methods for robotics. In 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2006.
[30] Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th International Conference on Machine Learning (ICML '07), pages 745-750, 2007.
[31] H. V. Poor. An Introduction to Signal Detection and Estimation. Springer Texts in Electrical Engineering. Springer New York, 2013.
[32] H. V. Poor and O. Hadjiliadis. Quickest Detection. Cambridge University Press, 2008.
[33] W. B. Powell. Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley Series in Probability and Statistics. Wiley, 2011.
[34] Martin Riedmiller. Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. In Machine Learning: ECML 2005, pages 317-328. Springer Berlin Heidelberg, 2005.
[35] A. N. Shiryaev. Optimal Stopping Rules. Stochastic Modelling and Applied Probability. Springer Berlin Heidelberg, 2007.
[36] A. N. Shiryaev. Stochastic Disorder Problems. Probability Theory and Stochastic Modelling. Springer International Publishing, 2019.
[37] Justin Sirignano and Konstantinos Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics, 375:1339-1364, 2018.
[38] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, volume 12, 1999.
[39] John Tsitsiklis and Benjamin Van Roy. Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. IEEE Transactions on Automatic Control, 44:1840-1851, 1997.
[40] John Tsitsiklis and Benjamin Van Roy. Regression methods for pricing complex American-style options. IEEE Transactions on Neural Networks, 12(4):694-703, 2001.
[41] Steven D. Whitehead and Long-Ji Lin. Reinforcement learning of non-Markov decision processes.
Artificial Intelligence, 73(1):271-306, 1995.
[42] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256, May 1992.
[43] Huizhen Yu and Dimitri P. Bertsekas. Q-learning algorithms for optimal stopping based on least squares. European Control Conference, 2007.

Deep Recurrent Optimal Stopping: Supplementary Material

We provide an outline of the supplement that complements our paper.

Appendix A: provides proofs of all theorems in the paper.
Appendix B: considers formulations of stopping problems specified in terms of costs instead of rewards and shows how to transform these into a consistent reward formulation.
Appendix C: provides discussion and pseudocode to implement the temporal loss used to learn fitted Q-iteration policies for DNN-FQI and RNN-FQI.
Appendix D: includes further details and discussion of the experiments, including model hyper-parameter settings, numerical results with confidence intervals, details of the compute environment, model sizes, and training and inference times. In Appendix D.3, we also include a new experiment that compares OSPG to state-of-the-art PDE benchmarks in the pricing of American options, which is a continuous-time optimal stopping problem.
Appendix E: includes a treatment of baseline subtraction in the context of optimal stopping policy gradients (OSPG).

A Appendix A: Proofs

A.1 Proof of Lemma 3.1

Proof. The Lemma follows from the Bayes net trajectory model of Figure 1(a) and the special structure of OS action trajectories: $A_\tau = 1$ and $A_n = 0, \; n < \tau$. For any OS state-action trajectory, we have:

$$P(\mathbf{A}_\tau, \mathbf{S}_H) = \underbrace{P(S_0) \prod_{j=1}^{H} P(S_j \mid \mathbf{S}_{j-1})}_{P(\mathbf{S}_H)} \prod_{n=0}^{\tau} P(A_n \mid \mathbf{S}_n) = P(\mathbf{S}_H) \prod_{n=0}^{\tau} P(A_n \mid \mathbf{S}_n) \tag{15}$$

Therefore, conditioning on the state trajectory, we have $P(\mathbf{A}_\tau \mid \mathbf{S}_H) = \prod_{n=0}^{\tau} P(A_n \mid \mathbf{S}_n)$. Note that by the structure of finite-horizon OS trajectories, $\mathbf{A}_\tau$ is a sequence of continue actions terminated by a stop action at $\tau$. Thus there is a bijective mapping between stopping times and complete action trajectories, given by:

$$\kappa : \{0, 1, 2, \ldots, H\} \mapsto \{1, 01, 001, \ldots, \underbrace{00\cdots0}_{H \text{ 0s}}1\} \tag{16}$$

So $P(\mathbf{A}_\tau \mid \mathbf{S}_H) = P(\kappa(\tau) \mid \mathbf{S}_H) = P(\tau \mid \mathbf{S}_H)$. Consider a trajectory stopping at $\tau = j$ and recall that our stochastic stopping policy is defined as $\varphi_j(\mathbf{S}_j) := P(A_j = 1 \mid \mathbf{S}_j)$. Thus, if $j = 0$:

$$P(\tau = 0 \mid \mathbf{S}_H) = P(\mathbf{A}_0 \mid \mathbf{S}_H) = P(A_0 = 1 \mid \mathbf{S}_0) = \varphi_0(\mathbf{S}_0) \tag{17}$$

If $0 < j < H$:

$$P(\tau = j \mid \mathbf{S}_H) = P(A_0 = 0, \ldots, A_{j-1} = 0, A_j = 1 \mid \mathbf{S}_H) = P(A_j = 1 \mid \mathbf{S}_j) \prod_{n=0}^{j-1} P(A_n = 0 \mid \mathbf{S}_n) = \varphi_j(\mathbf{S}_j) \prod_{n=0}^{j-1} (1 - \varphi_n(\mathbf{S}_n)) \tag{18}$$

Finally, if $j = H$:

$$P(\tau = H \mid \mathbf{S}_H) = P(A_0 = 0, \ldots, A_{H-1} = 0, A_H = 1 \mid \mathbf{S}_H) = P(A_H = 1 \mid \mathbf{S}_H) \prod_{n=0}^{H-1} P(A_n = 0 \mid \mathbf{S}_n) = \prod_{n=0}^{H-1} (1 - \varphi_n(\mathbf{S}_n)) \tag{19}$$

where we have used the fact that in the finite-horizon setting $\varphi_H(\mathbf{S}_H) := 1$ by definition. We may verify that $\sum_{j=0}^{H} P(\tau = j \mid \mathbf{S}_H) = 1$, since telescoping gives:

$$1 - \sum_{j=0}^{H-1} P(\tau = j \mid \mathbf{S}_H) = (1 - \varphi_0(\mathbf{S}_0)) - \sum_{j=1}^{H-1} \varphi_j(\mathbf{S}_j) \prod_{n=0}^{j-1} (1 - \varphi_n(\mathbf{S}_n)) = (1 - \varphi_0(\mathbf{S}_0))(1 - \varphi_1(\mathbf{S}_1)) - \sum_{j=2}^{H-1} \varphi_j(\mathbf{S}_j) \prod_{n=0}^{j-1} (1 - \varphi_n(\mathbf{S}_n)) = \cdots = \prod_{n=0}^{H-1} (1 - \varphi_n(\mathbf{S}_n)) = P(\tau = H \mid \mathbf{S}_H)$$

Further, since $\tau$ is a stopping-time random variable, we have:

$$P(\tau = j \mid \mathbf{S}_H) = P(\tau = j \mid \mathbf{S}_j) := \psi_j(\mathbf{S}_j) \tag{20}$$

This is because stopping-time random variables have the property that $\mathbb{1}(\tau = j)$ is a function of $\mathbf{S}_j$, so we can determine whether $\tau = j$ by considering only $\mathbf{s}_j$, making the event $\{\tau = j\}$ conditionally independent of $S_{j+k}, \; k > 0$, given $\mathbf{S}_j$ [35]. This completes the proof.

A.2 Proof of Theorem 3.1

Proof. We apply Jensen's inequality to the objective $J_{WML}(\theta)$ to obtain a lower bound:

$$J_{WML}(\theta) = \sum_{i=1}^{N} w_i \log \left( \sum_{j=0}^{H} \psi^\theta_j(\mathbf{s}_{ij}) \, \bar{r}_{ij} \right) \geq \sum_{i=1}^{N} \sum_{j=0}^{H} w_i q_{ij} \log \left[ \frac{\psi^\theta_j(\mathbf{s}_{ij}) \, \bar{r}_{ij}}{q_{ij}} \right] := J(Q, \theta) \tag{21}$$

where $Q = [q_{ij}]$ is any row-stochastic matrix satisfying $q_{ij} \geq 0 \; \forall i, j$ and $\sum_{j=0}^{H} q_{ij} = 1$.
Starting with a given $\theta^{(0)}$, the lower bound is maximized (Jensen's inequality becomes an equality) when $Q = Q^{(0)} := [q^{(0)}_{ij}]$ such that:

$$q^{(0)}_{ij} = \frac{\psi^{\theta^{(0)}}_j(\mathbf{s}_{ij}) \, \bar{r}_{ij}}{\sum_{j'=0}^{H} \psi^{\theta^{(0)}}_{j'}(\mathbf{s}_{ij'}) \, \bar{r}_{ij'}} \tag{22}$$

This is the E-step in Theorem 3.1. Note that the E-step does not change the objective, so $J_{WML}(\theta^{(0)}) = J(Q^{(0)}, \theta^{(0)})$. By ignoring terms that do not depend on $\theta$, maximizing $J(Q^{(0)}, \theta)$ w.r.t. $\theta$ can be seen to be equivalent to maximizing

$$\sum_{i=1}^{N} \sum_{j=0}^{H} w_i q^{(0)}_{ij} \log \psi^\theta_j(\mathbf{s}_{ij}) \tag{23}$$

This is the M-step. By the lower bound established in equation (21), and since the M-step maximizes $J(Q^{(0)}, \theta)$ to obtain $\theta^{(1)}$, we have $J_{WML}(\theta^{(1)}) \geq J(Q^{(0)}, \theta^{(1)}) \geq J(Q^{(0)}, \theta^{(0)}) = J_{WML}(\theta^{(0)})$. Therefore, a round of E-M results in either an increase or no change in the objective. Since the WML objective is upper-bounded by $N \log H$, the monotone increasing sequence $J_{WML}(\theta^{(k)})$ converges to a local maximum of the objective.

A.3 Proof of Corollary 3.1.1

Proof. It suffices to show that the W-step also increases the objective $J_{WML}(\tilde{w}, \theta)$. We maximize $J_{WML}(\tilde{w}, \theta)$ w.r.t. $\tilde{w}$, subject to the constraint $\sum_{i=1}^{N} \tilde{w}_i = 1$, resulting in the following Lagrangian:

$$\Lambda(\tilde{w}, \lambda) = \sum_{i=1}^{N} \tilde{w}_i \log \left( \sum_{j=0}^{H} \psi^\theta_j(\mathbf{s}_{ij}) \, \bar{r}_{ij} \right) - \sum_{i=1}^{N} \tilde{w}_i \log \frac{\tilde{w}_i}{\tilde{r}_i} + \lambda \left( 1 - \sum_{i=1}^{N} \tilde{w}_i \right) \tag{24}$$

Taking partial derivatives w.r.t. $\tilde{w}_i$ and $\lambda$ and setting them to zero, we have:

$$\log \left( \sum_{j=0}^{H} \psi^\theta_j(\mathbf{s}_{ij}) \, \bar{r}_{ij} \right) + \log \tilde{r}_i - (1 + \log \tilde{w}_i) - \lambda = 0 \tag{25}$$

Noting that $\tilde{r}_i := \frac{\sum_{j=0}^{H} r_{ij}}{\sum_{i'=1}^{N} \sum_{j=0}^{H} r_{i'j}}$ and $\bar{r}_{ij} := \frac{r_{ij}}{\sum_{j'=0}^{H} r_{ij'}}$, and simplifying, we have:

$$1 + \lambda = \log \left[ \frac{\sum_{j=0}^{H} \psi^\theta_j(\mathbf{s}_{ij}) r_{ij}}{\bar{r} \, \tilde{w}_i} \right]$$

where $\bar{r} = \sum_{i=1}^{N} \sum_{j=0}^{H} r_{ij}$. Exponentiating both sides and cross-multiplying results in:

$$\left[ \bar{r} \exp(1 + \lambda) \right] \tilde{w}_i = \sum_{j=0}^{H} \psi^\theta_j(\mathbf{s}_{ij}) r_{ij} \tag{28}$$

Summing over $i$, we have:

$$\bar{r} \exp(1 + \lambda) = \sum_{i=1}^{N} \sum_{j=0}^{H} \psi^\theta_j(\mathbf{s}_{ij}) r_{ij} \tag{29}$$

Substituting back into equation (28), we finally have:

$$\tilde{w}_i = \frac{\sum_{j=0}^{H} \psi^\theta_j(\mathbf{s}_{ij}) r_{ij}}{\sum_{i'=1}^{N} \sum_{j=0}^{H} \psi^\theta_j(\mathbf{s}_{i'j}) r_{i'j}} \tag{30}$$

Thus the maximizing $\tilde{w}$ is exactly the W-step. Therefore, the form of the W-step ensures $J_{WML}(\tilde{w}^{(k)}, \theta^{(k)}) \geq J_{WML}(\tilde{w}^{(k-1)}, \theta^{(k)}) \geq J_{WML}(\tilde{w}^{(k-1)}, \theta^{(k-1)})$. Thus we have a monotone increasing and bounded sequence $J_{WML}(\tilde{w}^{(k)}, \theta^{(k)})$ converging to a local maximum of $J_{WML}(\tilde{w}, \theta)$.

A.4 Proof of Theorem 4.1

Proof. We use the log-derivative trick [42]: $\nabla_\theta \psi^\theta_j(\mathbf{s}_j) = \psi^\theta_j(\mathbf{s}_j) \nabla_\theta \log \psi^\theta_j(\mathbf{s}_j)$. So:

$$\nabla_\theta J_{OS}(\theta) = \mathbb{E}_{\mathbf{s}_H \sim P(\mathbf{s}_H)} \left[ \sum_{j=0}^{H} r_j \nabla_\theta \psi^\theta_j(\mathbf{s}_j) \right] = \mathbb{E}_{\mathbf{s}_H \sim P(\mathbf{s}_H)} \left[ \sum_{j=0}^{H} r_j \psi^\theta_j(\mathbf{s}_j) \nabla_\theta \log \psi^\theta_j(\mathbf{s}_j) \right]$$

A.5 Proof of Proposition 4.1

Proof. First, for a given $\theta^{(k)}$, substituting for $q^{(k)}_{ij}$ from Theorem 3.1 and $w^{(k)}_i$ from the W-step yields:

$$w^{(k)}_i q^{(k)}_{ij} = \left[ \frac{\sum_{n=0}^{H} \psi^{\theta^{(k)}}_n(\mathbf{s}_{in}) r_{in}}{\sum_{m=1}^{N} \sum_{n=0}^{H} \psi^{\theta^{(k)}}_n(\mathbf{s}_{mn}) r_{mn}} \right] \left[ \frac{\psi^{\theta^{(k)}}_j(\mathbf{s}_{ij}) \, \bar{r}_{ij}}{\sum_{n=0}^{H} \psi^{\theta^{(k)}}_n(\mathbf{s}_{in}) \, \bar{r}_{in}} \right] = \left[ \psi^{\theta^{(k)}}_j(\mathbf{s}_{ij}) r_{ij} \right] \underbrace{\left[ \frac{1}{\sum_{m=1}^{N} \sum_{n=0}^{H} \psi^{\theta^{(k)}}_n(\mathbf{s}_{mn}) r_{mn}} \right]}_{z^{(k)}}$$

Therefore, with $v^{(k)}_{ij} := \psi^{\theta^{(k)}}_j(\mathbf{s}_{ij}) r_{ij}$, we may write the M-step objective as:

$$J^{(k)}_M(\theta) = z^{(k)} \sum_{i=1}^{N} \sum_{j=0}^{H} v^{(k)}_{ij} \log \psi^\theta_j(\mathbf{s}_{ij}) \tag{35}$$

Since $z^{(k)} > 0$ does not depend on $\theta$, maximizing $J^{(k)}_M(\theta)$ is equivalent to maximizing:

$$\tilde{J}^{(k)}_M(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=0}^{H} v^{(k)}_{ij} \log \psi^\theta_j(\mathbf{s}_{ij}) \tag{36}$$

Dropping the iteration index $(k)$ and defining constants $v_{ij} := \psi^\theta_j(\mathbf{s}_{ij}) r_{ij}$ calculated with the most recent value of $\theta$ and held constant, the M-step objective is equivalent to the following objective:

$$\tilde{J}_M(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=0}^{H} v_{ij} \log \psi^\theta_j(\mathbf{s}_{ij}) \tag{37}$$

Taking the gradient w.r.t. $\theta$, we have:

$$\nabla_\theta \tilde{J}_M(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=0}^{H} v_{ij} \nabla_\theta \log \psi^\theta_j(\mathbf{s}_{ij}) \tag{38}$$

Now, substituting for $v_{ij}$, we have:

$$\nabla_\theta \tilde{J}_M(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=0}^{H} r_{ij} \psi^\theta_j(\mathbf{s}_{ij}) \nabla_\theta \log \psi^\theta_j(\mathbf{s}_{ij}) \tag{39}$$

This can be expressed as an expectation over sample trajectories as:

$$\nabla_\theta \tilde{J}_M(\theta) = \mathbb{E}_{\mathbf{s}_H \sim P(\mathbf{s}_H)} \left[ \sum_{j=0}^{H} r_j \psi^\theta_j(\mathbf{s}_j) \nabla_\theta \log \psi^\theta_j(\mathbf{s}_j) \right]$$

Thus the gradient of the transformed M-step objective is identical to the optimal stopping policy gradient (OSPG).
Therefore, if we perform the E-step, the W-step, and a single gradient update, the sequence of policy parameters $\theta^{(n)}$ will exactly correspond to the updated OSPG policy parameters. We may therefore appeal to the literature on incremental, partial M-step E-M algorithms [27] and gradient descent [21] to conclude that, for a small enough step-size that increases $\tilde{J}_M(\theta)$, the policy updates converge to a local maximum of both $J_{WML}(\tilde{w}, \theta)$ and $J_{OS}(\theta)$.

A.6 Proof of Corollary 4.1.1

Proof. From the proof of Proposition 4.1, the M-step objective is equivalent to the following objective:

$$\tilde{J}_M(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=0}^{H} v_{ij} \log \psi^\theta_j(\mathbf{s}_{ij}) \tag{40}$$

Now, substituting for $\psi^\theta_j(\mathbf{s}_{ij})$ in terms of the stopping policy using the trajectory reparameterization lemma (Lemma 3.1), and noting that $\varphi^\theta_H \equiv 1$ contributes no $\theta$-dependent term:

$$\tilde{J}_M(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ v_{i0} \log \varphi^\theta_0(\mathbf{s}_{i0}) + \sum_{j=1}^{H-1} v_{ij} \left( \log \varphi^\theta_j(\mathbf{s}_{ij}) + \sum_{n=0}^{j-1} \log(1 - \varphi^\theta_n(\mathbf{s}_{in})) \right) + v_{iH} \sum_{n=0}^{H-1} \log(1 - \varphi^\theta_n(\mathbf{s}_{in})) \right]$$

Collecting, for each time-step $j$, the coefficient of $\log(1 - \varphi^\theta_j(\mathbf{s}_{ij}))$, which receives a contribution $v_{in}$ from every later stopping time $n > j$, gives:

$$\tilde{J}_M(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=0}^{H-1} \left[ v_{ij} \log \varphi^\theta_j(\mathbf{s}_{ij}) + k_{ij} \log(1 - \varphi^\theta_j(\mathbf{s}_{ij})) \right]$$

where $k_{ij} := \sum_{n=j+1}^{H} v_{in}$. Since, by Proposition 4.1, $\nabla_\theta J_{OS}(\theta) = \nabla_\theta \tilde{J}_M(\theta)$, taking the gradient of the above expression yields:

$$\nabla_\theta J_{OS}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=0}^{H-1} \left[ \frac{v_{ij}(1 - \varphi^\theta_j(\mathbf{s}_{ij})) - k_{ij}\,\varphi^\theta_j(\mathbf{s}_{ij})}{\varphi^\theta_j(\mathbf{s}_{ij})(1 - \varphi^\theta_j(\mathbf{s}_{ij}))} \right] \nabla_\theta \varphi^\theta_j(\mathbf{s}_{ij}) \tag{41}$$

This completes the proof of Corollary 4.1.1.

B Dealing with costs instead of rewards

Although one could use the negative of the cost as a reward, this is inconsistent with our Bayesian net model, since we require positive rewards to define the reward augmented trajectory model. The following result addresses this issue by transforming a problem with costs into one with a suitable reward specification.

Proposition B.1. Given costs $c_{ij} \geq 0$, the following two problems are equivalent:

$$\arg\min_\theta \frac{1}{N} \sum_{i=1}^{N} \sum_{j=0}^{H} c_{ij} \, \psi^\theta_j(\mathbf{s}_{ij}) \qquad \text{and} \qquad \arg\max_\theta \frac{1}{N} \sum_{i=1}^{N} c_i \sum_{j=0}^{H} \left( 1 - \frac{c_{ij}}{\sum_{n=0}^{H} c_{in}} \right) \psi^\theta_j(\mathbf{s}_{ij})$$

Proof. Starting with the cost minimization problem, we may write:

$$\arg\min_\theta \frac{1}{N} \sum_{i=1}^{N} \sum_{j=0}^{H} c_{ij} \, \psi^\theta_j(\mathbf{s}_{ij}) = \arg\min_\theta \frac{1}{N} \sum_{i=1}^{N} c_i \sum_{j=0}^{H} \bar{c}_{ij} \, \psi^\theta_j(\mathbf{s}_{ij}) \tag{42}$$

where $c_i := \sum_{j=0}^{H} c_{ij}$ and $\bar{c}_{ij} = \frac{c_{ij}}{c_i}$. Since $\sum_{j=0}^{H} \psi^\theta_j(\mathbf{s}_{ij}) = 1$, we may subtract the constant $\frac{1}{N} \sum_{i=1}^{N} c_i \sum_{j=0}^{H} \psi^\theta_j(\mathbf{s}_{ij})$ from the objective, yielding:

$$\arg\min_\theta \frac{1}{N} \sum_{i=1}^{N} c_i \sum_{j=0}^{H} (\bar{c}_{ij} - 1) \, \psi^\theta_j(\mathbf{s}_{ij}) = \arg\max_\theta \frac{1}{N} \sum_{i=1}^{N} c_i \sum_{j=0}^{H} (1 - \bar{c}_{ij}) \, \psi^\theta_j(\mathbf{s}_{ij}) = \arg\max_\theta \frac{1}{N} \sum_{i=1}^{N} c_i \sum_{j=0}^{H} \left( 1 - \frac{c_{ij}}{\sum_{n=0}^{H} c_{in}} \right) \psi^\theta_j(\mathbf{s}_{ij})$$

where we have expanded $\bar{c}_{ij}$. This completes the proof.

C Neural fitted Q-iteration approaches: DNN-FQI and RNN-FQI

Fitted Q-iteration methods [39, 40] use the Wald-Bellman equation (WBE) as follows. First, given parameterized function approximations $K^\theta_j(S_j)$, a single Wald-Bellman step is bootstrapped for a batch of trajectories, yielding $\hat{V}_j(\mathbf{s}_{ij}) = \max\{r_{ij}, K^\theta_j(\mathbf{s}_{ij})\}$, $j < H$, $1 \leq i \leq N$. Next, the parameters of the continuation function are fit: $\theta = \arg\min_\theta \sum_{i,j} (K^\theta_j(\mathbf{s}_{ij}) - \hat{V}_{j+1}(\mathbf{s}_{i,j+1}))^2$, and the process is iterated. To provide competent DNN/RNN baselines for fitted Q-iteration (FQI) methods, which are missing in the literature, we introduce a temporal FQI loss (Algorithm 2).

Algorithm 2: Pseudocode for mini-batch computation of the FQI loss
Input: $R := [r_{ij}]$, $K := [K^\theta_j(\mathbf{s}_{ij})]$  {$1 \leq i \leq N_b$, $0 \leq j \leq H$, $N_b$ is the batch size}
$V_{:,0:H-1} = \text{stop-gradient}(\max\{R_{:,0:H-1}, K_{:,0:H-1}\})$  {bootstrap the WBE to get value targets}
$V_{:,H} = R_{:,H}$  {the final target is the reward at the last step}
$J = \text{MSE}(V_{:,1:H}, K_{:,0:H-1})$  {the next-step value is the target for the current continuation function}

D Further details of the experiments

All experiments were performed on a shared server configured with two Intel Xeon Silver 12-core, 2.10 GHz CPUs with 256 GB RAM and equipped with six NVIDIA 2080Ti GPUs.
D Further details of the experiments

All experiments were performed on a shared server configured with 2 Intel Xeon Silver 12-core, 2.10 GHz CPUs with 256 GB RAM and equipped with 6 NVIDIA 2080Ti GPUs. However, experiments were run on a single GPU at a time, and no computation was distributed across GPUs.

D.1 Model hyper-parameter settings

Table 2 shows the general hyper-parameter settings used for all experiments. We apply batch normalization to the inputs and outputs of the activation functions at all hidden layers. Due to the dense correlation structure between assets at each time-step of the American option pricing experiment, we choose the number of hidden units to be greater than the input dimension d. As with RL policy gradients, we may subtract a baseline value [29] to reduce variance. The OSPG algorithm uses a baseline b that does not depend on the time index j, which is sufficient to guarantee an unbiased OSPG estimator (see Appendix E). In our experiments, we use the batch-average reward:

$$b = \frac{1}{N(H+1)}\sum_{i=1}^{N}\sum_{j=0}^{H} r_{ij} \qquad (43)$$

Table 2: model hyper-parameter settings

method                            hyper-parameter       tuned range               value
DOS, DNN-FQI, DNN-OSPG            num hidden layers     n/a                       2
RRLSM, RNN-FQI, RNN-OSPG          num hidden layers     n/a                       1
All models                        hidden layer units    n/a                       20
All models (Am. option pricing)   hidden layer units    n/a                       20 + d
All models                        batch size            n/a                       64
All models (Am. option pricing)   batch size            n/a                       128
All models                        learning rate         {0.01, 0.001, 0.0001}     0.001
All models                        epochs                early stopping            100
All models                        batches/epoch         n/a                       200
All models                        optimizer             n/a                       Adam
RRLSM                             kernel noise std      n/a                       0.0001
RRLSM                             recurrent noise std   n/a                       0.3

D.2 Pricing Bermudan options

Table 3 shows model sizes (in trainable parameters), training times, and inference times (per time-step). Model sizes grow with input dimension except for RRLSM, which uses an RNN with random, non-trainable weights to extract features from the input. The parameter size of DOS is about an order of magnitude higher than that of DNN-FQI and DNN-OSPG, since parameters in backward induction methods like DOS grow linearly with the number of time-steps (parameters are not shareable across time-steps). Training and inference times are also high since individual models must be fit and inferred at each time-step.
Table 3: model sizes and compute times for the Bermudan max-call experiment

method      assets   model-size (params)   mean training time (seconds)   mean time/prediction (µ-seconds)
DOS         20       9,225                 271                            3
DNN-FQI     20       1,047                 23                             5
DNN-OSPG    20       1,047                 29                             7
RRLSM       20       198                   2                              14
RNN-FQI     20       2,807                 32                             8
RNN-OSPG    20       2,807                 37                             8
DOS         50       15,165                289                            3
DNN-FQI     50       1,707                 28                             6
DNN-OSPG    50       1,707                 29                             7
RRLSM       50       198                   2                              14
RNN-FQI     50       4,667                 32                             8
RNN-OSPG    50       4,667                 32                             8
DOS         100      25,065                326                            3
DNN-FQI     100      2,807                 30                             6
DNN-OSPG    100      2,807                 28                             6
RRLSM       100      198                   2                              14
RNN-FQI     100      7,767                 33                             8
RNN-OSPG    100      7,767                 30                             8
DOS         200      44,865                317                            3
DNN-FQI     200      5,007                 30                             6
DNN-OSPG    200      5,007                 27                             6
RRLSM       200      198                   2                              15
RNN-FQI     200      13,967                35                             8
RNN-OSPG    200      13,967                30                             8

Table 4: American geometric-average call option pricing: results, average return (error %)

d     s0    p*        LS [23]           PDE-DGM [37]      PDE-BSDE [7]      DNN-FQI [34]      DNN-OSPG
7     90    5.9021    5.8440 (0.98%)    NA                5.8822 (0.34%)    5.7977 (1.77%)    5.8704 (0.54%)
7     100   10.2591   10.1736 (0.83%)   NA                10.2286 (0.30%)   10.1022 (1.53%)   10.2518 (0.07%)
7     110   15.9878   15.8991 (0.55%)   NA                15.9738 (0.09%)   15.0487 (5.87%)   15.9699 (0.11%)
13    90    5.7684    5.5962 (3.00%)    NA                5.7719 (0.06%)    5.7411 (0.47%)    5.7436 (0.43%)
13    100   10.0984   9.9336 (1.60%)    NA                10.1148 (0.16%)   9.9673 (1.30%)    10.0691 (0.29%)
13    110   15.8200   15.6070 (1.40%)   NA                15.8259 (0.04%)   14.7759 (6.60%)   15.8107 (0.06%)
20    90    5.7137    5.2023 (9.00%)    NA                5.7105 (0.06%)    5.6607 (0.93%)    5.6983 (0.27%)
20    100   10.0326   9.5964 (4.40%)    10.0296 (0.03%)   10.0180 (0.15%)   9.6372 (3.94%)    10.0100 (0.23%)
20    110   15.7513   15.2622 (3.10%)   NA                15.7425 (0.06%)   14.9345 (5.19%)   15.7553 (0.03%)
100   90    5.6322    OOM               NA                5.6154 (0.30%)    5.3858 (4.38%)    5.6211 (0.20%)
100   100   9.9345    OOM               9.9236 (0.11%)    9.9187 (0.16%)    9.3954 (5.43%)    9.8954 (0.40%)
100   110   15.6491   OOM               NA                15.6219 (0.17%)   14.6335 (6.49%)   15.6301 (0.12%)
200   100   9.9222    OOM               9.9004 (0.22%)    9.9088 (0.14%)    9.3772 (5.49%)    9.8991 (0.23%)

D.3 Pricing American options

The scope of the paper is solving discrete-time, finite-horizon, model-free optimal stopping problems. Bermudan options, which have discrete exercise opportunities, are one example application. American options, which are more popular, are based on continuous-time asset price evolution and have a continuum of possible exercise times. One way to convert this to our discrete-time setting is to solve related Bermudan options, which limit exercise opportunities to a fine discrete time grid. State-of-the-art algorithms for pricing American options are based on partial differential equation (PDE) methods. These methods are model-based since they start with a PDE (such as the multidimensional Black-Scholes model) defining process evolution. For example, PDE methods often assume Markovian Black-Scholes dynamics, and the PDEs to be solved require the Black-Scholes model parameters, such as the covariance of the Brownian motion, volatility, risk-free interest rate, and dividend yield. In contrast, model-free methods, such as the FQI and OSPG algorithms, do not use prior information on the evolution dynamics of the underlying stochastic process. Nevertheless, we compare our model-free OSPG method against state-of-the-art PDE methods such as the Deep Galerkin Method (DGM) [37] and the Backward Stochastic Differential Equation (BSDE) method [7] by suitable discretization of the original continuous-time problem. Note that the PDE methods also require discretization of the original PDE (e.g., using the Euler-Maruyama scheme) or random sampling (as used in DGM), but do not end up directly solving a Bermudan option.
We consider the multi-dimensional continuous-time American geometric-average call options with Black-Scholes dynamics considered in [37] and [7]. The payoff of these options depends on the price of d underlying assets with multi-dimensional Black-Scholes dynamics and the strike price K. The dynamics are Markovian, with

$$S^m_t = s^m_0\,\exp\!\big([r - \delta_m - \sigma_m^2/2]\,t + \sigma_m W^m_t\big), \quad m = 1, 2, \ldots, d,$$

and payoff (reward) for exercise at time t given by

$$R_t = e^{-rt}\,\max\!\left(\Big(\prod_{m=1}^{d} S^m_t\Big)^{1/d} - K,\; 0\right).$$

Here $r \in \mathbb{R}$ is the risk-free rate, $s^m_0 \in (0, \infty)$ represents the initial price, $\delta_m \in [0, \infty)$ is the dividend yield, $\sigma_m \in (0, \infty)$ the volatility, and $W$ is a d-dimensional Brownian motion with instantaneous correlation between its d components given by $\mathbb{E}[W^i_t W^j_t] = \rho_{ij}\, t$. We discretize t into $H + 1 = 100$ possible exercise opportunities, using times $t_j = jT/H$ for $j = 0, 1, \ldots, H$, where T is the option expiration duration in years. This yields the stopping problem $\sup_{0 \le \tau \le H} \mathbb{E}[R_\tau]$. The specific option we consider is characterized by the following parameters: $K = 100$, $r = 0.0$, $\sigma_m = 0.25$, $\rho_{ij} = 0.75$ for $i \neq j$, $\delta_m = 0.02$, $T = 2$. The exact price of this option, $p^*$, can be determined semi-analytically for comparison [7]. We generate 10,000 batches (with a batch size of 128) for training and compute option prices on a 3,000-batch test sample. We compare against published results from state-of-the-art model-based PDE baselines, including the Deep Galerkin Method (PDE-DGM) [37] and the Backward Stochastic Differential Equation (PDE-BSDE) method [7]. We also include published results [7] from the industry-standard Longstaff-Schwartz (LS) option pricing algorithm [23], which uses linear approximation with multi-dimensional basis functions and is suitable for low-dimensional settings. Table 4 summarizes the experiment's results. Published results [7] for LS, DGM, and BSDE are included. NA denotes results not reported in the literature, while OOM indicates out of memory. Our model-free DNN-OSPG algorithm compares favorably with state-of-the-art model-based PDE option pricing methods that have prior knowledge of process evolution. Specifically, unlike LS, whose accuracy degrades with the number of correlated assets, and DNN-FQI, whose error reaches up to 6.5%, DNN-OSPG retains excellent performance (< 0.55% error), closely approaching the model-based PDE methods (< 0.35% error). Of course, a key advantage of DNN-OSPG is that it can be used to price options for which no known PDE is available to describe process evolution.
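To make the data-generation step concrete, the following NumPy sketch simulates discretized trajectories and exercise rewards for this option under the Black-Scholes dynamics above. It is an illustrative reconstruction (function and variable names are ours, and the standard discounted geometric-average call payoff is assumed), not the exact code used in the experiments:

```python
import numpy as np

def simulate_geometric_call(N=128, d=20, H=99, T=2.0, s0=100.0, K=100.0,
                            r=0.0, delta=0.02, sigma=0.25, rho=0.75, seed=0):
    """Simulate N trajectories of the d-asset geometric-average call on the exercise
    grid t_j = j*T/H. Returns prices S of shape (N, H+1, d) and rewards R of shape (N, H+1)."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, T, H + 1)                         # t_0, ..., t_H
    dt = T / H
    # Correlated Brownian increments: unit variance, pairwise correlation rho.
    cov = rho * np.ones((d, d)) + (1.0 - rho) * np.eye(d)
    L = np.linalg.cholesky(cov)
    dW = rng.standard_normal((N, H, d)) @ L.T * np.sqrt(dt)
    W = np.concatenate([np.zeros((N, 1, d)), np.cumsum(dW, axis=1)], axis=1)
    drift = (r - delta - 0.5 * sigma ** 2) * t[None, :, None]
    S = s0 * np.exp(drift + sigma * W)                     # asset prices S_t^m
    geo = np.exp(np.log(S).mean(axis=2))                   # (prod_m S_t^m)^(1/d)
    R = np.exp(-r * t)[None, :] * np.maximum(geo - K, 0.0) # discounted exercise reward
    return S, R

# Example: one training batch matching the experiment's batch size (hypothetical usage).
S, R = simulate_geometric_call()
```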
D.4 Stopping a fractional Brownian motion

Table 5 provides numerical values corresponding to Figure 2. RNN-OSPG dominates competing methods across all values of the Hurst parameter in this non-Markovian setting.

Table 5: Stopping a fractional Brownian motion: results, average return (standard deviation)

h      DOS            DNN-FQI        DNN-OSPG       DOS-ES         RRLSM          RNN-FQI        RNN-OSPG
0.05   0.70 (0.01)    0.87 (0.06)    1.14 (0.00)    1.18 (0.01)    1.16 (0.00)    1.24 (0.04)    1.28 (0.01)
0.10   0.54 (0.01)    0.68 (0.04)    0.92 (0.01)    0.93 (0.01)    0.94 (0.00)    0.99 (0.03)    1.03 (0.02)
0.20   0.34 (0.01)    0.42 (0.08)    0.58 (0.00)    0.55 (0.01)    0.57 (0.00)    0.59 (0.04)    0.64 (0.01)
0.30   0.19 (0.00)    0.14 (0.07)    0.29 (0.10)    0.28 (0.01)    0.29 (0.00)    0.27 (0.08)    0.36 (0.01)
0.40   0.10 (0.01)    0.02 (0.03)    0.13 (0.00)    0.09 (0.01)    0.10 (0.00)    0.04 (0.04)    0.15 (0.00)
0.50   0.00 (0.00)    0.00 (0.00)    0.00 (0.00)    0.00 (0.00)    0.00 (0.00)    0.00 (0.00)    0.00 (0.00)
0.60   0.08 (0.01)    0.03 (0.03)    0.10 (0.00)    0.08 (0.01)    0.06 (0.00)    0.03 (0.03)    0.10 (0.03)
0.70   0.16 (0.01)    0.08 (0.08)    0.17 (0.02)    0.18 (0.01)    0.19 (0.00)    0.10 (0.07)    0.20 (0.00)
0.80   0.23 (0.01)    0.03 (0.05)    0.25 (0.00)    0.26 (0.01)    0.27 (0.00)    0.18 (0.09)    0.27 (0.01)
0.90   0.31 (0.01)    0.14 (0.12)    0.28 (0.10)    0.32 (0.01)    0.33 (0.00)    0.21 (0.12)    0.33 (0.00)
0.95   0.35 (0.01)    0.01 (0.05)    0.34 (0.01)    0.36 (0.00)    0.36 (0.00)    0.25 (0.08)    0.36 (0.00)

Table 6: Stopping a fractional Brownian motion: model sizes and compute times

method      model-size (params)   mean training time (seconds)   mean time/prediction (µ-seconds)
DOS         62,900                2250                           26.9
DNN-FQI     651                   21                             0.5
DNN-OSPG    651                   29                             0.5
DOS-ES      276,300               2228                           25.9
RRLSM       2,200                 9                              5.0
RNN-FQI     1,691                 14                             1.1
RNN-OSPG    1,691                 79                             1.1

D.5 Early stopping a sequential multi-class classifier

We start with the 34 datasets, and the corresponding values of α, used in [1] and remove the binary classification datasets. This leaves 17 datasets. However, many of these have very few series or have a large number of classes relative to series size, which might make them unsuitable for training RNN classifiers. Nevertheless, we report results on all 17 datasets. Table 7 reports the experiment's mean and standard-deviation results (over ten random splits) on the 17 UCR time-series datasets. Figure 2 provides a graphical visualization of the same results. N denotes the number of trajectories, H denotes the length of the series, and K denotes the number of classes. α trades off classification accuracy and earliness. RNN-OSPG achieves the best performance on 15 of the 17 datasets.

Table 7: Early stopping a sequential multi-class classifier: results, average cost (standard deviation)

Dataset                       N/H/K/α            RRLSM            RNN-FQI          RNN-OSPG
CBF                           930/128/3/0.8      0.192 (0.017)    0.308 (0.138)    0.186 (0.011)
ChlorineConcentration         4,307/166/3/0.4    0.468 (0.011)    0.523 (0.073)    0.455 (0.003)
Crop                          24,000/46/24/0.06  0.428 (0.005)    0.407 (0.010)    0.389 (0.008)
ECG5000                       5,000/140/5/0.5    0.110 (0.005)    0.217 (0.059)    0.099 (0.006)
ElectricDevices               16,637/96/7/0.1    0.257 (0.015)    0.260 (0.012)    0.228 (0.009)
FaceAll                       2,250/131/14/0.01  0.108 (0.025)    0.100 (0.010)    0.092 (0.017)
FaceUCR                       2,250/131/14/0.5   0.358 (0.035)    0.465 (0.049)    0.328 (0.028)
FiftyWords                    905/270/50/0.5     0.667 (0.027)    0.812 (0.041)    0.634 (0.029)
InsectWingbeatSound           2,200/256/11/1     0.666 (0.064)    0.897 (0.110)    0.646 (0.080)
MedicalImages                 1,141/99/10/0.07   0.296 (0.033)    0.309 (0.031)    0.263 (0.038)
MelbournePedestrian           3,633/24/10/0.8    0.591 (0.065)    0.597 (0.068)    0.487 (0.044)
MixedShapesRegularTrain       2,925/1024/5/0.1   0.194 (0.024)    0.194 (0.008)    0.191 (0.037)
NonInvasiveFetalECGThorax2    3,765/750/42/0.04  0.418 (0.094)    0.343 (0.008)    0.295 (0.064)
StarLightCurves               9,236/1024/3/0.3   0.120 (0.043)    0.315 (0.067)    0.165 (0.035)
Symbols                       1,020/398/6/0.2    0.103 (0.013)    0.208 (0.030)    0.239 (0.070)
UWaveGestureLibraryX          4,478/315/8/0.5    0.523 (0.023)    0.658 (0.026)    0.498 (0.040)
WordSynonyms                  905/270/25/0.6     0.760 (0.024)    0.913 (0.043)    0.727 (0.041)

E Baseline subtraction for variance reduction

As with RL policy gradients, we may subtract a baseline value [29] to reduce variance. However, unlike the general RL case, due to the stopping-time-based formulation of the OSPG, the OSPG baseline should not be time-varying.

Proposition E.1 (baseline subtraction for OSPG). The optimal stopping policy gradient (OSPG) of Theorem 4.1 is invariant to the subtraction of a constant (w.r.t. trajectory) baseline from every reward in the trajectory. Thus:

$$\nabla_\theta J_{OS}(\theta) = \mathbb{E}_{\bar s_H \sim P(\bar s_H)}\left[\sum_{j=0}^{H} r_j\,\psi^{\theta}_j(s_j)\,\nabla_\theta \log \psi^{\theta}_j(s_j)\right] = \mathbb{E}_{\bar s_H \sim P(\bar s_H)}\left[\sum_{j=0}^{H} (r_j - b)\,\psi^{\theta}_j(s_j)\,\nabla_\theta \log \psi^{\theta}_j(s_j)\right]$$

Proof.
It suffices to show that $\mathbb{E}_{\bar s_H \sim P(\bar s_H)}\left[\sum_{j=0}^{H} b\,\psi^{\theta}_j(s_j)\,\nabla_\theta \log \psi^{\theta}_j(s_j)\right] = 0$. We proceed as follows:

$$\mathbb{E}_{\bar s_H \sim P(\bar s_H)}\left[\sum_{j=0}^{H} b\,\psi^{\theta}_j(s_j)\,\nabla_\theta \log \psi^{\theta}_j(s_j)\right] = \mathbb{E}_{\bar s_H \sim P(\bar s_H)}\left[\sum_{j=0}^{H} b\,\nabla_\theta \psi^{\theta}_j(s_j)\right] = b\,\mathbb{E}_{\bar s_H \sim P(\bar s_H)}\left[\nabla_\theta \sum_{j=0}^{H} \psi^{\theta}_j(s_j)\right] = b\,\mathbb{E}_{\bar s_H \sim P(\bar s_H)}\left[\nabla_\theta\, 1\right] = 0,$$

since $\sum_{j=0}^{H} \psi^{\theta}_j(s_j) = 1$ for every trajectory. This completes the proof.
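As a practical illustration of Proposition E.1 combined with the baseline of equation (43), the following PyTorch-style sketch (our naming, assuming the batch-average reward as the constant baseline) forms the variance-reduced OSPG surrogate; because the subtracted baseline is constant across time-steps and trajectories, the resulting gradient estimator remains unbiased:

```python
import torch

def ospg_surrogate_with_baseline(psi, rewards):
    # psi:     (N, H+1) stopping-time probabilities psi_j^theta(s_ij), requires grad
    # rewards: (N, H+1) rewards r_ij (constants)
    b = rewards.mean()                        # constant baseline b, as in eq. (43)
    v = (psi * (rewards - b)).detach()        # detached, baseline-subtracted weights
    # The gradient of this scalar w.r.t. the policy parameters is the
    # variance-reduced (and still unbiased) OSPG estimate of Proposition E.1.
    return (v * torch.log(psi)).sum(dim=1).mean()
```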