# Deep Optimal Stopping

Journal of Machine Learning Research 20 (2019) 1-25. Submitted 4/18; Revised 1/19; Published 4/19.

Sebastian Becker (sebastian.becker@zenai.ch), Zenai AG, 8045 Zurich, Switzerland
Patrick Cheridito (patrick.cheridito@math.ethz.ch), RiskLab, Department of Mathematics, ETH Zurich, 8092 Zurich, Switzerland
Arnulf Jentzen (arnulf.jentzen@sam.math.ethz.ch), SAM, Department of Mathematics, ETH Zurich, 8092 Zurich, Switzerland

Editor: Jon McAuliffe

Abstract. In this paper we develop a deep learning method for optimal stopping problems which directly learns the optimal stopping rule from Monte Carlo samples. As such, it is broadly applicable in situations where the underlying randomness can efficiently be simulated. We test the approach on three problems: the pricing of a Bermudan max-call option, the pricing of a callable multi barrier reverse convertible and the problem of optimally stopping a fractional Brownian motion. In all three cases it produces very accurate results in high-dimensional situations with short computing times.

Keywords: optimal stopping, deep learning, Bermudan option, callable multi barrier reverse convertible, fractional Brownian motion

## 1. Introduction

We consider optimal stopping problems of the form $\sup_\tau \mathbb{E}\, g(\tau, X_\tau)$, where $X = (X_n)_{n=0}^N$ is an $\mathbb{R}^d$-valued discrete-time Markov process and the supremum is over all stopping times $\tau$ based on observations of $X$. Formally, this covers only situations where the stopping decision can be made at finitely many times. But practically all relevant continuous-time stopping problems can be approximated with time-discretized versions. The Markov assumption means no loss of generality. We make it because it simplifies the presentation and many important problems already are in Markovian form.
But every optimal stopping problem can be made Markov by including all relevant information from the past in the current state of $X$ (albeit at the cost of increasing the dimension of the problem). In theory, optimal stopping problems with finitely many stopping opportunities can be solved exactly. The optimal value is given by the smallest supermartingale that dominates the reward process, the so-called Snell envelope, and the smallest (largest) optimal stopping time is the first time the immediate reward dominates (exceeds) the continuation value; see, e.g., Peskir and Shiryaev (2006) or Lamberton and Lapeyre (2008). However, traditional numerical methods suffer from the curse of dimensionality. For instance, the complexity of standard tree- or lattice-based methods increases exponentially in the dimension. For typical problems they yield good results for up to three dimensions.

© 2019 Sebastian Becker, Patrick Cheridito and Arnulf Jentzen. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/18-232.html.

To treat higher-dimensional problems, various Monte Carlo-based methods have been developed over the last years. A common approach consists in estimating continuation values to either derive stopping rules or recursively approximate the Snell envelope; see, e.g., Tilley (1993), Barraquand and Martineau (1995), Carriere (1996), Longstaff and Schwartz (2001), Tsitsiklis and Van Roy (2001), Boyle et al. (2003), Broadie and Glasserman (2004), Bally et al. (2005), Kolodko and Schoenmakers (2006), Egloff et al. (2007), Berridge and Schumacher (2008), Jain and Oosterlee (2015), Belomestny et al. (2018) or Haugh and Kogan (2004) and Kohler et al. (2010), which use neural networks with one hidden layer to do this. A different strand of the literature has focused on approximating optimal exercise boundaries; see, e.g., Andersen (2000), García (2003) and Belomestny (2011).
Based on an idea of Davis and Karatzas (1994), a dual approach was developed by Rogers (2002) and Haugh and Kogan (2004); see Jamshidian (2007) and Chen and Glasserman (2007) for a multiplicative version and Andersen and Broadie (2004), Broadie and Cao (2008), Belomestny et al. (2009), Rogers (2010), Desai et al. (2012), Belomestny (2013), Belomestny et al. (2013) and Lelong (2016) for extensions and primal-dual methods. In Sirignano and Spiliopoulos (2018) optimal stopping problems in continuous time are treated by approximating the solutions of the corresponding free boundary PDEs with deep neural networks. In this paper we use deep learning to approximate an optimal stopping time. Our approach is related to policy optimization methods used in reinforcement learning (Sutton and Barto, 1998), deep reinforcement learning (Schulman et al., 2015; Mnih et al., 2015; Silver et al., 2016; Lillicrap et al., 2016) and the deep learning method for stochastic control problems proposed by Han and E (2016). However, optimal stopping differs from the typical control problems studied in this literature. The challenge of our approach lies in the implementation of a deep learning method that can efficiently learn optimal stopping times. We do this by decomposing an optimal stopping time into a sequence of 0-1 stopping decisions and approximating them recursively with a sequence of multilayer feedforward neural networks. We show that our neural network policies can approximate optimal stopping times to any desired degree of accuracy. A candidate optimal stopping time $\hat\tau$ can be obtained by running a stochastic gradient ascent. The corresponding expectation $\mathbb{E}\, g(\hat\tau, X_{\hat\tau})$ provides a lower bound for the optimal value $\sup_\tau \mathbb{E}\, g(\tau, X_\tau)$. Using a version of the dual method of Rogers (2002) and Haugh and Kogan (2004), we also derive an upper bound. In all our examples, both bounds can be computed with short run times and lie close together.
The rest of the paper is organized as follows: In Section 2 we introduce the setup and explain our method of approximating optimal stopping times with neural networks. In Section 3 we construct lower bounds, upper bounds, point estimates and confidence intervals for the optimal value. In Section 4 we test the approach on three examples: the pricing of a Bermudan max-call option on different underlying assets, the pricing of a callable multi barrier reverse convertible and the problem of optimally stopping a fractional Brownian motion. In the first two examples, we use a multi-dimensional Black-Scholes model to describe the dynamics of the underlying assets. Then the pricing of a Bermudan max-call option amounts to solving a $d$-dimensional optimal stopping problem, where $d$ is the number of assets. We provide numerical results for $d = 2, 3, 5, 10, 20, 30, 50, 100, 200$ and $500$. In the case of a callable MBRC, it becomes a $(d+1)$-dimensional stopping problem since one also needs to keep track of the barrier event. We present results for $d = 2, 3, 5, 10, 15$ and $30$. In the third example we only consider a one-dimensional fractional Brownian motion. But fractional Brownian motion is not Markov. In fact, all of its increments are correlated. So, to optimally stop it, one has to keep track of all past movements. To make it tractable, we approximate the continuous-time problem with a time-discretized version, which, if formulated as a Markovian problem, has as many dimensions as there are time-steps. We compute a solution for 100 time-steps.

## 2. Deep Learning Optimal Stopping Rules

Let $X = (X_n)_{n=0}^N$ be an $\mathbb{R}^d$-valued discrete-time Markov process on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, where $N$ and $d$ are positive integers. We denote by $\mathcal{F}_n$ the $\sigma$-algebra generated by $X_0, X_1, \dots, X_n$ and call a random variable $\tau\colon \Omega \to \{0, 1, \dots, N\}$ an $X$-stopping time if the event $\{\tau = n\}$ belongs to $\mathcal{F}_n$ for all $n \in \{0, 1, \dots, N\}$.
Our aim is to develop a deep learning method that can efficiently learn an optimal policy for stopping problems of the form

$$\sup_{\tau \in \mathcal{T}} \mathbb{E}\, g(\tau, X_\tau), \tag{1}$$

where $g\colon \{0, 1, \dots, N\} \times \mathbb{R}^d \to \mathbb{R}$ is a measurable function and $\mathcal{T}$ denotes the set of all $X$-stopping times. To make sure that problem (1) is well-defined and admits an optimal solution, we assume that $g$ satisfies the integrability condition

$$\mathbb{E}\, |g(n, X_n)| < \infty \quad \text{for all } n \in \{0, 1, \dots, N\}; \tag{2}$$

see, e.g., Peskir and Shiryaev (2006) or Lamberton and Lapeyre (2008). To be able to derive confidence intervals for the optimal value (1), we will have to make the slightly stronger assumption

$$\mathbb{E}\, g(n, X_n)^2 < \infty \quad \text{for all } n \in \{0, 1, \dots, N\} \tag{3}$$

in Subsection 3.3 below. This is satisfied in all our examples in Section 4.

### 2.1. Expressing Stopping Times in Terms of Stopping Decisions

Any $X$-stopping time can be decomposed into a sequence of 0-1 stopping decisions. In principle, the decision whether to stop the process at time $n$ if it has not been stopped before, can be made based on the whole evolution of $X$ from time 0 until $n$. But to optimally stop the Markov process $X$, it is enough to make stopping decisions according to $f_n(X_n)$ for measurable functions $f_n\colon \mathbb{R}^d \to \{0, 1\}$, $n = 0, 1, \dots, N$. Theorem 1 below extends this well-known fact and serves as the theoretical basis of our method. Consider the auxiliary stopping problems

$$V_n = \sup_{\tau \in \mathcal{T}_n} \mathbb{E}\, g(\tau, X_\tau) \tag{4}$$

for $n = 0, 1, \dots, N$, where $\mathcal{T}_n$ is the set of all $X$-stopping times satisfying $n \le \tau \le N$. Obviously, $\mathcal{T}_N$ consists of the unique element $\tau_N \equiv N$, and one can write $\tau_N = N f_N(X_N)$ for the constant function $f_N \equiv 1$. Moreover, for given $n \in \{0, 1, \dots, N\}$ and a sequence of measurable functions $f_n, f_{n+1}, \dots, f_N\colon \mathbb{R}^d \to \{0, 1\}$ with $f_N \equiv 1$,

$$\tau_n = \sum_{m=n}^{N} m\, f_m(X_m) \prod_{j=n}^{m-1} (1 - f_j(X_j)) \tag{5}$$
defines a stopping time in $\mathcal{T}_n$.[1]

1. In expressions of the form (5), we understand the empty product $\prod_{j=n}^{n-1} (1 - f_j(X_j))$ as 1.

The following result shows that, for our method of recursively computing an approximate solution to the optimal stopping problem (1), it will be sufficient to consider stopping times of the form (5).

**Theorem 1** For a given $n \in \{0, 1, \dots, N-1\}$, let $\tau_{n+1}$ be a stopping time in $\mathcal{T}_{n+1}$ of the form

$$\tau_{n+1} = \sum_{m=n+1}^{N} m\, f_m(X_m) \prod_{j=n+1}^{m-1} (1 - f_j(X_j)) \tag{6}$$

for measurable functions $f_{n+1}, \dots, f_N\colon \mathbb{R}^d \to \{0, 1\}$ with $f_N \equiv 1$. Then there exists a measurable function $f_n\colon \mathbb{R}^d \to \{0, 1\}$ such that the stopping time $\tau_n \in \mathcal{T}_n$ given by (5) satisfies

$$\mathbb{E}\, g(\tau_n, X_{\tau_n}) \ge V_n - \bigl(V_{n+1} - \mathbb{E}\, g(\tau_{n+1}, X_{\tau_{n+1}})\bigr),$$

where $V_n$ and $V_{n+1}$ are the optimal values defined in (4).

Proof. Denote $\varepsilon = V_{n+1} - \mathbb{E}\, g(\tau_{n+1}, X_{\tau_{n+1}})$, and consider a stopping time $\tau \in \mathcal{T}_n$. By the Doob-Dynkin lemma (see, e.g., Aliprantis and Border, 2006, Theorem 4.41), there exists a measurable function $h_n\colon \mathbb{R}^d \to \mathbb{R}$ such that $h_n(X_n)$ is a version of the conditional expectation $\mathbb{E}[g(\tau_{n+1}, X_{\tau_{n+1}}) \mid X_n]$. Moreover, due to the special form (6) of $\tau_{n+1}$,

$$g(\tau_{n+1}, X_{\tau_{n+1}}) = \sum_{m=n+1}^{N} g(m, X_m)\, 1_{\{\tau_{n+1} = m\}} = \sum_{m=n+1}^{N} g(m, X_m)\, 1_{\{f_m(X_m) \prod_{j=n+1}^{m-1} (1 - f_j(X_j)) = 1\}}$$

is a measurable function of $X_{n+1}, \dots, X_N$. So it follows from the Markov property of $X$ that $h_n(X_n)$ is also a version of the conditional expectation $\mathbb{E}[g(\tau_{n+1}, X_{\tau_{n+1}}) \mid \mathcal{F}_n]$. Since the events $D = \{g(n, X_n) \ge h_n(X_n)\}$ and $E = \{\tau = n\}$ are in $\mathcal{F}_n$, $\tau_n = n 1_D + \tau_{n+1} 1_{D^c}$ belongs to $\mathcal{T}_n$ and $\tilde\tau = \tau_{n+1} 1_E + \tau 1_{E^c}$ to $\mathcal{T}_{n+1}$. It follows from the definitions of $V_{n+1}$ and $\varepsilon$ that

$$\mathbb{E}\, g(\tau_{n+1}, X_{\tau_{n+1}}) = V_{n+1} - \varepsilon \ge \mathbb{E}\, g(\tilde\tau, X_{\tilde\tau}) - \varepsilon.$$

Hence,

$$\mathbb{E}\bigl[g(\tau_{n+1}, X_{\tau_{n+1}}) 1_{E^c}\bigr] \ge \mathbb{E}\bigl[g(\tilde\tau, X_{\tilde\tau}) 1_{E^c}\bigr] - \varepsilon = \mathbb{E}\bigl[g(\tau, X_\tau) 1_{E^c}\bigr] - \varepsilon,$$

from which one obtains

$$\begin{aligned}
\mathbb{E}\, g(\tau_n, X_{\tau_n}) &= \mathbb{E}\bigl[g(n, X_n) 1_D + g(\tau_{n+1}, X_{\tau_{n+1}}) 1_{D^c}\bigr] = \mathbb{E}\bigl[g(n, X_n) 1_D + h_n(X_n) 1_{D^c}\bigr] \\
&\ge \mathbb{E}\bigl[g(n, X_n) 1_E + h_n(X_n) 1_{E^c}\bigr] = \mathbb{E}\bigl[g(n, X_n) 1_E + g(\tau_{n+1}, X_{\tau_{n+1}}) 1_{E^c}\bigr] \\
&\ge \mathbb{E}\bigl[g(n, X_n) 1_E + g(\tau, X_\tau) 1_{E^c}\bigr] - \varepsilon = \mathbb{E}\, g(\tau, X_\tau) - \varepsilon.
\end{aligned}$$

Since $\tau \in \mathcal{T}_n$ was arbitrary, this shows that $\mathbb{E}\, g(\tau_n, X_{\tau_n}) \ge V_n - \varepsilon$. Moreover, one has $1_D = f_n(X_n)$ for the function $f_n\colon \mathbb{R}^d \to \{0, 1\}$ given by

$$f_n(x) = \begin{cases} 1 & \text{if } g(n, x) \ge h_n(x) \\ 0 & \text{if } g(n, x) < h_n(x), \end{cases}$$
so that

$$\tau_n = n f_n(X_n) + \tau_{n+1}(1 - f_n(X_n)) = \sum_{m=n}^{N} m\, f_m(X_m) \prod_{j=n}^{m-1} (1 - f_j(X_j)),$$

which concludes the proof.

**Remark 2** Since for $f_N \equiv 1$, the stopping time $\tau_N = N f_N(X_N)$ is optimal in $\mathcal{T}_N$, Theorem 1 inductively yields measurable functions $f_n\colon \mathbb{R}^d \to \{0, 1\}$ such that for all $n \in \{0, 1, \dots, N-1\}$, the stopping time $\tau_n$ given by (5) is optimal among $\mathcal{T}_n$. In particular,

$$\tau = \sum_{n=1}^{N} n\, f_n(X_n) \prod_{j=0}^{n-1} (1 - f_j(X_j)) \tag{7}$$

is an optimal stopping time for problem (1).

**Remark 3** In many applications, the Markov process $X$ starts from a deterministic initial value $x_0 \in \mathbb{R}^d$. Then the function $f_0$ enters the representation (7) only through the value $f_0(x_0) \in \{0, 1\}$; that is, at time 0, only a constant and not a whole function has to be learned.

### 2.2. Neural Network Approximation

Our numerical method for problem (1) consists in iteratively approximating optimal stopping decisions $f_n\colon \mathbb{R}^d \to \{0, 1\}$, $n = 0, 1, \dots, N-1$, by neural networks $f^\theta\colon \mathbb{R}^d \to \{0, 1\}$ with parameter $\theta \in \mathbb{R}^q$. We do this by starting with the terminal stopping decision $f_N \equiv 1$ and proceeding by backward induction. More precisely, let $n \in \{0, 1, \dots, N-1\}$, and assume parameter values $\theta_{n+1}, \theta_{n+2}, \dots, \theta_N \in \mathbb{R}^q$ have been found such that $f^{\theta_N} \equiv 1$ and the stopping time

$$\tau_{n+1} = \sum_{m=n+1}^{N} m\, f^{\theta_m}(X_m) \prod_{j=n+1}^{m-1} (1 - f^{\theta_j}(X_j))$$

produces an expected value $\mathbb{E}\, g(\tau_{n+1}, X_{\tau_{n+1}})$ close to the optimum $V_{n+1}$. Since $f^\theta$ takes values in $\{0, 1\}$, it does not directly lend itself to a gradient-based optimization method. So, as an intermediate step, we introduce a feedforward neural network $F^\theta\colon \mathbb{R}^d \to (0, 1)$ of the form

$$F^\theta = \psi \circ a^\theta_I \circ \varphi_{q_{I-1}} \circ a^\theta_{I-1} \circ \cdots \circ \varphi_{q_1} \circ a^\theta_1,$$

where

- $I, q_1, q_2, \dots, q_{I-1}$ are positive integers specifying the depth of the network and the number of nodes in the hidden layers (if there are any),
- $a^\theta_1\colon \mathbb{R}^d \to \mathbb{R}^{q_1}, \dots, a^\theta_{I-1}\colon \mathbb{R}^{q_{I-2}} \to \mathbb{R}^{q_{I-1}}$ and $a^\theta_I\colon \mathbb{R}^{q_{I-1}} \to \mathbb{R}$ are affine functions,
- for $j \in \mathbb{N}$, $\varphi_j\colon \mathbb{R}^j \to \mathbb{R}^j$ is the component-wise ReLU activation function given by $\varphi_j(x_1, \dots, x_j) = (x_1^+, \dots, x_j^+)$,
- $\psi\colon \mathbb{R} \to (0, 1)$ is the standard logistic function $\psi(x) = e^x/(1 + e^x) = 1/(1 + e^{-x})$.

The components of the parameter $\theta \in \mathbb{R}^q$ of $F^\theta$ consist of the entries of the matrices $A_1 \in \mathbb{R}^{q_1 \times d}, \dots, A_{I-1} \in \mathbb{R}^{q_{I-1} \times q_{I-2}}, A_I \in \mathbb{R}^{1 \times q_{I-1}}$ and the vectors $b_1 \in \mathbb{R}^{q_1}, \dots, b_{I-1} \in \mathbb{R}^{q_{I-1}}, b_I \in \mathbb{R}$ given by the representation of the affine functions $a^\theta_i(x) = A_i x + b_i$, $i = 1, \dots, I$. So the dimension of the parameter space is

$$q = \begin{cases} d + 1 & \text{if } I = 1 \\ 1 + q_1 + \cdots + q_{I-1} + d q_1 + q_1 q_2 + \cdots + q_{I-2} q_{I-1} + q_{I-1} & \text{if } I \ge 2, \end{cases}$$

and for given $x \in \mathbb{R}^d$, $F^\theta(x)$ is continuous as well as almost everywhere smooth in $\theta$. Our aim is to determine $\theta_n \in \mathbb{R}^q$ so that

$$\mathbb{E}\bigl[g(n, X_n) F^{\theta_n}(X_n) + g(\tau_{n+1}, X_{\tau_{n+1}})(1 - F^{\theta_n}(X_n))\bigr]$$

is close to the supremum

$$\sup_{\theta \in \mathbb{R}^q} \mathbb{E}\bigl[g(n, X_n) F^{\theta}(X_n) + g(\tau_{n+1}, X_{\tau_{n+1}})(1 - F^{\theta}(X_n))\bigr].$$

Once this has been achieved, we define the function $f^{\theta_n}\colon \mathbb{R}^d \to \{0, 1\}$ by

$$f^{\theta_n} = 1_{[0, \infty)} \circ a^{\theta_n}_I \circ \varphi_{q_{I-1}} \circ a^{\theta_n}_{I-1} \circ \cdots \circ \varphi_{q_1} \circ a^{\theta_n}_1, \tag{8}$$

where $1_{[0, \infty)}\colon \mathbb{R} \to \{0, 1\}$ is the indicator function of $[0, \infty)$. The only difference between $F^{\theta_n}$ and $f^{\theta_n}$ is the final nonlinearity. While $F^{\theta_n}$ produces a stopping probability in $(0, 1)$, the output of $f^{\theta_n}$ is a hard stopping decision given by 0 or 1, depending on whether $F^{\theta_n}$ takes a value below or above $1/2$. The following result shows that for any depth $I \ge 2$, a neural network of the form (8) is flexible enough to make almost optimal stopping decisions provided it has sufficiently many nodes.

**Proposition 4** Let $n \in \{0, 1, \dots, N-1\}$ and fix a stopping time $\tau_{n+1} \in \mathcal{T}_{n+1}$. Then, for every depth $I \ge 2$ and constant $\varepsilon > 0$, there exist positive integers $q_1, \dots, q_{I-1}$ such that

$$\sup_{\theta \in \mathbb{R}^q} \mathbb{E}\bigl[g(n, X_n) f^\theta(X_n) + g(\tau_{n+1}, X_{\tau_{n+1}})(1 - f^\theta(X_n))\bigr] \ge \sup_{f \in \mathcal{D}} \mathbb{E}\bigl[g(n, X_n) f(X_n) + g(\tau_{n+1}, X_{\tau_{n+1}})(1 - f(X_n))\bigr] - \varepsilon,$$

where $\mathcal{D}$ is the set of all measurable functions $f\colon \mathbb{R}^d \to \{0, 1\}$.

Proof. Fix $\varepsilon > 0$. It follows from the integrability condition (2) that there exists a measurable function $\tilde f\colon \mathbb{R}^d \to \{0, 1\}$ such that

$$\mathbb{E}\bigl[g(n, X_n) \tilde f(X_n) + g(\tau_{n+1}, X_{\tau_{n+1}})(1 - \tilde f(X_n))\bigr] \ge \sup_{f \in \mathcal{D}} \mathbb{E}\bigl[g(n, X_n) f(X_n) + g(\tau_{n+1}, X_{\tau_{n+1}})(1 - f(X_n))\bigr] - \varepsilon/4. \tag{9}$$
$\tilde f$ can be written as $\tilde f = 1_A$ for the Borel set $A = \{x \in \mathbb{R}^d : \tilde f(x) = 1\}$. Moreover, by (2),

$$B \mapsto \mathbb{E}\bigl[|g(n, X_n)| 1_B(X_n)\bigr] \quad \text{and} \quad B \mapsto \mathbb{E}\bigl[|g(\tau_{n+1}, X_{\tau_{n+1}})| 1_B(X_n)\bigr]$$

define finite Borel measures on $\mathbb{R}^d$. Since every finite Borel measure on $\mathbb{R}^d$ is tight (see, e.g., Aliprantis and Border, 2006), there exists a compact (possibly empty) subset $K \subseteq A$ such that

$$\mathbb{E}\bigl[g(n, X_n) 1_K(X_n) + g(\tau_{n+1}, X_{\tau_{n+1}})(1 - 1_K(X_n))\bigr] \ge \mathbb{E}\bigl[g(n, X_n) \tilde f(X_n) + g(\tau_{n+1}, X_{\tau_{n+1}})(1 - \tilde f(X_n))\bigr] - \varepsilon/4. \tag{10}$$

Let $\rho_K\colon \mathbb{R}^d \to [0, \infty]$ be the distance function given by $\rho_K(x) = \inf_{y \in K} \|x - y\|_2$. Then

$$k_j(x) = \max\{1 - j \rho_K(x),\ -1\}, \quad j \in \mathbb{N},$$

defines a sequence of continuous functions $k_j\colon \mathbb{R}^d \to [-1, 1]$ that converge pointwise to $1_K - 1_{K^c}$. So it follows from Lebesgue's dominated convergence theorem that there exists a $j \in \mathbb{N}$ such that

$$\mathbb{E}\bigl[g(n, X_n) 1_{\{k_j(X_n) \ge 0\}} + g(\tau_{n+1}, X_{\tau_{n+1}})(1 - 1_{\{k_j(X_n) \ge 0\}})\bigr] \ge \mathbb{E}\bigl[g(n, X_n) 1_K(X_n) + g(\tau_{n+1}, X_{\tau_{n+1}})(1 - 1_K(X_n))\bigr] - \varepsilon/4. \tag{11}$$

By Theorem 1 of Leshno et al. (1993), $k_j$ can be approximated uniformly on compacts by functions of the form

$$\sum_{i=1}^{r} (v_i^T x + c_i)^+ - \sum_{i=1}^{s} (w_i^T x + d_i)^+ \tag{12}$$

for $r, s \in \mathbb{N}$, $v_1, \dots, v_r, w_1, \dots, w_s \in \mathbb{R}^d$ and $c_1, \dots, c_r, d_1, \dots, d_s \in \mathbb{R}$. So there exists a function $h\colon \mathbb{R}^d \to \mathbb{R}$ expressible as in (12) such that

$$\mathbb{E}\bigl[g(n, X_n) 1_{\{h(X_n) \ge 0\}} + g(\tau_{n+1}, X_{\tau_{n+1}})(1 - 1_{\{h(X_n) \ge 0\}})\bigr] \ge \mathbb{E}\bigl[g(n, X_n) 1_{\{k_j(X_n) \ge 0\}} + g(\tau_{n+1}, X_{\tau_{n+1}})(1 - 1_{\{k_j(X_n) \ge 0\}})\bigr] - \varepsilon/4. \tag{13}$$

Now note that for any integer $I \ge 2$, the composite mapping $1_{[0, \infty)} \circ h$ can be written as a neural net $f^\theta$ of the form (8) with depth $I$ for suitable integers $q_1, \dots, q_{I-1}$ and a parameter value $\theta \in \mathbb{R}^q$. Hence, one obtains from (9), (10), (11) and (13) that

$$\mathbb{E}\bigl[g(n, X_n) f^\theta(X_n) + g(\tau_{n+1}, X_{\tau_{n+1}})(1 - f^\theta(X_n))\bigr] \ge \sup_{f \in \mathcal{D}} \mathbb{E}\bigl[g(n, X_n) f(X_n) + g(\tau_{n+1}, X_{\tau_{n+1}})(1 - f(X_n))\bigr] - \varepsilon,$$

and the proof is complete.

We always choose $\theta_N \in \mathbb{R}^q$ such that $f^{\theta_N} \equiv 1$.[2] Then our candidate optimal stopping time

$$\tau^\Theta = \sum_{n=1}^{N} n\, f^{\theta_n}(X_n) \prod_{j=0}^{n-1} (1 - f^{\theta_j}(X_j)) \tag{14}$$

is specified by the vector $\Theta = (\theta_0, \theta_1, \dots, \theta_{N-1}) \in \mathbb{R}^{Nq}$.

2. It is easy to see that this is possible.
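To make the construction concrete, the soft network $F^\theta$, the hard decision $f^\theta$ from (8) (which differ only in the final nonlinearity), and the induced stopping time (14) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation; all function names are hypothetical, and the parameters are plain `(A, b)` pairs for the affine maps.

```python
import numpy as np

def soft_decision(params, x):
    """F^theta: logistic(a_I . ReLU . a_{I-1} . ... . ReLU . a_1)(x),
    a stopping probability in (0, 1). params is a list of (A, b) pairs."""
    z = np.asarray(x, dtype=float)
    for A, b in params[:-1]:
        z = np.maximum(A @ z + b, 0.0)        # hidden layer: affine map + ReLU
    A, b = params[-1]
    pre = (A @ z + b).item()                  # scalar pre-activation
    return 1.0 / (1.0 + np.exp(-pre))         # logistic function psi

def hard_decision(params, x):
    """f^theta from (8): same network, final logistic replaced by 1_[0,inf)."""
    z = np.asarray(x, dtype=float)
    for A, b in params[:-1]:
        z = np.maximum(A @ z + b, 0.0)
    A, b = params[-1]
    return 1 if (A @ z + b).item() >= 0.0 else 0

def stopping_time(all_params, path):
    """tau^Theta from (14) along one path: stop at the first n >= 1 whose hard
    decision is 1 (the time-0 decision is a constant handled separately, cf.
    Remark 6); the terminal decision is forced to 1, so stopping at N is sure."""
    N = len(path) - 1
    for n in range(1, N):
        if hard_decision(all_params[n], path[n]) == 1:
            return n
    return N
```

By construction `hard_decision(params, x) == 1` exactly when `soft_decision(params, x) >= 1/2`, since both compare the same pre-activation against 0.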
The following is an immediate consequence of Theorem 1 and Proposition 4:

**Corollary 5** For a given optimal stopping problem of the form (1), a depth $I \ge 2$ and a constant $\varepsilon > 0$, there exist positive integers $q_1, \dots, q_{I-1}$ and a vector $\Theta \in \mathbb{R}^{Nq}$ such that the corresponding stopping time (14) satisfies

$$\mathbb{E}\, g(\tau^\Theta, X_{\tau^\Theta}) \ge \sup_{\tau \in \mathcal{T}} \mathbb{E}\, g(\tau, X_\tau) - \varepsilon.$$

### 2.3. Parameter Optimization

We train neural networks of the form (8) with fixed depth $I \ge 2$ and given numbers $q_1, \dots, q_{I-1}$ of nodes in the hidden layers.[3] To numerically find parameters $\theta_n \in \mathbb{R}^q$ yielding good stopping decisions $f^{\theta_n}$ for all times $n \in \{0, 1, \dots, N-1\}$, we approximate expected values with averages of Monte Carlo samples calculated from simulated paths of the process $(X_n)_{n=0}^N$. Let $(x^k_n)_{n=0}^N$, $k = 1, 2, \dots$, be independent realizations of such paths. We choose $\theta_N \in \mathbb{R}^q$ such that $f^{\theta_N} \equiv 1$ and determine $\theta_n \in \mathbb{R}^q$ for $n \le N-1$ recursively. So, suppose that for a given $n \in \{0, 1, \dots, N-1\}$, parameters $\theta_{n+1}, \dots, \theta_N \in \mathbb{R}^q$ have been found so that the stopping decisions $f^{\theta_{n+1}}, \dots, f^{\theta_N}$ generate a stopping time

$$\tau_{n+1} = \sum_{m=n+1}^{N} m\, f^{\theta_m}(X_m) \prod_{j=n+1}^{m-1} (1 - f^{\theta_j}(X_j))$$

with corresponding expectation $\mathbb{E}\, g(\tau_{n+1}, X_{\tau_{n+1}})$ close to the optimal value $V_{n+1}$. If $n = N-1$, one has $\tau_{n+1} \equiv N$, and if $n \le N-2$, $\tau_{n+1}$ can be written as $\tau_{n+1} = l_{n+1}(X_{n+1}, \dots, X_{N-1})$ for a measurable function $l_{n+1}\colon \mathbb{R}^{d(N-n-1)} \to \{n+1, n+2, \dots, N\}$. Accordingly, denote

$$l^k_{n+1} = \begin{cases} N & \text{if } n = N-1 \\ l_{n+1}(x^k_{n+1}, \dots, x^k_{N-1}) & \text{if } n \le N-2. \end{cases}$$

If at time $n$, one applies the soft stopping decision $F^\theta$ and afterwards behaves according to $f^{\theta_{n+1}}, \dots, f^{\theta_N}$, the realized reward along the $k$-th simulated path of $X$ is

$$r^k_n(\theta) = g(n, x^k_n) F^\theta(x^k_n) + g\bigl(l^k_{n+1}, x^k_{l^k_{n+1}}\bigr)\bigl(1 - F^\theta(x^k_n)\bigr).$$

For large $K \in \mathbb{N}$,

$$\frac{1}{K} \sum_{k=1}^{K} r^k_n(\theta) \tag{15}$$

approximates the expected value

$$\mathbb{E}\bigl[g(n, X_n) F^\theta(X_n) + g(\tau_{n+1}, X_{\tau_{n+1}})(1 - F^\theta(X_n))\bigr].$$

3. For a given application, one can try out different choices of $I$ and $q_1, \dots, q_{I-1}$ to find a suitable trade-off between accuracy and efficiency. Alternatively, the determination of $I$ and $q_1, \dots, q_{I-1}$ could be built into the training algorithm.

Since $r^k_n(\theta)$ is almost everywhere differentiable in $\theta$, a stochastic gradient ascent method can be applied to find an approximate optimizer $\theta_n \in \mathbb{R}^q$ of (15). The same simulations $(x^k_n)_{n=0}^N$, $k = 1, 2, \dots$, can be used to train the stopping decisions $f^{\theta_n}$ at all times $n \in \{0, 1, \dots, N-1\}$. In the numerical examples in Section 4 below, we employed mini-batch gradient ascent with Xavier initialization (Glorot and Bengio, 2010), batch normalization (Ioffe and Szegedy, 2015) and Adam updating (Kingma and Ba, 2015).

**Remark 6** If the Markov process $X$ starts from a deterministic initial value $x_0 \in \mathbb{R}^d$, the initial stopping decision is given by a constant $f_0 \in \{0, 1\}$. To learn $f_0$ from simulated paths of $X$, it is enough to compare the initial reward $g(0, x_0)$ to a Monte Carlo estimate $\hat C$ of $\mathbb{E}\, g(\tau_1, X_{\tau_1})$, where $\tau_1 \in \mathcal{T}_1$ is of the form

$$\tau_1 = \sum_{n=1}^{N} n\, f^{\theta_n}(X_n) \prod_{j=1}^{n-1} (1 - f^{\theta_j}(X_j))$$

for $f^{\theta_N} \equiv 1$ and trained parameters $\theta_1, \dots, \theta_{N-1} \in \mathbb{R}^q$. Then one sets $f_0 = 1$ (that is, stop immediately) if $g(0, x_0) \ge \hat C$ and $f_0 = 0$ (continue) otherwise. The resulting stopping time is of the form

$$\tau^\Theta = \begin{cases} 0 & \text{if } f_0 = 1 \\ \tau_1 & \text{if } f_0 = 0. \end{cases}$$

## 3. Bounds, Point Estimates and Confidence Intervals

In this section we derive lower and upper bounds as well as point estimates and confidence intervals for the optimal value $V_0 = \sup_{\tau \in \mathcal{T}} \mathbb{E}\, g(\tau, X_\tau)$.

### 3.1. Lower Bound

Once the stopping decisions $f^{\theta_n}$ have been trained, the stopping time $\tau^\Theta$ given by (14) yields a lower bound $L = \mathbb{E}\, g(\tau^\Theta, X_{\tau^\Theta})$ for the optimal value $V_0$. To estimate it, we simulate a new set[4] of independent realizations $(y^k_n)_{n=0}^N$, $k = 1, 2, \dots, K_L$, of $(X_n)_{n=0}^N$. $\tau^\Theta$ is of the form $\tau^\Theta = l(X_0, \dots, X_{N-1})$ for a measurable function $l\colon \mathbb{R}^{dN} \to \{0, 1, \dots, N\}$. Denote $l_k = l(y^k_0, \dots, y^k_{N-1})$. The Monte Carlo approximation

$$\hat L = \frac{1}{K_L} \sum_{k=1}^{K_L} g\bigl(l_k, y^k_{l_k}\bigr)$$

gives an unbiased estimate of the lower bound $L$, and by the law of large numbers, $\hat L$ converges to $L$ as $K_L \to \infty$.
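The lower-bound estimate $\hat L$ amounts to stopping each fresh path with the trained rule and averaging the rewards. A minimal one-dimensional sketch, with toy decision functions standing in for the trained networks (names hypothetical):

```python
import numpy as np

def lower_bound_estimate(paths, decisions, g):
    """Monte Carlo estimate L_hat of L = E g(tau, X_tau) for the stopping
    time induced by hard 0-1 decision functions.

    paths:     array of shape (K, N+1) with K simulated paths (here d = 1).
    decisions: list of N+1 functions mapping a state to 0 or 1; the last one
               must be identically 1 so that stopping at N is guaranteed.
    g:         reward function g(n, x).
    Returns the sample mean of g(tau_k, y^k_{tau_k}) over the K paths.
    """
    K = paths.shape[0]
    rewards = np.empty(K)
    for k in range(K):
        for n, x in enumerate(paths[k]):
            if decisions[n](x) == 1:     # stop at the first n with f_n(x_n) = 1
                rewards[k] = g(n, x)
                break
    return rewards.mean()
```

By the law of large numbers this sample mean converges to $L$ as the number of paths grows, exactly as stated above for $\hat L$.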
4. In particular, we assume that the samples $(y^k_n)_{n=0}^N$, $k = 1, \dots, K_L$, are drawn independently from the realizations $(x^k_n)_{n=0}^N$, $k = 1, \dots, K$, used in the training of the stopping decisions.

### 3.2. Upper Bound

The Snell envelope of the reward process $(g(n, X_n))_{n=0}^N$ is the smallest[5] supermartingale with respect to $(\mathcal{F}_n)_{n=0}^N$ that dominates $(g(n, X_n))_{n=0}^N$. It is given[6] by

$$H_n = \operatorname*{ess\,sup}_{\tau \in \mathcal{T}_n} \mathbb{E}[g(\tau, X_\tau) \mid \mathcal{F}_n], \quad n = 0, 1, \dots, N;$$

see, e.g., Peskir and Shiryaev (2006) or Lamberton and Lapeyre (2008). Its Doob-Meyer decomposition is $H_n = H_0 + M^H_n - A^H_n$, where $M^H$ is the $(\mathcal{F}_n)$-martingale given[6] by $M^H_0 = 0$ and

$$M^H_n - M^H_{n-1} = H_n - \mathbb{E}[H_n \mid \mathcal{F}_{n-1}], \quad n = 1, \dots, N,$$

and $A^H$ is the nondecreasing $(\mathcal{F}_n)$-predictable process given[6] by $A^H_0 = 0$ and

$$A^H_n - A^H_{n-1} = H_{n-1} - \mathbb{E}[H_n \mid \mathcal{F}_{n-1}], \quad n = 1, \dots, N.$$

Our estimate of an upper bound for the optimal value $V_0$ is based on the following variant[7] of the dual formulation of optimal stopping problems introduced by Rogers (2002) and Haugh and Kogan (2004).

**Proposition 7** Let $(\varepsilon_n)_{n=0}^N$ be a sequence of integrable random variables on $(\Omega, \mathcal{F}, \mathbb{P})$. Then

$$V_0 \ge \mathbb{E}\Bigl[\max_{0 \le n \le N} \bigl(g(n, X_n) - M^H_n - \varepsilon_n\bigr)\Bigr] + \mathbb{E}\Bigl[\min_{0 \le n \le N} \bigl(A^H_n + \varepsilon_n\bigr)\Bigr]. \tag{16}$$

Moreover, if $\mathbb{E}[\varepsilon_n \mid \mathcal{F}_n] = 0$ for all $n \in \{0, 1, \dots, N\}$, one has

$$V_0 \le \mathbb{E}\Bigl[\max_{0 \le n \le N} \bigl(g(n, X_n) - M_n - \varepsilon_n\bigr)\Bigr] \tag{17}$$

for every $(\mathcal{F}_n)$-martingale $(M_n)_{n=0}^N$ starting from 0.

Proof. First, note that

$$\mathbb{E}\Bigl[\max_{0 \le n \le N} \bigl(g(n, X_n) - M^H_n - \varepsilon_n\bigr)\Bigr] \le \mathbb{E}\Bigl[\max_{0 \le n \le N} \bigl(H_n - M^H_n - \varepsilon_n\bigr)\Bigr] = \mathbb{E}\Bigl[\max_{0 \le n \le N} \bigl(H_0 - A^H_n - \varepsilon_n\bigr)\Bigr] = V_0 - \mathbb{E}\Bigl[\min_{0 \le n \le N} \bigl(A^H_n + \varepsilon_n\bigr)\Bigr],$$

which shows (16). Now, assume that $\mathbb{E}[\varepsilon_n \mid \mathcal{F}_n] = 0$ for all $n \in \{0, 1, \dots, N\}$, and let $\tau$ be an $X$-stopping time. Then

$$\mathbb{E}[\varepsilon_\tau] = \sum_{n=0}^{N} \mathbb{E}\bigl[1_{\{\tau = n\}}\, \varepsilon_n\bigr] = \sum_{n=0}^{N} \mathbb{E}\bigl[1_{\{\tau = n\}}\, \mathbb{E}[\varepsilon_n \mid \mathcal{F}_n]\bigr] = 0.$$

5. in the $\mathbb{P}$-almost sure order
6. up to $\mathbb{P}$-almost sure equality
7. See also the discussion on noisy estimates in Andersen and Broadie (2004).

So one obtains from the optional stopping theorem (see, e.g., Grimmett and Stirzaker, 2001),

$$\mathbb{E}\, g(\tau, X_\tau) = \mathbb{E}[g(\tau, X_\tau) - M_\tau - \varepsilon_\tau] \le \mathbb{E}\Bigl[\max_{0 \le n \le N} \bigl(g(n, X_n) - M_n - \varepsilon_n\bigr)\Bigr]$$

for every $(\mathcal{F}_n)$-martingale $(M_n)_{n=0}^N$ starting from 0.
Since $V_0 = \sup_{\tau \in \mathcal{T}} \mathbb{E}\, g(\tau, X_\tau)$, this implies (17).

For every $(\mathcal{F}_n)$-martingale $(M_n)_{n=0}^N$ starting from 0 and each sequence of integrable error terms $(\varepsilon_n)_{n=0}^N$ satisfying $\mathbb{E}[\varepsilon_n \mid \mathcal{F}_n] = 0$ for all $n$, the right side of (17) provides an upper bound[8] for $V_0$, and by (16), this upper bound is tight if $M = M^H$ and $\varepsilon \equiv 0$. So we try to use our candidate optimal stopping time $\tau^\Theta$ to construct a martingale close to $M^H$. The closer $\tau^\Theta$ is to an optimal stopping time, the better the value process[9]

$$H^\Theta_n = \mathbb{E}\bigl[g(\tau^\Theta_n, X_{\tau^\Theta_n}) \mid \mathcal{F}_n\bigr], \quad n = 0, 1, \dots, N,$$

corresponding to

$$\tau^\Theta_n = \sum_{m=n}^{N} m\, f^{\theta_m}(X_m) \prod_{j=n}^{m-1} (1 - f^{\theta_j}(X_j)), \quad n = 0, 1, \dots, N,$$

approximates the Snell envelope $(H_n)_{n=0}^N$. The martingale part of $(H^\Theta_n)_{n=0}^N$ is given by $M^\Theta_0 = 0$ and

$$M^\Theta_n - M^\Theta_{n-1} = H^\Theta_n - \mathbb{E}\bigl[H^\Theta_n \mid \mathcal{F}_{n-1}\bigr] = f^{\theta_n}(X_n)\, g(n, X_n) + (1 - f^{\theta_n}(X_n))\, C^\Theta_n - C^\Theta_{n-1}, \quad n \ge 1, \tag{18}$$

for the continuation values[10]

$$C^\Theta_n = \mathbb{E}\bigl[g(\tau^\Theta_{n+1}, X_{\tau^\Theta_{n+1}}) \mid \mathcal{F}_n\bigr] = \mathbb{E}\bigl[g(\tau^\Theta_{n+1}, X_{\tau^\Theta_{n+1}}) \mid X_n\bigr], \quad n = 0, 1, \dots, N-1.$$

Note that $C^\Theta_N$ does not have to be specified. It formally appears in (18) for $n = N$. But $(1 - f^{\theta_N}(X_N))$ is always 0. To estimate $M^\Theta$, we generate a third set[11] of independent realizations $(z^k_n)_{n=0}^N$, $k = 1, 2, \dots, K_U$, of $(X_n)_{n=0}^N$. In addition, for every $z^k_n$, we simulate $J$ continuation paths $\tilde z^{k,j}_{n+1}, \dots, \tilde z^{k,j}_N$, $j = 1, \dots, J$, that are conditionally independent[12] of each other and of $z^k_{n+1}, \dots, z^k_N$. Let us denote by $\tau^{k,j}_{n+1}$ the value of $\tau^\Theta_{n+1}$ along $\tilde z^{k,j}_{n+1}, \dots, \tilde z^{k,j}_N$. Estimating the continuation values as

$$C^k_n = \frac{1}{J} \sum_{j=1}^{J} g\Bigl(\tau^{k,j}_{n+1}, \tilde z^{k,j}_{\tau^{k,j}_{n+1}}\Bigr), \quad n = 0, 1, \dots, N-1,$$

yields the noisy estimates

$$\Delta M^k_n = f^{\theta_n}(z^k_n)\, g(n, z^k_n) + (1 - f^{\theta_n}(z^k_n))\, C^k_n - C^k_{n-1}$$

of the increments $M^\Theta_n - M^\Theta_{n-1}$ along the $k$-th simulated path $z^k_0, \dots, z^k_N$. So

$$M^k_n = \begin{cases} 0 & \text{if } n = 0 \\ \sum_{m=1}^{n} \Delta M^k_m & \text{if } n \ge 1 \end{cases}$$

can be viewed as realizations of $M^\Theta_n + \varepsilon_n$ for estimation errors $\varepsilon_n$ with standard deviations proportional to $1/\sqrt{J}$ such that $\mathbb{E}[\varepsilon_n \mid \mathcal{F}_n] = 0$ for all $n$. Accordingly,

$$\hat U = \frac{1}{K_U} \sum_{k=1}^{K_U} \max_{0 \le n \le N} \bigl(g(n, z^k_n) - M^k_n\bigr)$$

is an unbiased estimate of the upper bound

$$U = \mathbb{E}\Bigl[\max_{0 \le n \le N} \bigl(g(n, X_n) - M^\Theta_n - \varepsilon_n\bigr)\Bigr],$$

which, by the law of large numbers, it converges to as $K_U \to \infty$.

### 3.3. Point Estimate and Confidence Intervals

Our point estimate of $V_0$ is the average

$$\hat V = \frac{\hat L + \hat U}{2}. \tag{19}$$

To derive confidence intervals, we assume that $g(n, X_n)$ is square-integrable[13] for all $n$. Then $g(\tau^\Theta, X_{\tau^\Theta})$ and $\max_{0 \le n \le N} (g(n, X_n) - M^\Theta_n - \varepsilon_n)$ are square-integrable too. Hence, one obtains from the central limit theorem that for large $K_L$, $\hat L$ is approximately normally distributed with mean $L$ and variance $\hat\sigma^2_L / K_L$ for

$$\hat\sigma^2_L = \frac{1}{K_L - 1} \sum_{k=1}^{K_L} \bigl(g(l_k, y^k_{l_k}) - \hat L\bigr)^2.$$

8. Note that for the right side of (17) to be a valid upper bound, it is sufficient that $\mathbb{E}[\varepsilon_n \mid \mathcal{F}_n] = 0$ for all $n$. In particular, $\varepsilon_0, \varepsilon_1, \dots, \varepsilon_N$ can have any arbitrary dependence structure.
9. Again, since $H^\Theta_n$, $M^\Theta_n$ and $C^\Theta_n$ are given by conditional expectations, they are only specified up to $\mathbb{P}$-almost sure equality.
10. The two conditional expectations are equal since $(X_n)_{n=0}^N$ is Markov and $\tau^\Theta_{n+1}$ only depends on $(X_{n+1}, \dots, X_{N-1})$.
11. The realizations $(z^k_n)_{n=0}^N$, $k = 1, \dots, K_U$, must be drawn independently of $(x^k_n)_{n=0}^N$, $k = 1, \dots, K$, so that our estimate of the upper bound does not depend on the samples used to train the stopping decisions. But theoretically, they can depend on $(y^k_n)_{n=0}^N$, $k = 1, \dots, K_L$, without affecting the unbiasedness of the estimate $\hat U$ or the validity of the confidence interval derived in Subsection 3.3 below.
12. More precisely, the tuples $(\tilde z^{k,j}_{n+1}, \dots, \tilde z^{k,j}_N)$, $j = 1, \dots, J$, are simulated according to $p_n(z^k_n, \cdot)$, where $p_n$ is a transition kernel from $\mathbb{R}^d$ to $\mathbb{R}^{(N-n)d}$ such that $p_n(X_n, B) = \mathbb{P}[(X_{n+1}, \dots, X_N) \in B \mid X_n]$ $\mathbb{P}$-almost surely for all Borel sets $B \subseteq \mathbb{R}^{(N-n)d}$. We generate them independently of each other across $j$ and $k$. On the other hand, the continuation paths starting from $z^k_n$ do not have to be drawn independently of those starting from $z^k_{n'}$ for $n \ne n'$.
13. See condition (3).
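Once the martingale increments have been estimated by nested simulation as in (18), computing $\hat U$ is a simple cumulative-sum-and-max over each path. A minimal sketch (the increments are supplied externally; function names hypothetical):

```python
import numpy as np

def dual_upper_bound(rewards, increments):
    """Monte Carlo estimate U_hat = (1/K_U) sum_k max_n (g(n, z^k_n) - M^k_n),
    where M^k_0 = 0 and M^k_n is the running sum of the estimated
    martingale increments Delta M^k_1, ..., Delta M^k_N.

    rewards:    array of shape (K, N+1) with rewards[k, n] = g(n, z^k_n).
    increments: array of shape (K, N) with the estimated increments.
    """
    K = rewards.shape[0]
    # M^k_n: prepend M^k_0 = 0, then cumulative sums of the increments
    M = np.concatenate([np.zeros((K, 1)), np.cumsum(increments, axis=1)], axis=1)
    return np.max(rewards - M, axis=1).mean()
```

With perfectly estimated increments of the Snell envelope's martingale part, the pathwise maximum collapses toward $V_0$; noisy increments only inflate the bound, which is why (17) tolerates the zero-conditional-mean errors $\varepsilon_n$.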
So, for every $\alpha \in (0, 1]$,

$$\Bigl[\hat L - z_{\alpha/2} \frac{\hat\sigma_L}{\sqrt{K_L}},\ \infty\Bigr)$$

is an asymptotically valid $1 - \alpha/2$ confidence interval for $L$, where $z_{\alpha/2}$ is the $1 - \alpha/2$ quantile of the standard normal distribution. Similarly,

$$\Bigl(-\infty,\ \hat U + z_{\alpha/2} \frac{\hat\sigma_U}{\sqrt{K_U}}\Bigr] \quad \text{with} \quad \hat\sigma^2_U = \frac{1}{K_U - 1} \sum_{k=1}^{K_U} \Bigl(\max_{0 \le n \le N} \bigl(g(n, z^k_n) - M^k_n\bigr) - \hat U\Bigr)^2$$

is an asymptotically valid $1 - \alpha/2$ confidence interval for $U$. It follows that for every constant $\varepsilon > 0$, one has

$$\mathbb{P}\Bigl[V_0 < \hat L - z_{\alpha/2} \frac{\hat\sigma_L}{\sqrt{K_L}} \ \text{ or } \ V_0 > \hat U + z_{\alpha/2} \frac{\hat\sigma_U}{\sqrt{K_U}}\Bigr] \le \mathbb{P}\Bigl[L < \hat L - z_{\alpha/2} \frac{\hat\sigma_L}{\sqrt{K_L}}\Bigr] + \mathbb{P}\Bigl[U > \hat U + z_{\alpha/2} \frac{\hat\sigma_U}{\sqrt{K_U}}\Bigr] \le \alpha + \varepsilon$$

as soon as $K_L$ and $K_U$ are large enough. In particular,

$$\Bigl[\hat L - z_{\alpha/2} \frac{\hat\sigma_L}{\sqrt{K_L}},\ \hat U + z_{\alpha/2} \frac{\hat\sigma_U}{\sqrt{K_U}}\Bigr]$$

is an asymptotically valid $1 - \alpha$ confidence interval for $V_0$.

## 4. Examples

In this section we test[14] our method on three examples: the pricing of a Bermudan max-call option, the pricing of a callable multi barrier reverse convertible and the problem of optimally stopping a fractional Brownian motion.

### 4.1. Bermudan Max-Call Options

Bermudan max-call options are one of the most studied examples in the numerics literature on optimal stopping problems (see, e.g., Longstaff and Schwartz, 2001; Rogers, 2002; García, 2003; Boyle et al., 2003; Haugh and Kogan, 2004; Broadie and Glasserman, 2004; Andersen and Broadie, 2004; Broadie and Cao, 2008; Berridge and Schumacher, 2008; Belomestny, 2011, 2013; Jain and Oosterlee, 2015; Lelong, 2016). Their payoff depends on the maximum of $d$ underlying assets. Assume the risk-neutral dynamics of the assets are given by a multi-dimensional Black-Scholes model[15]

$$S^i_t = s^i_0 \exp\bigl([r - \delta_i - \sigma_i^2/2]\, t + \sigma_i W^i_t\bigr), \quad i = 1, 2, \dots, d, \tag{20}$$

14. All computations were performed in single precision (float32) on a NVIDIA GeForce GTX 1080 GPU with 1974 MHz core clock and 8 GB GDDR5X memory with 1809.5 MHz clock rate. The underlying system consisted of an Intel Core i7-6800K 3.4 GHz CPU with 64 GB DDR4-2133 memory running TensorFlow 1.11 on Ubuntu 16.04.
15. We make this assumption so that we can compare our results to those obtained with different methods in the literature.
But our approach works for any asset dynamics as long as they can efficiently be simulated.

for initial values $s^i_0 \in (0, \infty)$, a risk-free interest rate $r \in \mathbb{R}$, dividend yields $\delta_i \in [0, \infty)$, volatilities $\sigma_i \in (0, \infty)$ and a $d$-dimensional Brownian motion $W$ with constant instantaneous correlations[16] $\rho_{ij} \in \mathbb{R}$ between different components $W^i$ and $W^j$. A Bermudan max-call option on $S^1, S^2, \dots, S^d$ has payoff $(\max_{1 \le i \le d} S^i_t - K)^+$ and can be exercised at any point of a time grid $0 = t_0 < t_1 < \cdots < t_N$. Its price is given by

$$\sup_{\tau} \mathbb{E}\Bigl[e^{-r\tau} \bigl(\max_{1 \le i \le d} S^i_\tau - K\bigr)^+\Bigr],$$

where the supremum is over all $S$-stopping times taking values in $\{t_0, t_1, \dots, t_N\}$ (see, e.g., Schweizer, 2002). Denote $X^i_n = S^i_{t_n}$, $n = 0, 1, \dots, N$, and let $\mathcal{T}$ be the set of $X$-stopping times. Then the price can be written as $\sup_{\tau \in \mathcal{T}} \mathbb{E}\, g(\tau, X_\tau)$ for

$$g(n, x) = e^{-r t_n} \bigl(\max_{1 \le i \le d} x_i - K\bigr)^+,$$

and it is straightforward to simulate $(X_n)_{n=0}^N$. In the following we assume the time grid to be of the form $t_n = nT/N$, $n = 0, 1, \dots, N$, for a maturity $T > 0$ and $N + 1$ equidistant exercise dates. Even though $g(n, X_n)$ does not carry any information that is not already contained in $X_n$, our method worked more efficiently when we trained the optimal stopping decisions on Monte Carlo simulations of the $(d+1)$-dimensional Markov process $(Y_n)_{n=0}^N = (X_n, g(n, X_n))_{n=0}^N$ instead of $(X_n)_{n=0}^N$. Since $Y_0$ is deterministic, we first trained stopping times $\tau_1 \in \mathcal{T}_1$ of the form

$$\tau_1 = \sum_{n=1}^{N} n\, f^{\theta_n}(Y_n) \prod_{j=1}^{n-1} (1 - f^{\theta_j}(Y_j))$$

for $f^{\theta_N} \equiv 1$ and $f^{\theta_1}, \dots, f^{\theta_{N-1}}\colon \mathbb{R}^{d+1} \to \{0, 1\}$ given by (8) with $I = 3$ and $q_1 = q_2 = d + 40$. Then we determined our candidate optimal stopping times as

$$\tau^\Theta = \begin{cases} 0 & \text{if } f_0 = 1 \\ \tau_1 & \text{if } f_0 = 0 \end{cases}$$

for a constant $f_0 \in \{0, 1\}$ depending[17] on whether it was optimal to stop immediately at time 0 or not (see Remark 6 above). It is straightforward to simulate from model (20). We conducted $3{,}000 + d$ training steps, in each of which we generated a batch of 8,192 paths of $(X_n)_{n=0}^N$. To estimate the lower bound $L$ we simulated $K_L = 4{,}096{,}000$ trial paths.
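Simulation from model (20) reduces to sampling correlated Gaussian increments and exponentiating; a minimal NumPy sketch for the equicorrelated case $\rho_{ij} = \rho$ (function name and parameter values illustrative, not the paper's implementation):

```python
import numpy as np

def simulate_paths(K, d, N, T, s0, r, delta, sigma, rho, seed=0):
    """Simulate K paths of model (20) on the grid t_n = nT/N:
    S^i_{t_n} = s0_i * exp([r - delta_i - sigma_i^2/2] t_n + sigma_i W^i_{t_n}),
    with corr(W^i, W^j) = rho for i != j. Returns shape (K, N+1, d)."""
    rng = np.random.default_rng(seed)
    dt = T / N
    # covariance of the d Brownian increments over one time step
    cov = dt * (rho * np.ones((d, d)) + (1.0 - rho) * np.eye(d))
    dW = rng.multivariate_normal(np.zeros(d), cov, size=(K, N))   # (K, N, d)
    W = np.concatenate([np.zeros((K, 1, d)), np.cumsum(dW, axis=1)], axis=1)
    t = np.linspace(0.0, T, N + 1)[None, :, None]                 # (1, N+1, 1)
    drift = (r - delta - 0.5 * sigma**2)[None, None, :]           # (1, 1, d)
    return s0[None, None, :] * np.exp(drift * t + sigma[None, None, :] * W)
```

The reward process $g(n, X_n)$ of the max-call is then a one-line reduction over the last axis, so paths of the augmented process $(Y_n)$ come at essentially no extra cost.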
For our estimate of the upper bound $U$, we produced $K_U = 1{,}024$ paths $(z^k_n)_{n=0}^N$, $k = 1, \dots, K_U$, of $(X_n)_{n=0}^N$ and $K_U J$ realizations $(v^{k,j}_n)_{n=1}^N$, $k = 1, \dots, K_U$, $j = 1, \dots, J$, of $(W_{t_n} - W_{t_{n-1}})_{n=1}^N$ with $J = 16{,}384$. Then for all $n$ and $k$, we generated the $i$-th component of the $j$-th continuation path departing from $z^k_n$ according to

$$\tilde z^{i,k,j}_m = z^{i,k}_n \exp\bigl([r - \delta_i - \sigma_i^2/2](m - n)\, T/N + \sigma_i [v^{i,k,j}_{n+1} + \cdots + v^{i,k,j}_m]\bigr), \quad m = n + 1, \dots, N.$$

16. That is, $\mathbb{E}[(W^i_t - W^i_s)(W^j_t - W^j_s)] = \rho_{ij}(t - s)$ for all $i \ne j$ and $s < t$.
17. In fact, in none of the examples in this paper is it optimal to stop at time 0. So $\tau^\Theta = \tau_1$ in all these cases.

Symmetric case. We first considered the special case where $s^i_0 = s_0$, $\delta_i = \delta$, $\sigma_i = \sigma$ for all $i = 1, \dots, d$, and $\rho_{ij} = \rho$ for all $i \ne j$. Our results are reported in Table 1.

Asymmetric case. As a second example, we studied model (20) with $s^i_0 = s_0$, $\delta_i = \delta$ for all $i = 1, 2, \dots, d$, and $\rho_{ij} = \rho$ for all $i \ne j$, but different volatilities $\sigma_1 < \sigma_2 < \cdots < \sigma_d$. For $d \le 5$, we chose the specification $\sigma_i = 0.08 + 0.32\,(i-1)/(d-1)$, $i = 1, 2, \dots, d$. For $d > 5$, we set $\sigma_i = 0.1 + i/(2d)$, $i = 1, 2, \dots, d$. The results are given in Table 2.

### 4.2. Callable Multi Barrier Reverse Convertibles

An MBRC is a coupon-paying security that converts into shares of the worst-performing of $d$ underlying assets if a prespecified trigger event occurs. Let us assume that the price of the $i$-th underlying asset in percent of its starting value follows the risk-neutral dynamics

$$S^i_t = \begin{cases} 100 \exp\bigl([r - \sigma_i^2/2]\, t + \sigma_i W^i_t\bigr) & \text{for } t \in [0, T_i) \\ 100\, (1 - \delta_i) \exp\bigl([r - \sigma_i^2/2]\, t + \sigma_i W^i_t\bigr) & \text{for } t \in [T_i, T] \end{cases} \tag{21}$$

for a risk-free interest rate $r \in \mathbb{R}$, volatilities $\sigma_i \in (0, \infty)$, maturity $T \in (0, \infty)$, dividend payment times $T_i \in (0, T)$, dividend rates $\delta_i \in [0, \infty)$ and a $d$-dimensional Brownian motion $W$ with constant instantaneous correlations $\rho_{ij} \in \mathbb{R}$ between different components $W^i$ and $W^j$. Let us consider an MBRC that pays a coupon $c$ at each of $N$ time points $t_n = nT/N$, $n = 1, 2, \dots, N$, and makes a time-$T$ payment of

$$G = \begin{cases} F & \text{if } \min_{1 \le i \le d} \min_{1 \le m \le M} S^i_{u_m} > B \ \text{ or } \ \min_{1 \le i \le d} S^i_T > K \\ \min_{1 \le i \le d} S^i_T & \text{if } \min_{1 \le i \le d} \min_{1 \le m \le M} S^i_{u_m} \le B \ \text{ and } \ \min_{1 \le i \le d} S^i_T \le K, \end{cases}$$

where $F \in [0, \infty)$ is the nominal amount, $B \in [0, \infty)$ a barrier, $K \in [0, \infty)$ a strike price and $u_m$ the end of the $m$-th trading day. Its value is

$$\sum_{n=1}^{N} e^{-r t_n} c + e^{-r T}\, \mathbb{E}\, G \tag{22}$$

and can easily be estimated with a standard Monte Carlo approximation. A callable MBRC can be redeemed by the issuer at any of the times $t_1, t_2, \dots, t_{N-1}$ by paying back the notional. To minimize costs, the issuer will try to find a $\{t_1, t_2, \dots, T\}$-valued stopping time $\tau$ minimizing the expected discounted payments

$$\mathbb{E}\Bigl[\sum_{n=1}^{N} e^{-r t_n} c\, 1_{\{t_n \le \tau\}} + e^{-r\tau} F\, 1_{\{\tau < T\}} + e^{-r T} G\, 1_{\{\tau = T\}}\Bigr], \tag{23}$$

where, as above, the time-$T$ payment $G$ equals $F$ if no barrier event has occurred or $\min_{1 \le i \le d} S^i_T > K$, and $\min_{1 \le i \le d} S^i_T$ otherwise. Since the issuer cannot redeem at time 0, we trained stopping times of the form

$$\tau_1 = \sum_{n=1}^{N} n\, f^{\theta_n}(Y_n) \prod_{j=1}^{n-1} (1 - f^{\theta_j}(Y_j)) \in \mathcal{T}_1$$

for $f^{\theta_N} \equiv 1$ and $f^{\theta_1}, \dots, f^{\theta_{N-1}}\colon \mathbb{R}^{d+1} \to \{0, 1\}$ given by (8) with $I = 3$ and $q_1 = q_2 = d + 40$, where $(Y_n)$ is the $(d+1)$-dimensional Markov process consisting of $X_n$ together with the barrier information. Since (23) is a minimization problem, $\tau^\Theta$ yields an upper bound and the dual method a lower bound. We simulated the model (21) like (20) in Subsection 4.1 with the same number of trials except that here we used the lower number $J = 1{,}024$ to estimate the dual bound. Numerical results are reported in Table 3.

### 4.3. Optimally Stopping a Fractional Brownian Motion

A fractional Brownian motion with Hurst parameter $H \in (0, 1]$ is a continuous centered Gaussian process $(W^H_t)_{t \ge 0}$ with covariance structure

$$\mathbb{E}\bigl[W^H_t W^H_s\bigr] = \tfrac{1}{2}\bigl(t^{2H} + s^{2H} - |t - s|^{2H}\bigr);$$

see, e.g., Mandelbrot and Van Ness (1968) or Samorodnitsky and Taqqu (1994). For $H = 1/2$, $W^H$ is a standard Brownian motion. So, by the optional stopping theorem, one has $\mathbb{E}\, W^{1/2}_\tau = 0$ for every $W^{1/2}$-stopping time $\tau$ bounded above by a constant; see, e.g., Grimmett and Stirzaker (2001). However, for $H \ne 1/2$, the increments of $W^H$ are correlated: positively for $H \in (1/2, 1]$ and negatively for $H \in (0, 1/2)$.
In both cases, $W^H$ is neither a martingale nor a Markov process, and there exist bounded $W^H$-stopping times $\tau$ such that $\mathbb{E}\, W^H_\tau > 0$; see, e.g., Kulikov and Gusyatnikov (2016) for two classes of simple stopping rules $0 \le \tau \le 1$ and estimates of the corresponding expected values $\mathbb{E}\, W^H_\tau$.

| $d$ | $\rho$ | $\hat{L}$ | $t_L$ | $\hat{U}$ | $t_U$ | Point est. | 95% CI | Non-callable |
|---|---|---|---|---|---|---|---|---|
| 2 | 0.6 | 98.235 | 24.9 | 98.252 | 204.1 | 98.243 | [98.213, 98.263] | 106.285 |
| 2 | 0.1 | 97.634 | 24.9 | 97.634 | 198.8 | 97.634 | [97.609, 97.646] | 106.112 |
| 3 | 0.6 | 96.930 | 26.0 | 96.936 | 212.9 | 96.933 | [96.906, 96.948] | 105.994 |
| 3 | 0.1 | 95.244 | 26.2 | 95.244 | 211.4 | 95.244 | [95.216, 95.258] | 105.553 |
| 5 | 0.6 | 94.865 | 41.0 | 94.880 | 239.2 | 94.872 | [94.837, 94.894] | 105.530 |
| 5 | 0.1 | 90.807 | 41.1 | 90.812 | 238.4 | 90.810 | [90.775, 90.828] | 104.496 |
| 10 | 0.6 | 91.568 | 71.3 | 91.629 | 300.9 | 91.599 | [91.536, 91.645] | 104.772 |
| 10 | 0.1 | 83.110 | 71.7 | 83.137 | 301.8 | 83.123 | [83.078, 83.153] | 102.495 |
| 15 | 0.6 | 89.558 | 94.9 | 89.653 | 359.8 | 89.606 | [89.521, 89.670] | 104.279 |
| 15 | 0.1 | 78.495 | 94.7 | 78.557 | 360.5 | 78.526 | [78.459, 78.571] | 101.209 |
| 30 | 0.6 | 86.089 | 158.5 | 86.163 | 534.1 | 86.126 | [86.041, 86.180] | 103.385 |
| 30 | 0.1 | 72.037 | 159.3 | 72.749 | 535.6 | 72.393 | [71.830, 72.760] | 99.279 |

Table 3: Summary results for callable MBRCs with $d$ underlying assets for $F = K = 100$, $B = 70$, $T = 1$ year (= 252 trading days), $N = 12$, $c = 7/12$, $\delta_i = 5\%$, $T_i = 1/2$, $r = 0$, $\sigma_i = 0.2$ and $\rho_{ij} = \rho$ for $i \ne j$. $t_U$ is the number of seconds it took to train $\tau^\Theta$ and compute $\hat{U}$. $t_L$ is the number of seconds it took to compute $\hat{L}$. The last column lists fair values of the same MBRCs without the callable feature. We estimated them by averaging 4,096,000 Monte Carlo samples of the payoff. This took between 5 (for $d = 2$) and 44 (for $d = 30$) seconds.

To approximate the supremum

$\sup_{0 \le \tau \le 1} \mathbb{E}\, W^H_\tau$  (24)

over all $W^H$-stopping times $0 \le \tau \le 1$, we denote $t_n = n/100$, $n = 0, 1, 2, \dots, 100$, and introduce the 100-dimensional Markov process $(X_n)_{n=0}^{100}$ given by

$X_0 = (0, 0, \dots, 0), \quad X_1 = (W^H_{t_1}, 0, \dots, 0), \quad X_2 = (W^H_{t_2}, W^H_{t_1}, 0, \dots, 0), \quad \dots, \quad X_{100} = (W^H_{t_{100}}, W^H_{t_{99}}, \dots, W^H_{t_1}).$
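The covariance structure of $W^H$ on the grid $t_1, \dots, t_{100}$ is all that is needed to simulate the discretized process. A minimal sketch (the function name is ours, not the paper's) that builds this covariance matrix and can be checked against the Brownian case $H = 1/2$, where $\mathbb{E}[W_t W_s] = \min(s, t)$:

```python
import numpy as np

def fbm_covariance(times, H):
    """Covariance matrix E[W^H_t W^H_s] = (t^{2H} + s^{2H} - |t-s|^{2H}) / 2
    of a fractional Brownian motion on the given time grid."""
    t = np.asarray(times, dtype=float)
    s, u = np.meshgrid(t, t, indexing="ij")
    return 0.5 * (s ** (2 * H) + u ** (2 * H) - np.abs(s - u) ** (2 * H))
```

For $H = 1/2$ this reduces to the Brownian covariance $\min(s, t)$, and for $H = 1$ to the rank-one matrix $(ts)_{s,t}$, reflecting $W^1_t = t\, W^1_1$.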
The discretized stopping problem

$\sup_{\tau \in \mathcal{T}} \mathbb{E}\, g(X_\tau),$  (25)

where $\mathcal{T}$ is the set of all $X$-stopping times and $g \colon \mathbb{R}^{100} \to \mathbb{R}$ the projection $(x^1, \dots, x^{100}) \mapsto x^1$, approximates (24) from below. We computed estimates of (25) for $H \in \{0.01, 0.05, 0.1, 0.15, \dots, 1\}$ by training networks of the form (8) with depth $I = 2$, $d = 100$ and $q_1 = q_2 = 140$. To simulate the vector $Y = (W^H_{t_n})_{n=1}^{100}$, we used the representation $Y = BZ$, where $B$ is the Cholesky factor of the covariance matrix of $Y$ (so that $BB^T$ equals that covariance matrix) and $Z$ is a 100-dimensional random vector with independent standard normal components. We carried out 6,000 training steps with a batch size of 2,048. To estimate the lower bound $L$ we generated $K_L = 4{,}096{,}000$ simulations of $Z$. For our estimate of the upper bound $U$, we first simulated $K_U = 1{,}024$ realizations $v^k$, $k = 1, \dots, K_U$, of $Z$ and set $w^k = Bv^k$. Then we produced another $K_U J$ simulations $v^{k,j}$, $k = 1, \dots, K_U$, $j = 1, \dots, J$, of $Z$, and generated for all $n$ and $k$ continuation paths starting from $z^k_n = (w^k_n, \dots, w^k_1, 0, \dots, 0)$ according to

$z^{k,j}_m = \big(\tilde{w}^{k,j}_m, \dots, \tilde{w}^{k,j}_{n+1}, w^k_n, \dots, w^k_1, 0, \dots, 0\big), \quad m = n+1, \dots, 100,$

where

$\tilde{w}^{k,j}_l = \sum_{i=1}^{n} B_{li}\, v^k_i + \sum_{i=n+1}^{l} B_{li}\, v^{k,j}_i, \quad l = n+1, \dots, m.$

For $H \in \{0.01, \dots, 0.4\} \cup \{0.6, \dots, 1.0\}$, we chose $J = 16{,}384$, and for $H \in \{0.45, 0.5, 0.55\}$, $J = 32{,}768$. The results are listed in Table 4 and depicted in graphical form in Figure 1. Note that for $H = 1/2$ and $H = 1$, our 95% confidence intervals contain the true values, which in these two cases can be calculated exactly. As mentioned above, $W^{1/2}$ is a Brownian motion, and therefore $\mathbb{E}\, W^{1/2}_\tau = 0$ for every $(W^{1/2}_{t_n})_{n=0}^{100}$-stopping time $\tau$. On the other hand, one has$^{18}$ $W^1_t = t\, W^1_1$, $t \ge 0$. So, in this case, the optimal stopping time is given$^{18}$ by

$\tau = \begin{cases} 1 & \text{if } W^1_{t_1} > 0 \\ t_1 & \text{if } W^1_{t_1} \le 0, \end{cases}$

and the corresponding expectation by

$\mathbb{E}\, W^1_\tau = \mathbb{E}\Big[W^1_1\, 1_{\{W^1_{t_1} > 0\}} + W^1_{t_1}\, 1_{\{W^1_{t_1} \le 0\}}\Big] = 0.99\, \mathbb{E}\Big[W^1_1\, 1_{\{W^1_1 > 0\}}\Big] = \frac{0.99}{\sqrt{2\pi}} = 0.39495\ldots$
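The sampling step $Y = BZ$ can be sketched as follows (a minimal illustration with made-up default arguments, not the exact implementation used for the experiments):

```python
import numpy as np

def sample_fbm_paths(H, n_steps=100, n_paths=1000, seed=0):
    """Draw paths Y = (W^H_{t_1}, ..., W^H_{t_n}) as Y = B Z, where B is the
    Cholesky factor of the covariance matrix of Y and Z is standard normal."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, n_steps + 1) / n_steps
    s, u = np.meshgrid(t, t, indexing="ij")
    cov = 0.5 * (s ** (2 * H) + u ** (2 * H) - np.abs(s - u) ** (2 * H))
    B = np.linalg.cholesky(cov)                 # B @ B.T equals cov
    Z = rng.standard_normal((n_paths, n_steps))
    return Z @ B.T                              # each row is one discretized path
```

For $H = 1/2$ this reproduces a discretized standard Brownian motion, so the terminal values have variance $t_{100} = 1$, which gives a quick sanity check of the factorization.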
Moreover, it can be seen that for $H \in (1/2, 1)$, our estimates are up to three times higher than the expected payoffs generated by the heuristic stopping rules of Kulikov and Gusyatnikov (2016). For $H \in (0, 1/2)$, they are up to five times higher.

18. Up to P-almost sure equality.

| $H$ | $\hat{L}$ | $\hat{U}$ | Point est. | 95% CI |
|---|---|---|---|---|
| 0.01 | 1.518 | 1.519 | 1.519 | [1.517, 1.520] |
| 0.05 | 1.293 | 1.293 | 1.293 | [1.292, 1.294] |
| 0.10 | 1.048 | 1.049 | 1.049 | [1.048, 1.050] |
| 0.15 | 0.838 | 0.839 | 0.839 | [0.838, 0.840] |
| 0.20 | 0.658 | 0.659 | 0.658 | [0.657, 0.659] |
| 0.25 | 0.501 | 0.504 | 0.503 | [0.501, 0.505] |
| 0.30 | 0.369 | 0.370 | 0.370 | [0.368, 0.371] |
| 0.35 | 0.255 | 0.256 | 0.255 | [0.254, 0.257] |
| 0.40 | 0.155 | 0.158 | 0.156 | [0.154, 0.158] |
| 0.45 | 0.067 | 0.075 | 0.071 | [0.066, 0.075] |
| 0.50 | 0.000 | 0.005 | 0.002 | [0.000, 0.005] |
| 0.55 | 0.057 | 0.065 | 0.061 | [0.057, 0.065] |
| 0.60 | 0.115 | 0.118 | 0.117 | [0.115, 0.119] |
| 0.65 | 0.163 | 0.165 | 0.164 | [0.163, 0.166] |
| 0.70 | 0.206 | 0.207 | 0.207 | [0.205, 0.208] |
| 0.75 | 0.242 | 0.245 | 0.244 | [0.242, 0.245] |
| 0.80 | 0.276 | 0.278 | 0.277 | [0.276, 0.279] |
| 0.85 | 0.308 | 0.309 | 0.308 | [0.307, 0.310] |
| 0.90 | 0.336 | 0.339 | 0.337 | [0.335, 0.339] |
| 0.95 | 0.365 | 0.367 | 0.366 | [0.365, 0.367] |
| 1.00 | 0.395 | 0.395 | 0.395 | [0.394, 0.395] |

Table 4: Estimates of $\sup_{\tau \in \{0, t_1, \dots, 1\}} \mathbb{E}\, W^H_\tau$. For all $H \in \{0.01, 0.05, \dots, 1\}$, it took about 430 seconds to train $\tau^\Theta$ and compute $\hat{L}$. The computation of $\hat{U}$ took about 17,000 seconds for $H \in \{0.01, \dots, 0.4\} \cup \{0.6, \dots, 1\}$ and about 34,000 seconds for $H \in \{0.45, 0.5, 0.55\}$.

Figure 1: Estimates of $\sup_{\tau \in \{0, t_1, \dots, 1\}} \mathbb{E}\, W^H_\tau$ for different values of $H$.

Acknowledgments

We thank Philippe Ehlers, Ariel Neufeld, Martin Stefanik, the action editor and the referees for fruitful discussions and helpful comments.

References

Charalambos D. Aliprantis and Kim C. Border. Infinite Dimensional Analysis. Springer, Berlin, 3rd edition, 2006.

Leif Andersen. A simple approach to the pricing of Bermudan swaptions in the multifactor LIBOR market model. The Journal of Computational Finance, 3(2):5–32, 2000.

Leif Andersen and Mark Broadie.
Primal-dual simulation algorithm for pricing multidimensional American options. Management Science, 50(9):1222–1234, 2004.

Vlad Bally, Gilles Pagès, and Jacques Printems. A quantization tree method for pricing and hedging multidimensional American options. Mathematical Finance, 15(1):119–168, 2005.

Jérôme Barraquand and Didier Martineau. Numerical valuation of high dimensional multivariate American securities. The Journal of Financial and Quantitative Analysis, 30(3):383–405, 1995.

Denis Belomestny. On the rates of convergence of simulation-based optimization algorithms for optimal stopping problems. The Annals of Applied Probability, 21(1):215–239, 2011.

Denis Belomestny. Solving optimal stopping problems via empirical dual optimization. The Annals of Applied Probability, 23(5):1988–2019, 2013.

Denis Belomestny, Christian Bender, and John Schoenmakers. True upper bounds for Bermudan products via non-nested Monte Carlo. Mathematical Finance, 19(1):53–71, 2009.

Denis Belomestny, Mark Joshi, and John Schoenmakers. Multilevel dual approach for pricing American style derivatives. Finance and Stochastics, 17(4):717–742, 2013.

Denis Belomestny, John Schoenmakers, Vladimir Spokoiny, and Yuri Tavyrikov. Optimal stopping via reinforced regression. arXiv:1808.02341, 2018.

Steffan J. Berridge and Johannes M. Schumacher. An irregular grid approach for pricing high-dimensional American options. Journal of Computational and Applied Mathematics, 222(1):94–111, 2008.

Phelim P. Boyle, Adam W. Kolkiewicz, and Ken Seng Tan. An improved simulation method for pricing high-dimensional American derivatives. Mathematics and Computers in Simulation, 62:315–322, 2003.

Mark Broadie and Menghui Cao. Improved lower and upper bound algorithms for pricing American options by simulation. Quantitative Finance, 8(8):845–861, 2008.

Mark Broadie and Paul Glasserman. A stochastic mesh method for pricing high-dimensional American options. The Journal of Computational Finance, 7(4):35–72, 2004.

Jacques F. Carriere. Valuation of the early-exercise price for options using simulations and nonparametric regression. Insurance: Mathematics and Economics, 19(1):19–30, 1996.

Nan Chen and Paul Glasserman. Additive and multiplicative duals for American option pricing. Finance and Stochastics, 11(2):153–179, 2007.

Mark H. A. Davis and Ioannis Karatzas. A deterministic approach to optimal stopping. In Probability, Statistics and Optimisation: A Tribute to Peter Whittle (ed. Frank P. Kelly), pages 455–466. John Wiley & Sons, New York, 1994.

Vijay V. Desai, Vivek F. Farias, and Ciamac C. Moallemi. Pathwise optimization for optimal stopping problems. Management Science, 58(12):2292–2308, 2012.

Daniel Egloff, Michael Kohler, and Nebojsa Todorovic. A dynamic look-ahead Monte Carlo algorithm for pricing Bermudan options. The Annals of Applied Probability, 17(4):1138–1171, 2007.

Diego García. Convergence and biases of Monte Carlo estimates of American option prices using a parametric exercise rule. Journal of Economic Dynamics and Control, 27(10):1855–1879, 2003.

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR, 9:249–256, 2010.

Geoffrey Grimmett and David Stirzaker. Probability and Random Processes. Oxford University Press, 3rd edition, 2001.

Jiequn Han and Weinan E. Deep learning approximation for stochastic control problems. Deep Reinforcement Learning Workshop, NIPS, 2016.

Martin B. Haugh and Leonid Kogan. Pricing American options: a duality approach. Operations Research, 52(2):258–270, 2004.

Sergey Ioffe and Christian Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, PMLR, 37:448–456, 2015.
Shashi Jain and Cornelis W. Oosterlee. The stochastic grid bundling method: efficient pricing of Bermudan options and their Greeks. Applied Mathematics and Computation, 269:412–431, 2015.

Farshid Jamshidian. The duality of optimal exercise and domineering claims: a Doob–Meyer decomposition approach to the Snell envelope. Stochastics, 79:27–60, 2007.

Diederik P. Kingma and Jimmy Ba. Adam: a method for stochastic optimization. International Conference on Learning Representations, 2015.

Michael Kohler, Adam Krzyżak, and Nebojsa Todorovic. Pricing of high-dimensional American options by neural networks. Mathematical Finance, 20(3):383–410, 2010.

Anastasia Kolodko and John Schoenmakers. Iterative construction of the optimal Bermudan stopping time. Finance and Stochastics, 10(1):27–49, 2006.

Alexander V. Kulikov and Pavel P. Gusyatnikov. Stopping times for fractional Brownian motion. In Computational Management Science, volume 682 of Lecture Notes in Economics and Mathematical Systems, pages 195–200. Springer International Publishing, 2016.

Damien Lamberton and Bernard Lapeyre. Introduction to Stochastic Calculus Applied to Finance. Chapman & Hall/CRC, 2nd edition, 2008.

Jérôme Lelong. Pricing American options using martingale bases. arXiv:1604.03317, 2016.

Moshe Leshno, Vladimir Ya. Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867, 1993.

Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. International Conference on Learning Representations, 2016.

Francis A. Longstaff and Eduardo S. Schwartz. Valuing American options by simulation: a simple least-squares approach. The Review of Financial Studies, 14(1):113–147, 2001.

Benoit B. Mandelbrot and John W. Van Ness. Fractional Brownian motions, fractional noises and applications. SIAM Review, 10(4):422–437, 1968.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu et al. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.

Goran Peskir and Albert N. Shiryaev. Optimal Stopping and Free-Boundary Problems. Lectures in Mathematics. Birkhäuser Basel, 2006.

Chris Rogers. Monte Carlo valuation of American options. Mathematical Finance, 12(3):271–286, 2002.

Chris Rogers. Dual valuation and hedging of Bermudan options. SIAM Journal on Financial Mathematics, 1(1):604–608, 2010.

Gennady Samoradnitsky and Murad S. Taqqu. Stable Non-Gaussian Random Processes. Chapman and Hall/CRC, 1994.

John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, PMLR, 37:1889–1897, 2015.

Martin Schweizer. On Bermudan options. In Advances in Finance and Stochastics, pages 257–270. Springer Berlin Heidelberg, 2002.

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–489, 2016.

Justin Sirignano and Konstantinos Spiliopoulos. DGM: a deep learning algorithm for solving partial differential equations. Journal of Computational Physics, 375:1339–1364, 2018.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning. The MIT Press, 1998.

James A. Tilley. Valuing American options in a path simulation model. Transactions of the Society of Actuaries, 45:83–104, 1993.

John N. Tsitsiklis and Benjamin Van Roy. Regression methods for pricing complex American-style options. IEEE Transactions on Neural Networks, 12(4):694–703, 2001.