# Stochastic Optimal Control Matching

Carles Domingo-Enrich (NYU & FAIR, Meta) cd2754@nyu.edu · Jiequn Han (Flatiron Institute) jhan@flatironinstitute.org · Brandon Amos (FAIR, Meta) bda@meta.com · Joan Bruna (NYU & Flatiron Institute) bruna@cims.nyu.edu · Ricky T. Q. Chen (FAIR, Meta) rtqichen@meta.com

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract: Stochastic optimal control, which has the goal of driving the behavior of noisy systems, is broadly applicable in science, engineering and artificial intelligence. Our work introduces Stochastic Optimal Control Matching (SOCM), a novel Iterative Diffusion Optimization (IDO) technique for stochastic optimal control that stems from the same philosophy as the conditional score matching loss for diffusion models. That is, the control is learned via a least squares problem by trying to fit a matching vector field. The training loss, which is closely connected to the cross-entropy loss, is optimized with respect to both the control function and a family of reparameterization matrices which appear in the matching vector field. The optimization with respect to the reparameterization matrices aims at minimizing the variance of the matching vector field. Experimentally, our algorithm achieves lower error than all the existing IDO techniques for stochastic optimal control for three out of four control problems, in some cases by an order of magnitude. The key idea underlying SOCM is the path-wise reparameterization trick, a novel technique that may be of independent interest.

1 Introduction

Stochastic optimal control aims to drive the behavior of a noisy system in order to minimize a given cost. It has myriad applications in science and engineering: examples include the simulation of rare events in molecular dynamics [37, 36, 85, 41], finance and economics [63, 25], stochastic filtering and data assimilation [58, 68], nonconvex optimization [19], sampling [9], power systems and energy markets [8, 66], and robotics [77, 32]. Stochastic optimal control has also been impactful in fields such as mean-field games [17], optimal transport [80, 81], backward stochastic differential equations (BSDEs) [14] and large deviations [24]. Recently, it has been the basis of algorithms to sample from unnormalized densities [84, 79, 9, 71].

For continuous-time problems with low-dimensional state spaces, the standard approach to learning the optimal control is to solve the Hamilton-Jacobi-Bellman (HJB) partial differential equation (PDE) by gridding the space and using classical numerical methods. For high-dimensional problems, a large number of works parameterize the control using a neural network and train it by applying a stochastic optimization algorithm to a loss function. These methods are known as Iterative Diffusion Optimization (IDO) techniques [59] (see Subsec. 2.2).

It is convenient to draw an analogy between stochastic optimal control and continuous normalizing flows (CNFs), a generative modeling technique in which samples are generated by solving an ordinary differential equation (ODE), initialized at a Gaussian sample, whose vector field has been learned. CNFs were introduced by [20] (building on top of Rezende and Mohamed [70]), and training them is similar to solving control problems because in both cases one needs to learn high-dimensional vector fields using neural networks, in continuous time.
The first algorithm developed to train normalizing flows was based on maximizing the likelihood of the generated samples [20, Sec. 4]. Obtaining the gradient of the maximum likelihood loss with respect to the vector field parameters requires backpropagating through the computation of the ODE trajectory, or equivalently, solving the adjoint ODE in parallel to the original ODE. Maximum likelihood CNFs (ML-CNFs) were superseded by diffusion models [75, 40, 76] and flow-matching, a.k.a. stochastic interpolant, methods [55, 1, 65, 2], which are currently the preferred algorithms to train CNFs. Aside from architectural improvements such as the UNet [73], a potential reason for the success of diffusion and flow matching models is that their functional landscape is convex, unlike for ML-CNFs. Namely, vector fields are learned by solving least squares regression problems where the goal is to fit a random matching vector field. Convex functional landscapes in combination with overparameterized models and moderate gradient variance can yield very stable training dynamics and help achieve low error. Returning to stochastic optimal control, one of the best-performing IDO techniques amounts to choosing the control objective (equation 1) as the training loss (see (12)). As in ML-CNFs, computing the gradient of this loss requires backpropagating through the computation of the trajectories of the SDE (2), or equivalently, using an adjoint method. The functional landscape of the loss is highly non-convex, and the method is prone to unstable training (see green curve in the bottom right plot of Figure 3). In light of this, a natural idea is to develop the analog of diffusion model losses for the stochastic optimal control problem, to obtain more stable training and lower error, and this is what we set out to do in our work. Our contributions are as follows: We introduce Stochastic Optimal Control Matching (SOCM), a novel IDO algorithm in which the control is learned by solving a least-squares regression problem where the goal is to fit a random matching vector field which depends on a family of reparameterization matrices that are also optimized. We derive a bias-variance decomposition of the SOCM loss (Prop. 2). The bias term is equal to an existing IDO loss: the cross-entropy loss, which shows that both algorithms have the same landscape in expectation. However, SOCM has an extra flexibility in the choice of reparameterization matrices, which affect only the variance. Hence, we propose optimizing the reparameterization matrices to reduce the variance of the SOCM objective. The key idea that underlies the SOCM algorithm is the path-wise reparameterization trick (Prop. 1), which is a novel technique for estimating gradients of an expectation of a functional of a random process with respect to its initial value. It is of independent interest and may be more generally applicable outside of the settings considered in this paper. We perform experiments on four different settings where we have access to the ground-truth control. For three of these, SOCM obtains a lower L2 error with respect to the ground-truth control than all the existing IDO techniques, with around 10 lower error than competing methods in some instances. 2 Framework 2.1 Setup and Preliminaries Let (Ω, F, (Ft)t 0, P) be a fixed filtered probability space on which is defined a Brownian motion B = (Bt)t 0. We consider the control-affine problem min u U E R T 0 1 2 u(Xu t , t) 2+f(Xu t , t) dt+g(Xu T ) , (1) s.t. 
d Xu t =(b(Xu t , t)+σ(t)u(Xu t , t)) dt+ λσ(t)d Bt, Xu 0 p0 (2) and where Xu t Rd is the state, u : Rd [0, T] Rd is the feedback control and belongs to the set of admissible controls U, f : Rd [0, T] R is the state cost, g : Rd R is the terminal cost, b : Rd [0, T] Rd is the base drift, and σ : [0, T] Rd d is the invertible diffusion coefficient and λ (0, + ) is the noise level. In App. A we formally define the set U of admissible controls and describe the regularity assumptions needed on the control functions. In the remainder of the section we introduce relevant concepts in stochastic optimal control; we provide the most relevant proofs in App. B and refer the reader to Oksendal [60, Chap. 11] and Nüsken and Richter [59, Sec. 2] for a similar, more extensive treatment. Cost functional and value function The cost functional for the control u, point x and time t is defined as J(u; x, t) := E R T t 1 2 us(Xu s ) 2 + fs(Xu s ) dt + g(Xu T ) Xu t = x . That is, the cost functional is the expected value of the control objective restricted to the times [t, T] with the initial value x at time t. The value function or optimal cost-to-go at (x, t) is defined as the minimum value of the cost functional across all possible controls: V (x, t) := infu U J(u; x, t). (3) Hamilton-Jacobi-Bellman equation and optimal control If we define the infinitesimal generator L := λ 2 Pd i,j=1(σσ )ij(t) xi xj + Pd i=1 bi(x, t) xi, the value function solves the following Hamilton-Jacobi-Bellman (HJB) partial differential equation: ( t + L)V (x, t) 1 2 (σ V )(x, t) 2 + f(x, t) = 0, V (x, T) = g(x). (4) The verification theorem [62, Sec. 2.3] states that if a function V solves the HJB equation above and has certain regularity conditions, then V is the value function (3) of the problem (1)-(2). An implication of the verification theorem is that for every u U, V (x, t) + E 1 2 R T t σ V + u 2(Xu s , s) ds Xu t = x = J(u, x, t). (5) In particular, this implies that the unique optimal control is given in terms of the value function as u (x, t) = σ(t) V (x, t). Equation (5) can be deduced by integrating the HJB equation (4) over [t, T], and taking the conditional expectation with respect to Xu t = x. We include the proof of (5) in App. B for completeness. A pair of forward and backward SDEs (FBSDEs) Consider the pair of SDEs d Xt = b(Xt, t) dt + λσ(t)d Bt, X0 p0, (6) d Yt = ( f(Xt, t) + 1 2 Zt 2) dt + λ Zt, d Bt , YT = g(XT ). (7) where Y : Ω [0, T] R and Z : Ω [0, T] Rd are progressively measurable 1 random processes. It turns out that Yt and Zt defined as Yt := V (Xt, t) and Zt := σ(t) V (Xt, t) = u (Xt, t) satisfy (7). We include the proof in App. B for completeness. An analytic expression for the value function From the forward-backward equations (6)-(7), one can derive a closed-form expression for the value function V : V (x, t)= λ log E exp λ 1 R T t f(Xs, s) ds λ 1g(XT ) Xt = x , (8) where Xt is the solution of the uncontrolled SDE (6). This is a classical result, but we still include its proof in App. B. Given that u (x, t) = σ(t) V (x, t), an immediate, yet important, consequence of (8) is the following path-integral representation of the optimal control: u (x, t)=λσ(t) x log E exp λ 1 R T t f(Xs, s) ds λ 1g(XT ) Xt = x . (9) Remark this equation involves the gradient of logarithm of a conditional expectation, which is reminiscent of the vector fields that are learned when training diffusion models. 
For example, the target vector field for variance-exploding score-based diffusion loss [76] can be expressed as x log pt(x) = x log EY pdata[ exp( x Y 2/(2σ2 t )) (2πσ2 t )d/2 ]. Note, however, that in (9) the gradient is taken with respect to the initial condition of the process, which requires the development of novel techniques. Conditioned diffusions Let C = C([0, T]; Rd) be the Wiener space of continuous functions from [0, T] to Rd equipped with the supremum norm, and let P(C) be the space of Borel probability measures over C. For each control u U, the controlled process in equation (2) induces a probability measure in P(C), as the law of the paths Xu t , which we refer to as Pu. We let P be the probability measure induced by the uncontrolled process (6), and define the work functional W(X, t) := R T t f(Xs, s) ds + g(XT ). (10) 1Being progressively measurable is a strictly stronger property than the notion of being a process adapted to the filtration Ft of Bt (see [50]). It turns out (Lemma 2 in App. B) that the Radon-Nikodym derivative d Pu d P satisfies d Pu d P (X) = exp λ 1 V (X0, 0) W(X, 0) . Also, a straight-forward application of the Girsanov theorem for SDEs (Cor. 1) shows that d Pu (Xu )=exp λ 1/2 R T 0 u (Xu t , t) u(Xu t , t), d Bt λ 1 2 R T 0 u (Xu t , t) u(Xu t , t) 2 dt , (11) which means that the only control u U such that Pu = Pu is the optimal control itself. 2.2 Existing approaches and related work Low-dimensional case: solving the HJB equation For low-dimensional control problems (d 3), it is possible to grid the domain and use a numerical PDE solver to find a solution to the HJB equation (4). The main approaches include finite difference methods [11, 57, 4], which approximate the derivatives and gradients of the value function using finite differences, finite element methods [47], which involve restricting the solution to domain-dependent function spaces, and semi-Lagrangian schemes [21, 13, 12], which trace back characteristics and have better stability than finite difference methods. See Greif [33] for an overview on these techniques, and Baˇnas et al. [4] for a comparison between them. Hutzenthaler et al. [44] introduced the multilevel Picard method, which leverages the Feynman-Kac and the Bismut-Elworthy-Li formulas to beat the curse of dimensionality in some settings [6, 46, 45, 43]. High dimensional methods leveraging FBSDEs The FBSDE formulation in equations (6)-(7) has given rise to multiple methods to learn controls. One such approach is least-squares Monte Carlo (see Pham [63, Chapter 3] and Gobet [28] for an introduction, and Gobet et al. [30], Zhang et al. [83] for an extensive analysis), where trajectories from the forward process (6) are sampled, and then regression problems are solved backwards in time to estimate the expected future cost in the spirit of dynamic programming. A second method that exploits FBSDEs was proposed by E et al. [22], Han et al. [35]. They parameterize the control using a neural network uθ, and use stochastic gradient algorithms to minimize the loss L(uθ, y0) = E[(YT (y0, uθ) g(XT ))2], where YT (y0, uθ) is the process in (7) with initial condition y0 and control uθ. This algorithm can be seen as a shooting method, where the initial condition and the control are learned to match the terminal condition. Multiple recent works have combined neural networks with FBSDE Monte Carlo methods for parabolic and elliptic PDEs [5, 18, 86], control [7, 39], multi-agent games [34, 15, 16]; see [23] for a more comprehensive review. 
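To make the path-integral representation (8)-(9) and the work functional (10) concrete, the following sketch estimates the value function by plain Monte Carlo over uncontrolled trajectories. This is a minimal sketch in Python/NumPy; the Ornstein-Uhlenbeck drift, the quadratic costs, and all constants are illustrative choices and not any benchmark setting from this paper. In high dimensions or for large costs this naive estimator has very high variance, which is one motivation for the iterative methods discussed next.

```python
import numpy as np

# Illustrative problem data (not the paper's benchmarks): an Ornstein-Uhlenbeck
# base drift with quadratic state and terminal costs, and sigma = identity.
d, T, lam = 2, 1.0, 1.0
b = lambda x, t: -x                      # base drift b(x, t)
f = lambda x, t: 0.5 * np.sum(x**2, -1)  # state cost f(x, t)
g = lambda x: 0.5 * np.sum(x**2, -1)     # terminal cost g(x)

def value_function_mc(x0, m=100_000, K=200, seed=0):
    """Monte Carlo estimate of V(x0, 0) = -lam * log E[exp(-W(X, 0)/lam)],
    where X solves the uncontrolled SDE dX = b dt + sqrt(lam) dB (Eq. (6))
    and W(X, 0) = int_0^T f(X_t, t) dt + g(X_T) is the work functional (10)."""
    rng = np.random.default_rng(seed)
    dt = T / K
    X = np.tile(x0, (m, 1))
    W = np.zeros(m)
    for k in range(K):
        t = k * dt
        W += f(X, t) * dt                          # left-point Riemann sum
        dB = rng.normal(size=(m, d)) * np.sqrt(dt)
        X = X + b(X, t) * dt + np.sqrt(lam) * dB   # Euler-Maruyama step
    W += g(X)
    return -lam * np.log(np.mean(np.exp(-W / lam)))

print(value_function_mc(np.ones(d)))
```

The optimal control (9) additionally requires the gradient of this conditional expectation with respect to the initial point; estimating that gradient efficiently is what the path-wise reparameterization trick of Sec. 3 provides.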
Many of the methods referenced above and some additional ones can be seen from a common perspective using controlled diffusions. As observed in equation (11), the key idea is that learning the optimal control is equivalent to finding a control u such that the induced probability measure Pu on paths is equal to the probability measure Pu for the optimal control. In the paragraphs below we cover several loss that fall into this framework. All the losses below can be optimized using a common algorithmic framework, which we describe in Algorithm 1. For more details, we refer the reader to Nüsken and Richter [59], which introduced this perspective and named such methods Iterative Diffusion Optimization (IDO) techniques. For simplicity, we introduce the losses for the setting in which the initial distribution p0 is concentrated at a single point xinit; we cover the general setting in App. B. The relative entropy loss and the adjoint method The relative entropy loss is defined as the Kullback-Leibler divergence between Pu and Pu : EPu[log d Pu d Pu ]. Upon removing constant terms and factors, this loss is equivalent to (see Lemma 3 in App. B): LAdj(u) := E R T 0 1 2 u(Xu t , t) 2+f(Xu t , t) dt+g(Xu T ) . (12) This is exactly the control objective in (1). This fact has been studied extensively [10, 31, 36, 48, 67]. Hence, the relative entropy loss is very natural and widely used; see Onken et al. [61], Zhang and Chen [84] for examples on multiagent systems and sampling. Solving optimization problems of the form (12) has a long history that dates back to Pontryagin [64]. Note that LAdj(u) depends on u both explicitly, and implicitly through the process Xu. To compute the gradient θ ˆLAdj(uθn) of a Monte Carlo approximation ˆLAdj(uθn) of LAdj(uθn) as required by Algorithm 1, we need to backpropagate through the simulation of the trajectories, which is why Algorithm 1 Iterative Diffusion Optimization (IDO) algorithms for stochastic optimal control Input: State cost f(x, t), terminal cost g(x), diffusion coeff. σ(t), base drift b(x, t), noise level λ, number of iterations N, batch size m, number of time steps K, initial control parameters θ0, loss L {LAdj(12), LCE(13), LVarv(16), Llog Varv(17), LMomv(18)} 1 for n {0, . . . , N 1} do 2 Simulate m trajectories of the process Xv controlled by v = uθn, e.g., using Euler-Maruyama updates 3 if L = LAdj then detach the m trajectories from the computational graph, so that gradients do not backpropagate; 4 Using the m trajectories, compute an m-sample Monte Carlo approximation ˆL(uθn) of the loss L(uθn) 5 Compute the gradients θ ˆL(uθn) of ˆL(uθn) w.r.t. θn 6 Obtain θn+1 with via an Adam update on θn (or another stochastic algorithm) Output: Learned control uθN we do not detach them from the computational graph. One can alternatively compute the gradient θ ˆLAdj(uθn) by explicitly solving an ODE, a technique known as the adjoint method. The adjoint method was introduced by Pontryagin [64], popularized in deep learning by Chen et al. [20], and further developed for SDEs in Li et al. [54]. The cross-entropy loss The cross-entropy loss is defined as the Kullback-Leibler divergence between Pu and Pu, i.e., flipping the order of the two measures: EPu [log d Pu d Pu ]. For an arbitrary v U, this loss is equivalent to the following one (see Prop. 3(i) in App. B): LCE(u) := E λ 1/2 R T 0 u(Xv t , t), d Bt λ 1 R T 0 u(Xv t , t), v(Xv t , t) dt+ λ 1 2 R T 0 u(Xv t , t) 2 dt exp λ 1W(Xv, 0) λ 1/2 R T 0 v(Xv t , t), d Bt λ 1 2 R T 0 v(Xv t , t) 2 dt . 
(13) The cross-entropy loss has a rich literature [38, 49, 74, 85] and has been recently used in applications such as molecular dynamics [41]. Furthermore, we note that the cross-entropy loss can be significantly simplified and written in terms of the unnormalized L2 error of the control u with respect to the optimal control u : LCE(u) = λ 1 2 E R T 0 u (Xu t , t) u(Xu t , t) 2 dt exp λ 1V (Xu 0 , 0) . (14) This characterization, which is proven in Prop. 3(ii) in App. B, is relevant for us because a similar one can be written for the loss that we propose (see Prop. 2). Variance and log-variance losses For an arbitrary v U, the variance and the log-variance losses are defined as LVarv(u) = Var Pv( d Pu d Pu ) and Llog Varv(u) = Var Pv(log d Pu d Pu ) whenever EPv| d Pu + and EPv| log d Pu d Pu | < + , respectively. Define Y u,v T = λ 1 R T 0 u(Xv t , t), v(Xv t , t) dt λ 1 R T 0 f(Xv t , t) dt λ 1/2 R T 0 u(Xv t , t), d Bt 2 R T 0 u(Xv t , t) 2 dt. Then, LVarv and Llog Varv are equivalent, respectively, to the following losses (see Lemma 4): LVarv(u) := Var exp Y u,v T λ 1g(Xv T ) , (16) Llog Varv(u) := Var Y u,v T λ 1g(Xv T ) , (17) The variance and log-variance losses were introduced by Nüsken and Richter [59]. Unlike for the cross-entropy loss, the choice of the control v does lead to different losses. When using LVarv or Llog Varv in Algorithm 1, the variance is computed across the m trajectories in each batch. Moment loss For an arbitrary v U, the moment loss is defined as LMomv(u, y0) = E[( Y u,v T + y0 λ 1g(Xv T ))2], (18) where Y u,v T is defined in (15). Note the similarity with the log-variance loss (17); the optimal value of y0 for a fixed u is y 0 = E[λ 1g(Xv T ) Y u,v T ], and plugging this into (18) yields exactly the log-variance loss. The moment loss was introduced by Hartmann et al. [39, Section III.B], and it is a generalization of the FBSDE method pioneered by E et al. [22], Han et al. [35] and referenced earlier in this subsection, which corresponds to setting v = 0. 3 Stochastic Optimal Control Matching In this section we present our loss, Stochastic Optimal Control Matching (SOCM). The corresponding method, which we describe in Algorithm 2, falls into the class of IDO techniques described in Subsec. 2.2. The general idea is to leverage the analytic expression of u in (9) to write a least squares loss for u, and the main challenge is to reexpress the gradient of a conditional expectation with respect to the initial condition of the process. We do that using a novel technique which introduces certain arbitrary matrix-valued functions Mt, that we also optimize. Theorem 1 (SOCM loss). For each t [0, T], let Mt : [t, T] Rd d be an arbitrary matrixvalued differentiable function such that Mt(t) = Id. Let v U be an arbitrary control. Let LSOCM : L2(Rd [0, T]; Rd) L2([0, T]2; Rd d) R be the loss function defined as LSOCM(u, M) := E 1 T R T 0 u(Xv t , t) w(t, v, Xv, B, Mt) 2 dt α(v, Xv, B) , (19) where Xv is the process controlled by v (i.e., d Xv t = (b(Xv t , t) + σ(t)v(Xv t , t)) dt + λσ(t) d Bt and Xv 0 p0), and w(t, v, Xv, B, Mt) = σ(t) R T t Mt(s) xf(Xv s , s) ds Mt(T) g(Xv T ) + R T t (Mt(s) xb(Xv s , s) s Mt(s))(σ 1) (s)v(Xv s , s) ds + λ1/2 R T t (Mt(s) xb(Xv s , s) s Mt(s))(σ 1) (s)d Bs , α(v, Xv, B) = exp λ 1 R T 0 f(Xv t , t) ds λ 1g(Xv T ) λ 1/2 R T 0 v(Xv t , t), d Bt λ 1 2 R T 0 v(Xv t , t) 2 dt . (20) LSOCM has a unique optimum (u , M ), where u is the optimal control. 
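To fix ideas, the following minimal sketch computes discretized versions of the two random objects appearing in Theorem 1, the matching vector field w and the importance weight α, together with a Monte Carlo estimate of LSOCM for a fixed candidate control (Python/NumPy). Simplifying assumptions, made only for readability and not part of the theorem: σ = I, v = 0 (so the trajectories are uncontrolled and the Girsanov terms in α vanish), Mt(s) = I (so ∂sMt = 0), an isotropic drift b(x, t) = -a x whose Jacobian is -a I, and an arbitrary linear candidate control u.

```python
import numpy as np

d, T, lam, a, K, m = 2, 1.0, 1.0, 1.0, 100, 4096
dt = T / K
rng = np.random.default_rng(0)

grad_f = lambda x, t: x          # f(x, t) = ||x||^2 / 2  (illustrative)
f      = lambda x, t: 0.5 * np.sum(x**2, -1)
grad_g = lambda x: x             # g(x) = ||x||^2 / 2     (illustrative)
g      = lambda x: 0.5 * np.sum(x**2, -1)
u      = lambda x, t: -0.5 * x   # candidate control to be scored

# Simulate m uncontrolled trajectories (v = 0): dX = -a X dt + sqrt(lam) dB.
X = np.zeros((K + 1, m, d)); X[0] = 1.0
dB = rng.normal(size=(K, m, d)) * np.sqrt(dt)
for k in range(K):
    X[k + 1] = X[k] - a * X[k] * dt + np.sqrt(lam) * dB[k]

# Importance weight alpha (Eq. (20)); with v = 0 the Girsanov terms vanish.
f_traj = np.stack([f(X[k], k * dt) for k in range(K)])         # (K, m)
alpha = np.exp(-(f_traj.sum(0) * dt + g(X[-1])) / lam)         # (m,)

# Matching vector field w (Thm. 1) with M_t(s) = I and sigma = I:
#   w(t) = -int_t^T grad_f ds - grad_g(X_T) + sqrt(lam) int_t^T (grad_x b) dB,
# evaluated on the grid via right-to-left cumulative sums.
gf = np.stack([grad_f(X[k], k * dt) for k in range(K)])        # (K, m, d)
int_gf = np.cumsum(gf[::-1] * dt, axis=0)[::-1]                # tail sums of grad_f
int_db = np.cumsum((-a) * dB[::-1], axis=0)[::-1]              # tail sums of (grad_x b) dB
w = -int_gf - grad_g(X[-1]) + np.sqrt(lam) * int_db            # (K, m, d)

# Monte Carlo SOCM objective: (1/T) sum_k ||u(X_k, t_k) - w_k||^2 dt, alpha-weighted.
u_traj = np.stack([u(X[k], k * dt) for k in range(K)])
sq = np.sum((u_traj - w) ** 2, axis=-1)                        # (K, m)
loss = np.mean(alpha * (sq.sum(0) * dt) / T)
print(loss)
```

In Algorithm 2 below, v is instead set to the current learned control and M is a learned parameterized family; the sketch above only illustrates the structure of the loss.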
We refer to M = (Mt)t [0,T ] as the family of reparametrization matrices, to the random vector field w as the matching vector field, and to α as the importance weight. We present a proof sketch of Thm. 1; the full proofs for all the results in this section are in App. C. Proof sketch of Thm. 1 Let X be the uncontrolled process (6). Consider the loss T R T 0 u(Xt, t) u (Xt, t) 2 dt exp λ 1 R T 0 f(Xt, t) dt λ 1g(XT ) T R T 0 u(Xt, t) 2 2 u(Xt, t), u (Xt, t) + u (Xt, t) 2 dt exp λ 1 R T 0 f(Xt, t) dt λ 1g(XT ) . Clearly, the only optimum of this loss is the optimal control u . Using the analytic expression of u in (9), the cross-term can be rewritten as (see Lemma 5 in App. C): T R T 0 u(Xt, t), u (Xt, t) dt exp λ 1 R T 0 f(Xt, t) dt λ 1g(XT ) T R T 0 u(Xt, t), σ(t) x E exp λ 1 R T t f(Xs, s) ds λ 1g(XT ) Xt = x exp λ 1 R t 0 f(Xs, s) ds dt . It remains to evaluate the conditional expectation x E exp λ 1 R T t f(Xs, s) ds λ 1g(XT ) Xt = x , which we do by a reparameterization trick that shifts the dependence on the initial value x into the stochastic processes here we introduce a free variable Mt and then applying Girsanov theorem. We coin this the path-wise reparameterization trick: Proposition 1 (Path-wise reparameterization trick for stochastic optimal control). For each t [0, T], let Mt : [t, T] Rd d be an arbitrary continuously differentiable function matrix-valued function such that Mt(t) = Id. We have that x E exp λ 1 R T t f(Xs, s) ds λ 1g(XT ) Xt = x = E λ 1 R T t Mt(s) xf(Xs, s) ds λ 1Mt(T) g(XT ) + λ 1/2 R T t (Mt(s) xb(Xs, s) s Mt(s))(σ 1) (s)d Bs exp λ 1 R T t f(Xs, s) ds λ 1g(XT ) Xt = x . We prove a more general form of this result (Prop. 4) in Subsec. C.2 and also provide an intuitive derivation in Subsec. C.3. In the proof of Prop. 4, the reparameterization matrices Mt arise as the gradients of a perturbation to the process Xt. Similar ideas can potentially be applied to derive losses for generative modeling. If we plug (23) into the right-hand side of (22), and then this back into (21), and we complete the square, we obtain that for some constant K independent of u, L(u) = E 1 T R T 0 u(Xt, t) + σ(t) R T t Mt(s) xf(Xs, s) ds + Mt(T) g(XT ) λ1/2 R T t (Mt(s) xb(Xs, s) s Mt(s))(σ 1) (s)d Bs 2 dt exp λ 1 R T 0 f(Xt, t) dt λ 1g(XT ) + K. If we perform a change of process from X to Xv applying the Girsanov theorem (Cor. 1 in App. C), we obtain the loss LSOCM(u, M). The following result clarifies the role of reparameterization matrices, connecting the SOCM and cross-entropy losses. Proposition 2 (Bias-variance decomposition of the SOCM loss). The SOCM loss decomposes into a bias term that only depends on u and a variance term that only depends on M: LSOCM(u, M) = Cond Var(w; M) | {z } Unnormalized expected conditional variance of w T R T 0 u(Xu t , t) u (Xu t , t) 2 dt e λ 1V (Xu 0 ,0) | {z } Unnormalized bias of u where Cond Var(w; M)=E 1 T R T 0 w(t, v, Xv, B, Mt) E[w(t,v,Xv,B,Mt)α(v,Xv,B)|Xv t ,t] E[α(v,Xv,B)|Xv t ,t] | {z } u (Xv t ,t) 2 dt α(v, Xv, B) . (25) Remark that the bias term in equation (24) is equal to the characterization of the cross-entropy loss in (14). In other words, the landscape of LSOCM(u, M) with respect to u is the landscape of the cross-entropy loss LCE(u). Thus, the SOCM loss can be seen as some form of variance reduction method for the cross-entropy loss, and performs substantially better experimentally (Sec. 4). 
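As a numerical sanity check of the path-wise reparameterization trick (Prop. 1), consider the simplest Gaussian example, where the gradient of the conditional expectation is available in closed form. The sketch below (Python/NumPy) uses illustrative assumptions that are not from the paper's experiments: d = 1, λ = 1, σ = 1, f ≡ 0, base drift b(x) = -a x, linear terminal cost g(x) = x, and Mt(s) = I. The Monte Carlo estimate produced by the trick should agree with the exact gradient up to sampling and time-discretization error.

```python
import numpy as np

a, T, x0, K, m = 1.0, 1.0, 0.3, 400, 200_000
dt = T / K
rng = np.random.default_rng(1)

# Simulate m paths of dX = -a X dt + dB from X_0 = x0 (lambda = sigma = 1).
X = np.full(m, x0)
stoch_int = np.zeros(m)   # int_0^T (M_t(s) grad_x b - d/ds M_t(s)) dB_s = -a * B_T here
for _ in range(K):
    dB = rng.normal(size=m) * np.sqrt(dt)
    stoch_int += -a * dB
    X = X - a * X * dt + dB

# Path-wise reparameterization trick (Prop. 1) with f = 0, g(x) = x, M = Id:
#   grad_x E[exp(-g(X_T)) | X_0 = x] = E[(-g'(X_T) + stoch_int) * exp(-g(X_T))].
trick = np.mean((-1.0 + stoch_int) * np.exp(-X))

# Closed form: X_T | X_0 = x ~ N(x e^{-aT}, (1 - e^{-2aT}) / (2a)), so
# E[exp(-X_T)] = exp(-x e^{-aT} + var/2) and its x-derivative is -e^{-aT} times that.
var = (1.0 - np.exp(-2 * a * T)) / (2 * a)
exact = -np.exp(-a * T) * np.exp(-x0 * np.exp(-a * T) + var / 2)

# The two values should agree up to Monte Carlo and time-discretization error.
print(trick, exact)
```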
Yet, the expressions of the SOCM loss and the cross-entropy loss are very different; the former is a least squares loss and is expressed in terms of the gradients of the costs. Algorithm 2 Stochastic Optimal Control Matching (SOCM) Input: State cost f(x, t), terminal cost g(x), diffusion coeff. σ(t), base drift b(x, t), noise level λ, number of iterations N, batch size m, number of time steps K, initial control parameters θ0, initial matrix parameters ω0, loss LSOCM in (19) 1 for n {0, . . . , N 1} do 2 Simulate m trajectories of the process Xv controlled by v = uθn, e.g., using Euler-Maruyama updates 3 Detach the m trajectories from the computational graph, so that gradients do not backpropagate 4 Using the m trajectories, compute an m-sample Monte-Carlo approximation ˆLSOCM(uθn, Mωn) of the loss LSOCM(uθn, Mωn) in (19) 5 Compute the gradients (θ,ω) ˆLSOCM(uθn, Mωn) of ˆLSOCM(uθn, Mωn) at (θn, ωn) 6 Obtain θn+1, ωn+1 with via an Adam update on θn, ωn, resp. Output: Learned control uθN For good training performance, it is critical that the gradients have high signal-to-noise ratio. Looking at the SOCM loss, a good proxy for low gradient variance is to have low variance for 1 T R T 0 u(Xv t , t) w(t, v, Xv, B, Mt) 2 dt α(v, Xv, B), and this holds when both α(v, Xv, B) and w(t, v, Xv, B, Mt) have low variance. Next, we present strategies to lower the variance of these two objects. Minimizing the variance of the importance weight α We want to use a vector field v such that Var[α(v, Xv, B)] is as low as possible. As shown by the following lemma, which is well-known in the literature, setting v to be the optimal control u actually achieves variance zero when we condition on the starting point of the controlled process Xv. The proof of this result can be found in Hartmann et al. [38], but we include it in Subsec. C.5 for completeness. Lemma 1. When we set v = u , the conditional variance Var[α(v, Xv, B)|Xv 0 = xinit] is zero for any xinit Rd. Of course, we do not have access to the optimal control u , but it is still a good idea to set v as the closest vector field to u that we have access to, which is typically the currently learned control. In some instances, one may benefit from using a warm-started control parameterized as u WS(x, t) + uθ(x, t), where the warm-start u WS is a reasonably good control obtained via a different strategy (see App. E). Minimizing the variance of the matching vector field w We are interested in finding the family M = (Mt)t [0,T ] that minimizes the variance of w(t, v, Xv, B, Mt) conditioned on t and Xt. Note that this is exactly the term Cond Var(w; M) in the right-hand side of equation (24). Since Cond Var(w; M) does not depend on the specific v, the optimal M does not depend on v either. And since the second term in the right-hand side of equation (24) does not depend on M = (Mt)t [0,T ], minimizing Cond Var(w; M) is equivalent to minimizing L(u) with respect to M. Parameterizing the matrices Mt vs solving for the optimal matrices In practice, we parameterize the matrices (Mt)t [0,T ] using a function Mω with two arguments (t, s). To enforce that Mω(t, t) = Id, we set Mω(t, s) = e γ(s t)Id+(1 e γ(s t)) M ω(t, s), where ω = (γ, ω), and M ω : R R Rd d is an unconstrained neural network. Alternatively, Thm. 4 in App. D shows that the optimal family M = (M t )t [0,T ] can be characterized as the solution of a linear equation in infinite dimensions (a Fredholm equation of the first kind). 
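Before turning to the discretized form of this linear system, here is a minimal sketch of the neural parameterization Mω described above. PyTorch is assumed; the MLP architecture and width, the positivity constraint on γ through an exponential, and the decaying form e^{-γ(s-t)} are illustrative choices rather than the exact architecture used in the experiments.

```python
import torch
import torch.nn as nn

class ReparamMatrix(nn.Module):
    """M_omega(t, s) = exp(-gamma (s - t)) I + (1 - exp(-gamma (s - t))) Mtilde(t, s),
    which satisfies M_omega(t, t) = I by construction. The two-layer MLP for the
    unconstrained part Mtilde and its width are illustrative."""

    def __init__(self, d, hidden=64):
        super().__init__()
        self.d = d
        self.log_gamma = nn.Parameter(torch.zeros(()))    # gamma = exp(log_gamma) > 0
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, d * d),
        )

    def forward(self, t, s):
        # t, s: tensors of shape (batch,); returns matrices of shape (batch, d, d).
        gamma = self.log_gamma.exp()
        decay = torch.exp(-gamma * (s - t)).view(-1, 1, 1)
        eye = torch.eye(self.d, device=t.device).expand(t.shape[0], -1, -1)
        m_tilde = self.net(torch.stack([t, s], dim=-1)).view(-1, self.d, self.d)
        return decay * eye + (1 - decay) * m_tilde

M = ReparamMatrix(d=2)
t = torch.zeros(5); s = torch.linspace(0.0, 1.0, 5)
print(M(t, s).shape)   # (5, 2, 2); at s = t the output is exactly the identity
```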
The discretized linear system has d2K equations and variables, K being the number of discretization time points. However, since the optimal M does not depend on v (see Remark 1), this is a computation that must be done only once and that may be affordable in some settings. We did not test this approach experimentally. 4 Experiments We consider four experimental settings that we adapt from Nüsken and Richter [59]: QUADRATIC ORNSTEIN UHLENBECK (EASY), QUADRATIC ORNSTEIN UHLENBECK (HARD), LINEAR ORNSTEIN UHLENBECK and DOUBLE WELL. We describe them in detail in App. F. For all of them, we have access to the ground-truth optimal control, which means that we are able to estimate the L2 error incurred by the learned control u. In Figure 2 we plot the control L2 error for each IDO algorithm described in Subsec. 2.2, and for the SOCM algorithm (Algorithm 2), for the QUADRATIC OU (EASY) and (HARD) settings. We also include two ablations of SOCM: (i) a version of SOCM where the reparameterization matrices Mt are set fixed to the identity I, (ii) SOCM-Adjoint, where we estimate the conditional expectation in equation (23) using the adjoint method for SDEs instead of the path-wise reparameterization trick (see Subsec. C.4). Code can be found at https://github.com/facebookresearch/SOC-matching. At the end of training, SOCM obtains the lowest L2 error, improving over all existing methods by a factor of around ten. The two SOCM ablations come in second and third by a substantial difference, which underlines the importance of the path-wise reparameterization trick. The best among existing methods is the adjoint method (the relative entropy loss). In Figure 2 (bottom) we show the squared norm of the gradient of each loss with respect to the parameters θ of the control: algorithms with small noise variance have low error values. In Figure 3, we plot the control L2 error for LINEAR ORNSTEIN UHLENBECK and DOUBLE WELL. For LINEAR OU, the error is around five times smaller for SOCM than for any existing method. For DOUBLE WELL, the SOCM algorithm achieves the third smallest error, slightly behind the variance loss and the adjoint method, but the latter shows instabilities. As we show in Figure 9 in App. F, these instabilities are inherent to the adjoint method and they do not disappear for small learning rates. Both in Figure 2 and Figure 3, we observe that learning the reparameterization matrices is critical to obtain gradient estimates with high signal-to-noise ratio. DOUBLE WELL is a particularly interesting and challenging setting because its solution is highly multimodal: g has 1024 modes. Multimodality is a feature observed in realistic settings, and is hard to handle because it involves learning the control correctly in each mode. The costs f and g and the base drift b for QUADRATIC OU (HARD) are five times those of QUADRATIC OU (EASY). Consequently, the factor α(v, Xv, B) initially has a much larger variance for the SOCM methods, and for cross-entropy. As training progresses, uθn gets closer to u , and consequently the variance of α(v, Xv, B) decreases, which in turn makes learning easier. This explains the initial slow decrease in the control error, followed by a fast drop that places SOCM well below existing algorithms. In App. E, we showcase a control warm-start strategy that can help and speed up convergence. We also present experimental results on two-mode Gaussian mixture sampling in increasing dimension, using the Path Integral Sampler [84]. 
We take Gaussians with means that are 2 units apart, and identity variance. Figure 1 shows control objective estimates obtained after running the Adjoint, SOCM, and Cross-entropy algorithms for 40000 iterations, at dimensions d = 2, 8, 16, 32, 64, and error bars show standard errors. By Theorem 4 of [84], we know that the optimal value of the control objective is zero; Figure 1 thus shows the suboptimality gap incurred by each algorithm. Cross-entropy, which uses the same importance weight as SOCM, performs worse than the other two losses for all dimensions, and its results are particularly poor for dimension 64, because the variance of α is too large for learning to happen. In this case, we see that SOCM has better variance reduction than cross-entropy, despite both using importance-weighted objectives for training. We observe that the values for SOCM are slightly below those of Adjoint for most dimensions, which confirms that our method is better for this range of dimensions. If we keep increasing the dimension, SOCM also fails due to the higher variance of α: for d = 128, the control objective estimates for the Adjoint, SOCM, and Cross-entropy losses are 0.146 ± 0.001, 7.49 ± 0.01, and 12.61 ± 0.02, respectively.

Figure 1: Control objective values for the different algorithms (Adjoint, SOCM, and Cross-entropy) across multiple dimensions, with error bars indicating the standard deviations. The y-axis is restricted to [0, 0.1] for better visibility of the lower range of values; cross-entropy takes the value 2.915 ± 0.008 at d = 64.

5 Conclusion

Our work introduces Stochastic Optimal Control Matching, a novel Iterative Diffusion Optimization technique for stochastic optimal control that stems from the same philosophy as the conditional score matching loss for diffusion models. That is, the control is learned via a least-squares problem by trying to fit a matching vector field. The training loss is optimized with respect to both the control function and a family of reparameterization matrices which appear in the matching vector field. Optimizing the reparameterization matrices reduces the variance of the matching vector field. Experimentally, our algorithm achieves lower error than all existing IDO techniques in three of the four settings that we consider. One of the key ideas for deriving the SOCM algorithm is the path-wise reparameterization trick, a novel technique to obtain low-variance estimates of the gradient of the conditional expectation of a functional of a random process with respect to its initial value. An interesting future direction is to use the path-wise reparameterization trick to decrease the variance of the matching vector field for diffusion models. The main roadblock when applying SOCM to more challenging problems is that the variance of the factor α(v, Xv, B) explodes when f and/or g are large, or when the dimension d is high. The large variance of α is due to the mismatch between the probability measures induced by the learned control and the optimal control, and it decreases as the learned control approaches the optimal control. The research presented is foundational, but it may serve as the basis of algorithms that improve the quality of generative models.

Figure 2: Plots of the L2 error incurred by the learned control (left), and the squared norm of the gradient with respect to the parameters θ of the control (right), for the QUADRATIC ORNSTEIN UHLENBECK (EASY) (top) and (HARD) (bottom) settings and for each IDO loss. Both plots show exponential moving averages.
Figure 3: Plots of the L2 error of the learned control for the LINEAR ORNSTEIN UHLENBECK and DOUBLE WELL settings. Funding disclosure Funded by respective author affiliations. [1] Michael S. Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants, 2022. [2] Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. ar Xiv preprint ar Xiv:2303.08797, 2023. [3] Jonathan Baxter and Peter L Bartlett. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319 350, 2001. [4] L ubomír Baˇnas, Herbert Dawid, Tsiry Avisoa Randrianasolo, Johannes Storn, and Xin Wen. Numerical approximation of a system of hamilton jacobi bellman equations arising in innovation dynamics. Journal of Scientific Computing, 92, 2022. [5] Christian Beck, Sebastian Becker, Philipp Grohs, Nor Jaafari, and Arnulf Jentzen. Solving stochastic differential equations and Kolmogorov equations by means of deep learning. ar Xiv:1806.00421, 2018. [6] Christian Beck, Fabian Hornung, Martin Hutzenthaler, Arnulf Jentzen, and Thomas Kruse. Overcoming the curse of dimensionality in the numerical approximation of Allen-Cahn partial differential equations via truncated full-history recursive multilevel Picard approximations. ar Xiv:1907.06729, 2019. [7] Sebastian Becker, Patrick Cheridito, and Arnulf Jentzen. Deep optimal stopping. Journal of Machine Learning Research, 20, 2019. [8] Andrea Belloni, Luigi Piroddi, and Maria Prandini. A stochastic optimal control solution to the energy management of a microgrid with storage and renewables. In 2016 American Control Conference (ACC), pages 2340 2345, 2016. [9] Julius Berner, Lorenz Richter, and Karen Ullrich. An optimal control perspective on diffusionbased generative modeling, 2023. [10] Joris Bierkens and Hilbert J Kappen. Explicit solution of relative entropy weighted control. Systems & Control Letters, 72:36 43, 2014. [11] J. Bonnans, Elisabeth Ottenwaelter, and Hasnaa Zidani. A fast algorithm for the two dimensional hjb equation of stochastic control. M2AN. Mathematical Modelling and Numerical Analysis. ESAIM, European Series in Applied and Industrial Mathematics, 38, 07 2004. [12] Elisa Calzola, Elisabetta Carlini, Xavier Dupuis, and Francisco Silva. A semi-Lagrangian scheme for Hamilton Jacobi Bellman equations with oblique derivatives boundary conditions. Numerische Mathematik, page 153, 2022. [13] E. Carlini, A. Festa, and N. Forcadel. A semi-Lagrangian scheme for Hamilton Jacobi Bellman equations on networks. SIAM J. Numer. Anal., 58(6):3165 3196, 2020. [14] René Carmona. Lectures on BSDEs, stochastic control, and stochastic differential games with financial applications, volume 1. SIAM, 2016. [15] René Carmona and Mathieu Laurière. Convergence analysis of machine learning algorithms for the numerical solution of mean field control and games i: The ergodic case. SIAM Journal on Numerical Analysis, 59(3):1455 1485, 2021. [16] René Carmona and Mathieu Laurière. Convergence analysis of machine learning algorithms for the numerical solution of mean field control and games: Ii the finite horizon case. The Annals of Applied Probability, 32(6):4065 4105, 2022. [17] René Carmona, François Delarue, et al. Probabilistic Theory of Mean Field Games with Applications I-II. Springer, 2018. [18] Quentin Chan-Wai-Nam, Joseph Mikael, and Xavier Warin. Machine learning for semilinear PDEs. Journal of Scientific Computing, 79(3):1667 1712, 2019. 
[19] Pratik Chaudhari, Adam Oberman, Stanley Osher, Stefano Soatto, and Guillaume Carlier. Deep relaxation: partial differential equations for optimizing deep neural networks. Research in the Mathematical Sciences, 5(3):30, 2018. [20] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. [21] Kristian Debrabant and Espen R. Jakobsen. Semi-lagrangian schemes for linear and fully non-linear diffusion equations. Mathematics of Computation, 82(283):1433 1462, 2013. [22] W. E, Jiequn Han, and Arnulf Jentzen. Deep learning-based numerical methods for highdimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics, 5(4):349 380, 2017. [23] Weinan E, Jiequn Han, and Arnulf Jentzen. Algorithms for solving high dimensional pdes: from nonlinear monte carlo to machine learning. Nonlinearity, 35(1):278, 2021. [24] Jin Feng and Thomas G Kurtz. Large deviations for stochastic processes. Number 131. American Mathematical Soc., 2006. [25] Wendell H Fleming and Jerome L Stein. Stochastic optimal control, international finance and debt. Journal of Banking & Finance, 28(5):979 996, 2004. [26] Paul Glasserman and David D. Yao. Some guidelines and guarantees for common random numbers. Manage. Sci., 38(6):884 908, jun 1992. [27] Peter W. Glynn. Stochastic approximation for monte carlo optimization. In Proceedings of the 18th Conference on Winter Simulation, page 356 365. Association for Computing Machinery, 1986. [28] Emmanuel Gobet. Monte-Carlo methods and stochastic processes: from linear to non-linear. CRC Press, 2016. [29] Emmanuel Gobet and Rémi Munos. Sensitivity analysis using Itô Malliavin calculus and martingales, and application to stochastic optimal control. SIAM Journal on control and optimization, 43(5):1676 1713, 2005. [30] Emmanuel Gobet, Jean-Philippe Lemor, Xavier Warin, et al. A regression-based Monte Carlo method to solve backward stochastic differential equations. The Annals of Applied Probability, 15(3):2172 2202, 2005. [31] Vicenç Gómez, Hilbert J Kappen, Jan Peters, and Gerhard Neumann. Policy search for path integral control. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 482 497. Springer, 2014. [32] Alex Gorodetsky, Sertac Karaman, and Youssef Marzouk. High-dimensional stochastic optimal control using continuous tensor decompositions. International Journal of Robotics Research, 37(2-3), 3 2018. [33] Constantin Greif. Numerical methods for hamilton-jacobi-bellman equations. 2017. [34] Jiequn Han and Ruimeng Hu. Deep fictitious play for finding markovian nash equilibrium in multi-agent games. In Mathematical and scientific machine learning, pages 221 245. PMLR, 2020. [35] Jiequn Han, Arnulf Jentzen, and W. E. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505 8510, 2018. [36] Carsten Hartmann and Christof Schütte. Efficient rare event simulation by optimal nonequilibrium forcing. Journal of Statistical Mechanics: Theory and Experiment, 2012(11):P11004, 2012. [37] Carsten Hartmann, Ralf Banisch, Marco Sarich, Tomasz Badowski, and Christof Schütte. Characterization of rare events in molecular dynamics. Entropy, 16(1):350 376, 2014. [38] Carsten Hartmann, Lorenz Richter, Christof Schütte, and Wei Zhang. 
Variational characterization of free energy: Theory and algorithms. Entropy, 19(11), 2017. [39] Carsten Hartmann, Omar Kebiri, Lara Neureither, and Lorenz Richter. Variational approach to rare event simulation using least-squares regression. Chaos: An Interdisciplinary Journal of Nonlinear Science, 29(6):063107, 2019. [40] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33. Curran Associates, Inc., 2020. [41] Lars Holdijk, Yuanqi Du, Ferry Hooft, Priyank Jaini, Bernd Ensing, and Max Welling. Stochastic optimal control for collective variable free sampling of molecular transition paths, 2023. [42] James E. Hutton and Paul I. Nelson. Interchanging the order of differentiation and stochastic integration. Stochastic Processes and their Applications, 18(2):371 377, 1984. [43] Martin Hutzenthaler and Thomas Kruse. Multilevel picard approximations of high-dimensional semilinear parabolic differential equations with gradient-dependent nonlinearities. SIAM Journal on Numerical Analysis, 58(2):929 961, 2020. [44] Martin Hutzenthaler, Arnulf Jentzen, Thomas Kruse, et al. Multilevel picard iterations for solving smooth semilinear parabolic heat equations. ar Xiv preprint ar Xiv:1607.03295, 2016. [45] Martin Hutzenthaler, Arnulf Jentzen, Thomas Kruse, Tuan Anh Nguyen, and Philippe von Wurstemberger. Overcoming the curse of dimensionality in the numerical approximation of semilinear parabolic partial differential equations. ar Xiv:1807.01212, 2018. [46] Martin Hutzenthaler, Arnulf Jentzen, and Thomas Kruse. Overcoming the curse of dimensionality in the numerical approximation of parabolic partial differential equations with gradientdependent nonlinearities. ar Xiv:1912.02571, 2019. [47] Max Jensen and Iain Smears. On the convergence of finite element methods for hamilton jacobi bellman equations. SIAM Journal on Numerical Analysis, 51(1):137 162, 2013. [48] Hilbert J Kappen, Vicenç Gómez, and Manfred Opper. Optimal control as a graphical model inference problem. Machine learning, 87(2):159 182, 2012. [49] Hilbert Johan Kappen and Hans Christian Ruiz. Adaptive importance sampling for control and inference. Journal of Statistical Physics, 162(5):1244 1266, 2016. [50] I. Karatzas and S. Shreve. Brownian Motion and Stochastic Calculus. Graduate Texts in Mathematics (113) (Book 113). Springer New York, 1991. [51] Patrick Kidger, James Foster, Xuechen Li, Harald Oberhauser, and Terry Lyons. Neural sdes as infinite-dimensional gans. In International Conference on Machine Learning, 2021. [52] Harold Kushner and Paul G Dupuis. Numerical methods for stochastic control problems in continuous time, volume 24. Springer Science & Business Media, 2013. [53] Pierre L Ecuyer and Gaétan Perron. On the convergence rates of ipa and fdc derivative estimators. Operations Research, 42(4):643 656, 1994. [54] Xuechen Li, Ting-Kam Leonard Wong, Ricky TQ Chen, and David Duvenaud. Scalable gradients for stochastic differential equations. In International Conference on Artificial Intelligence and Statistics, pages 3870 3882. PMLR, 2020. [55] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling, 2022. [56] Guan-Horng Liu, Yaron Lipman, Maximilian Nickel, Brian Karrer, Evangelos A. Theodorou, and Ricky T. Q. Chen. Generalized schrödinger bridge matching, 2023. [57] Jingtang Ma and Jianjun Ma. 
Finite difference methods for the hamilton-jacobi-bellman equations arising in regime switching utility maximization. J. Sci. Comput., 85(3):55, 2020. [58] Sanjoy K Mitter. Filtering and stochastic control: A historical perspective. IEEE Control Systems Magazine, 16(3):67 76, 1996. [59] Nikolas Nüsken and Lorenz Richter. Solving high-dimensional Hamilton Jacobi Bellman pdes using neural networks: perspectives from the theory of controlled diffusions and measures on path space. Partial differential equations and applications, 2:1 48, 2021. [60] Bernt Oksendal. Stochastic differential equations: an introduction with applications. Springer Science & Business Media, 2013. [61] Derek Onken, Levon Nurbekyan, Xingjian Li, Samy Wu Fung, Stanley Osher, and Lars Ruthotto. A neural network approach for high-dimensional optimal control applied to multiagent path finding. IEEE Transactions on Control Systems Technology, 31(1):235 251, jan 2023. [62] Grigorios A Pavliotis. Stochastic processes and applications: diffusion processes, the Fokker Planck and Langevin equations, volume 60. Springer, 2014. [63] Huyên Pham. Continuous-time stochastic control and optimization with financial applications, volume 61. Springer Science & Business Media, 2009. [64] L.S. Pontryagin. The Mathematical Theory of Optimal Processes. Interscience Publishers, 1962. [65] Aram-Alexandre Pooladian, Heli Ben-Hamu, Carles Domingo-Enrich, Brandon Amos, Yaron Lipman, and Ricky T. Q. Chen. Multisample flow matching with optimal transport couplings. In International Conference on Machine Learning, 2023. [66] Warren B. Powell and Stephan Meisel. Tutorial on stochastic optimization in energy part i: Modeling and policies. IEEE Transactions on Power Systems, 31(2):1459 1467, 2016. [67] Konrad Rawlik, Marc Toussaint, and Sethu Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. In Twenty-Third International Joint Conference on Artificial Intelligence, 2013. [68] Sebastian Reich. Data assimilation: The Schrödinger perspective. Acta Numerica, 28:635 711, 2019. [69] Martin I. Reiman and Alan Weiss. Sensitivity analysis for simulations via likelihood ratios. Oper. Res., 37:830 844, 1989. [70] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, 2015. [71] Lorenz Richter and Julius Berner. Improved sampling via learned diffusions. In The Twelfth International Conference on Learning Representations, 2024. [72] Geoffrey Roeder, Yuhuai Wu, and David K Duvenaud. Sticking the landing: Simple, lowervariance gradient estimators for variational inference. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. [73] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234 241. Springer, 2015. [74] Reuven Y Rubinstein and Dirk P Kroese. The cross-entropy method: a unified approach to combinatorial optimization, Monte-Carlo simulation and machine learning. Springer Science & Business Media, 2013. [75] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. ar Xiv preprint ar Xiv:1907.05600, 2019. [76] Yang Song, Jascha Sohl-Dickstein, Diederik P. 
Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations (ICLR 2021), 2021. [77] Evangelos Theodorou, Freek Stulp, Jonas Buchli, and Stefan Schaal. An iterative path integral stochastic optimal control approach for learning robotic tasks. IFAC Proceedings Volumes, 44 (1):11594 11601, 2011. 18th IFAC World Congress. [78] Ramon Van Handel. Stochastic calculus, filtering, and stochastic control. Course notes, URL http://www. princeton. edu/rvan/acm217/ACM217, 2007. [79] Francisco Vargas, Will Sussman Grathwohl, and Arnaud Doucet. Denoising diffusion samplers. In The Eleventh International Conference on Learning Representations, 2023. [80] C. Villani. Topics in Optimal Transportation. Graduate studies in mathematics. American Mathematical Society, 2003. [81] C. Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 2008. [82] Jichuan Yang and Harold J. Kushner. A monte carlo method for sensitivity analysis and parametric optimization of nonlinear stochastic systems. SIAM Journal on Control and Optimization, 29(5):1216 1249, 1991. [83] Jianfeng Zhang et al. A numerical scheme for BSDEs. The annals of applied probability, 14(1): 459 488, 2004. [84] Qinsheng Zhang and Yongxin Chen. Path integral sampler: A stochastic control approach for sampling. In International Conference on Learning Representations, 2022. [85] Wei Zhang, Han Wang, Carsten Hartmann, Marcus Weber, and Christof Schütte. Applications of the cross-entropy method to importance sampling and optimal control of diffusions. SIAM Journal on Scientific Computing, 36(6):A2654 A2672, 2014. [86] Mo Zhou, Jiequn Han, and Jianfeng Lu. Actor-critic method for high dimensional static Hamilton Jacobi Bellman partial differential equations based on neural networks. SIAM Journal on Scientific Computing, 43(6):A4043 A4066, 2021. 1 Introduction 1 2 Framework 2 2.1 Setup and Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2.2 Existing approaches and related work . . . . . . . . . . . . . . . . . . . . . . . . 4 3 Stochastic Optimal Control Matching 6 4 Experiments 8 5 Conclusion 9 A Technical assumptions 16 B Proofs of Sec. 2 16 C Proofs of Sec. 3 21 C.1 Proof of Thm. 1 and Prop. 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 C.2 Proof of the path-wise reparameterization trick (Prop. 1) . . . . . . . . . . . . . . 24 C.3 Informal derivation of the path-wise reparameterization trick . . . . . . . . . . . . 26 C.4 SOCM-Adjoint: replacing the path-wise reparameterization trick with the adjoint method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 C.5 Proof of Lemma 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 D Optimal reparameterization matrices 30 E Control warm-starting 32 F Experimental details and additional plots 34 F.1 Experimental details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 F.2 Model architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 F.3 Additional tables and plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 A Technical assumptions Throughout our work, we make the same assumptions as [59], which are needed for all the objects considered to be well-defined. 
Namely, we assume that: (i) The set U of admissible controls is given by U = {u C1(Rd [0, T]; Rd) | C > 0, (x, s) Rd [0, T], b(x, s) C(1 + |x|)}. (ii) The coefficients b and σ are continuously differentiable, σ has bounded first-order spatial derivatives, and (σσ )(x, s) is positive definite for all (x, s) Rd [0, T]. Furthermore, there exist constants C, c1, c2 > 0 such that b(x, s) C(1 + x ), (linear growth) c1 ξ 2 ξ (σσ )(x, s)ξ c2 ξ 2, (ellipticity) for all (x, s) Rd [0, T] and ξ Rd. B Proofs of Sec. 2 Proof of (5) By Itô s lemma, we have that V (Xu T , T) V (Xu t , t) = R T t s V (Xu s , s) + b(Xu s , s) + σ(Xu s , s)u(Xu s , s), V (Xu s , s) 2 Pd i,j=1(σσ )ij(Xu s , s) xi xj V (Xu s , s) ds + Su t , where Su t = λ R T t V (Xu s , s) σ(Xu s , s) d Bs. Note that by (4), s V (Xu s , s) + b(Xu s , s) + σ(Xu s , s)u(Xu s , s), V (Xu s , s) 2 Pd i,j=1(σσ )ij(Xu s , s) xi xj V (Xu s , s) 2 (σ V )(Xu s , s) 2 f(Xu s , s) + σ(Xu s , s)u(Xu s , s), V (Xu s , s) 2 (σ V )(Xu s , s) + u(Xu s , s) 2 1 2 u(Xu s , s) 2 f(Xu s , s), and this implies that g(Xu T ) V (Xu t , t) = R T t 1 2 (σ V )(Xu s , s) + u(Xu s , s) 2 1 2 u(Xu s , s) 2 f(Xu s , s) ds + Su t (26) Since E[Su t | Xu t = x] = 0, rearranging (26) and taking the conditional expectation with respect to Xu t yields the final result. Proof of (6)-(7) By Itô s lemma, we have that d V (Xs, s) = s V (Xs, s) + b(Xs, s), V (Xs, s) 2 Pd i,j=1(σσ )ij(Xs, s) xi xj V (Xs, s) ds + λ V (Xu s , s) σ(Xu s , s) d Bs, (27) Note that by (4), s V (Xs, s) + b(Xs, s), V (Xs, s) + λ 2 Pd i,j=1(σσ )ij(Xs, s) xi xj V (Xs, s) 2 (σ V )(Xs, s) 2 f(Xs, s). Plugging this into (27) concludes the proof. Proof of (8) Since Ys = V (Xs, s) and Zs = σ (s) V (Xs, s) = u (Xs, s) satisfy (7), we have that g(XT ) = YT = Yt R T t (f(Xs, s) 1 2 u (Xs, s) 2) ds λ R T t u (Xs, s), d Bs . Hence, recalling the definition of the work functional in (10), we have that W(X, t) = Yt + 1 2 R T t u (Xs, s) 2 ds λ R T t u (Xs, s), d Bs . (28) By Novikov s theorem (Thm. 2), we have that E[exp( λ 1W(X, t))|Xt] = e λ 1Yt E exp λ 1/2 R T t u (Xs, s), d Bs λ 1 2 R T t u (Xs, s) 2 ds Xt = e λ 1Yt, which concludes the proof of (8). Theorem 2 (Novikov s theorem). Let θs be a locally-H2 process which is adapted to the natural filtration of the Brownian motion (Bt)t 0. Define Z(t) = exp R t 0 θs d Bs 1 2 R t 0 θs 2 ds . (29) If for each t 0, E exp R t 0 θs 2 ds < + , then for each t 0, E[Z(t)] = 1. (30) Moreover, the process Z(t) is a positive martingale, i.e. if (Ft)t 0 is the filtration associated to the Brownian motion (Bt)t 0, then for t s, E[Zt|Fs] = Zs. Theorem 3 (Girsanov theorem). Let W = (Wt)t [0,T ] be a standard Wiener process, and let P be its induced probability measure over C([0, T]; Rd), known as the Wiener measure. Let Z(t) be as defined in (29) and suppose that the assumptions of Theorem 2 hold. Let (Ω, F) be the σ-algebra associated to BT . For any F F, define the measure Q(F) = EP[Z(T)1F ] . Q is a probability measure because of (30). Under the probability measure Q, the stochastic process { W(t)}0 t T defined as W(t) = W(t) R t 0 θs ds is a standard Wiener process. 
That is, for any n 0 and any 0 = t0 < t1 < < tn, the increments { W(ti+1) W(ti)}n 1 i=0 are independent and Q-Gaussian distributed with mean zero and covariance (ti+1 ti)I, which means that for any α Rd, the moment generating function of W(ti+1) W(ti) with respect to Q is as follows: EQ[exp( α, W(ti+1) W(ti) )] := EP exp α, W(ti+1) R ti+1 0 θs ds W(ti) + R ti 0 θs ds Z(T) = exp (ti+1 ti) α 2 Corollary 1 (Girsanov theorem for SDEs). If the two SDEs d Xt = b1(Xt, t) dt + σ(Xt, t) d Bt, X0 = xinit d Yt = (b1(Yt, t) + b2(Yt, t)) dt + σ(Yt, t) d Bt, Y0 = xinit admit unique strong solutions on [0, T], then for any bounded continuous functional Φ on C([0, T]), we have that E[Φ(X)] = E Φ(Y ) exp R T 0 σ(Yt, t) 1b2(Yt, t) d Bt 1 2 R T 0 σ(Yt, t) 1b2(Yt, t) 2 dt = E Φ(Y ) exp R T 0 σ(Yt, t) 1b2(Yt, t) d Bt + 1 2 R T 0 σ(Yt, t) 1b2(Yt, t) 2 dt , where Bt = Bt + R t 0 σ(Ys, s) 1b2(Ys, s) ds. More generally, b1 and b2 can be random processes that are adapted to filtration of B. Lemma 2. For an arbitrary v U, let Pv and P be respectively the laws of the SDEs d Xv t = (b(Xv t , t) + σ(t)v(Xv t , t)) dt + λσ(t)d Bt, Xv 0 p0, d Xt = b(Xt, t) dt + λσ(t)d Bt, X0 p0. We have that d P d Pv (Xv) = exp λ 1/2 R T 0 v(Xv t , t), d Bv t + λ 1 2 R T 0 v(Xv t , t) 2 dt (31) = exp λ 1/2 R T 0 v(Xv t , t), d Bt λ 1 2 R T 0 v(Xv t , t) 2 dt , d P (X) = exp λ 1/2 R T 0 v(Xt, t), d Bt λ 1 2 R T 0 v(Xt, t) 2 dt . (32) where Bv t := Bt + λ 1/2 R t 0 v(Xv s , s) ds. For the optimal control u , we have that d P d Pu (Xu ) = exp λ 1 V (Xu 0 , 0) + W(Xu , 0) , (33) d P (X) = exp λ 1 V (X0, 0) W(X, 0) , (34) where the functional W is defined in (10). Proof. The proof of (31)-(32) follows directly from Cor. 1. To prove (34), we use that by (28), W(X, 0) = V (X0, 0) + 1 2 R T 0 u (Xs, s) 2 ds λ R T 0 u (Xs, s), d Bs , (35) which implies that d P (X) = exp λ 1/2 R T 0 u (Xt, t), d Bt λ 1 2 R T 0 u (Xt, t) 2 dt = exp λ 1 V (X0, 0) W(X, 0) . To prove (33), we use that since d Xu t = b(Xu t , t) dt + λσ(t)d Bu t , equation (35) holds if we replace X and B by Xu and Bu , which reads W(Xu , 0) = V (Xu 0 , 0) + 1 2 R T t u (Xu s , s) 2 ds λ R T t u (Xu s , s), d Bv s . d P d Pu (Xu ) = exp λ 1/2 R T 0 u (Xu t , t), d Bu t + λ 1 2 R T 0 u (Xu t , t) 2 dt = exp λ 1 V (Xu 0 , 0) + W(Xu , 0) . Lemma 3. The following expression holds: EPu log d Pu d Pu = λ 1E R T 0 1 2 u(Xu t , t) 2 + f(Xu t , t) dt + g(Xu T ) V (Xu 0 , 0) , (36) Proof. To prove (36), we write d Pu (Xu) = log d Pu d P (Xu) d P d Pu (Xu) = log d Pu d P (Xu) + log d P = λ 1 V (Xu 0 , 0) R T 0 f(Xu t , t) dt g(Xu T ) λ 1/2 R T 0 u(Xu t , t), d Bt λ 1 2 R T 0 u(Xu t , t) 2 dt . Since EPu log d Pu d Pu = EPu log d Pu d Pu , and EPu R T 0 u(Xu t , t), d Bt ] = 0, the result follows. Proposition 3. (i) The following two expressions hold for arbitrary controls u, v in the class U of admissible controls: LCE(u) = EPu log d Pu d Pu = E λ 1/2 R T 0 u(Xv t , t), d Bt λ 1 R T 0 u(Xv t , t), v(Xv t , t) dt (37) 2 R T 0 u(Xv t , t) 2 dt + λ 1 V (Xv 0 , 0) W(Xv, 0) exp λ 1 V (Xv 0 , 0) W(Xv, 0) λ 1/2 R T 0 v(Xv t , t), d Bt λ 1 2 R T 0 v(Xv t , t) 2 dt , LCE(u) = λ 1 2 E R T 0 u (Xu t , t) u(Xu t , t) 2 dt . (38) When p0 is concentrated at a single point xinit, the terms V (xinit, 0) are constant and can be removed without modifying the landscape. In other words, LCE and LCE are equal up to constant terms and constant factors. (ii) When p0 is a generic probability measure, LCE and LCE have different landscapes, and LCE(u) = EPu log d Pu d Pu exp λ 1V (Xu 0 , 0) . 
u is still the only minimizer of the loss LCE, and for some constant K, we have that LCE(u, 0) = λ 1 2 E R T 0 u (Xu t , t) u(Xu t , t) 2 dt exp λ 1V (Xu 0 , 0) + K. (39) Proof. We begin with the proof of (i), and prove (37) first. Note that by the Girsanov theorem (Thm. 3), EPu log d Pu d Pu (Xu ) = EPu log d Pu d Pu (Xu ) = EPu log d Pu d P (Xu ) + log d P d Pu (Xu ) = EPv log d Pu d P (Xv) + log d P d Pu (Xv) d Pu d P (Xv) d P d Pv (Xv) (40) Note that by equations (32) and (34), d P (Xv) = λ 1/2 R T 0 u(Xv t , t), d Bv t λ 1 2 R T 0 u(Xv t , t) 2 dt, = λ 1/2 R T 0 u(Xv t , t), d Bt + λ 1 R T 0 u(Xv t , t), v(Xv t , t) dt λ 1 2 R T 0 u(Xv t , t) 2 dt, log d P d Pu (Xv) = λ 1 V (Xv 0 , 0) + W(Xv, 0) . (41) where Bv t := Bt + λ 1/2 R t 0 v(Xv s , s) ds. Also, d P (Xv) = exp λ 1 V (Xv 0 , 0) W(Xv, 0) , d P d Pv (Xv) = exp λ 1/2 R T 0 v(Xv t , t), d Bt λ 1 2 R T 0 v(Xv t , t) 2 dt . (42) If we plug (41) and (42) into the right-hand side of (40), we obtain EPu log d Pu d Pu (Xu ) = EPu (log d Pu d P (Xv) + log d P d Pu (Xv)) d Pu d P (Xv) d P = E λ 1/2 R T 0 u(Xv t , t), d Bt + λ 1 R T 0 u(Xv t , t), v(Xv t , t) dt 2 R T 0 u(Xv t , t) 2 dt + λ 1 V (Xv 0 , 0) + W(Xv, 0) exp λ 1 V (Xv 0 , 0) W(Xv, 0) λ 1/2 R T 0 v(Xv t , t), d Bt λ 1 2 R T 0 v(Xv t , t) 2 dt , which concludes the proof. To show (38), we use that by Cor. 1, d Pu (Xu )=exp λ 1/2 R T 0 u (Xu t , t) u(Xu t , t), d Bt λ 1 2 R T 0 u (Xu t , t) u(Xu t , t) 2 dt . EPu log d Pu d Pu = EPu log d Pu 2 E R T 0 u (Xu t , t) u(Xu t , t) 2 dt . Next, we prove (ii). The first instance of V (Xv 0 , 0) in (37) can be removed without modifying the landscape of the loss. Hence, we are left with LCE(u) = E λ 1/2 R T 0 u(Xv t , t), d Bv t λ 1 R T 0 u(Xv t , t), v(Xv t , t) dt (43) 2 R T 0 u(Xv t , t) 2 dt λ 1W(Xv, 0) exp λ 1 V (Xv 0 , 0) W(Xv, 0) λ 1/2 R T 0 v(Xv t , t), d Bt λ 1 2 R T 0 v(Xv t , t) 2 dt And this can be expressed as LCE(u) = E g(u; Xv 0 ) exp λ 1V (Xv 0 , 0) , g(u; x) = E λ 1/2 R T 0 u(Xv t , t), d Bv t λ 1 R T 0 u(Xv t , t), v(Xv t , t) dt 2 R T 0 u(Xv t , t) 2 dt λ 1W(Xv, 0) exp λ 1W(Xv, 0) λ 1/2 R T 0 v(Xv t , t), d Bt λ 1 2 R T 0 v(Xv t , t) 2 dt |Xv 0 = x . If we consider g(u; x) as a loss function for u, note that it is equivalent to the loss LCE(u) equation in (43) for the choice p0 = δx, i.e., p0 concentrated at x. Since the optimal control u is independent of the starting distribution p0, we deduce that u is the unique minimizer of g(u; x), for all x Rd. In consequence, u is the unique minimizer of LCE(u) = E[g(u; Xv 0 )]. To prove (39), note that up to a constant term, the only difference between LCE(u) and LCE(u) is the expectation is reweighted importance weight exp λ 1V (Xv 0 , 0) . Lemma 4. (i) We can rewrite LVarv(u) = Var exp Y u,v T λ 1g(Xv T ) + λ 1V (Xv 0 , 0) , Llog Varv(u) = Var Y u,v T λ 1g(Xv T ) + λ 1V (Xv 0 , 0) . When p0 is concentrated at xinit, the terms V (xinit, 0) are constants and can be removed without modifying the landscape. In other words, LVarv and Llog Varv are equal to LVarv and Llog Varv up to a constant term and a constant factor, respectively. (ii) When p0 is general, LVarv and LVarv have a different landscape, and the optimum of LVarv may be different from u . A related loss that does preserve the optimum is: LVarv(u) = E[Var Pv( d Pu d Pu (Xv)|Xv 0 ) exp( λ 1V (Xv 0 , 0))] = E[Var exp( Y u,v T λ 1g(Xv T ))|Xv 0 ]. In practice, this is implemented by sampling the m trajectories in one batch starting at the same point Xv 0 . 
(iii) Also, Llog Varv and Llog Varv have a different landscape, and the optimum of Llog Varv may be different from u . In particular, Llog Varv(u) = Var Pv(log d Pu d Pu (Xv) exp( λ 1V (Xv 0 , 0))). A loss that does preserve the optimum u is Llog Varv(u) = E[Var Pv(log d Pu d Pu (Xv)|Xv 0 ) exp( λ 1V (Xv 0 , 0))] = E[Var Y u,v T λ 1g(Xv T )|Xv 0 ]. Proof. Using (34) and (31), we have that d P (Xv) = exp λ 1 V (Xv 0 , 0) W(Xv, 0) , d P d Pu (Xv) = exp λ 1/2 R T 0 u(Xv t , t), d Bv t + λ 1 2 R T 0 u(Xv t , t) 2 dt = exp λ 1/2 R T 0 u(Xv t , t), d Bt λ 1 R T 0 u(Xv t , t), v(Xv t , t) dt 2 R T 0 u(Xv t , t) 2 dt . d Pu (Xv) = log d Pu d P (Xv) + log d P d Pu (Xv) = Y u,v T λ 1g(Xv T ) + λ 1V (Xv 0 , 0). Since LVarv(u) = Var Pv( d Pu d Pu ) and Llog Varv(u) = Var Pv(log d Pu d Pu ), this concludes the proof of (i). To prove (ii), note that for general p0, V (Xv 0 , 0) is no longer a constant, but it is if we condition on Xv 0 . The proof of (iii) is analogous. C Proofs of Sec. 3 C.1 Proof of Thm. 1 and Prop. 2 We prove Thm. 1 and Prop. 2 at the same time. Recall that by (9), the optimal control is of the form u (x, t) = σ(t) V (x, t). Consider the loss T R T 0 u(Xt, t) + σ(t) V (Xt, t) 2 dt exp λ 1 R T 0 f(Xt, t) dt λ 1g(XT ) . Clearly, the unique optimum of L is σ(t) V . We can rewrite L as T R T 0 u(Xt, t) 2 + 2 u(Xt, t), σ(t) V (Xt, t) + σ(t) V (Xt, t) 2 dt(44) exp λ 1 R T 0 f(Xt, t) dt λ 1g(XT ) . Hence, we can express L as a sum of three terms: one involving u(Xt, t) 2, another involving u(Xt, t), σ(t) V (Xt, t) , and a third one, which is constant with respect to u, involving V (Xt, t) 2. The following lemma provides an alternative expression for the cross term: Lemma 5. The following equality holds: T R T 0 u(Xt, t), σ(t) V (Xt, t) dt exp λ 1 R T 0 f(Xt, t) dt λ 1g(XT ) T R T 0 u(Xt, t), σ(t) x E exp λ 1 R T t f(Xs, s) ds λ 1g(XT ) Xt = x exp λ 1 R t 0 f(Xs, s) ds dt . Proof. Recall the definition of W(X, t) in (35), which means that W(X, 0) = W(X, t) + R t 0 f(Xs, s) ds. (46) Let {Ft}t [0,T ] be the filtration generated by the Brownian motion B. Then, equation (9) implies that σ(t) V (Xt, t) = λσ(t) x E exp λ 1W(X,t) Ft E exp λ 1W(X,t) Ft (47) We proceed as follows: T R T 0 u(Xt, t), σ(t) V (Xt, t) dt exp λ 1W(X, 0) T R T 0 u(Xt, t), σ(t) x E exp λ 1W(X,t) Ft E exp λ 1W(X,t) Ft E exp λ 1W(X, t) Ft exp λ 1 R t 0 f(Xs, s) ds dt T R T 0 u(Xt, t), σ(t) x E exp λ 1W(X, t) Ft exp λ 1 R t 0 f(Xs, s) ds dt (ii) = λE 1 T R T 0 u(Xt, t), σ(t) x E exp λ 1W(X, t) Xt = x exp λ 1 R t 0 f(Xs, s) ds dt . Here, (i) holds by equation (47), the law of total expectation and equation (46), and (ii) holds by the Markov property of the solution of an SDE. The following proposition, which we prove in Subsec. C.2, provides an alternative expression for x E exp λ 1 R T t f(Xs, s) ds λ 1g(XT ) Xt = x . The technique, which is novel and we denote by Girsanov reparamaterization trick, is of independent interest and may be applied in other settings, as we discuss in Sec. 5. Proposition 1 (Path-wise reparameterization trick for stochastic optimal control). For each t [0, T], let Mt : [t, T] Rd d be an arbitrary continuously differentiable function matrix-valued function such that Mt(t) = Id. We have that x E exp λ 1 R T t f(Xs, s) ds λ 1g(XT ) Xt = x = E λ 1 R T t Mt(s) xf(Xs, s) ds λ 1Mt(T) g(XT ) + λ 1/2 R T t (Mt(s) xb(Xs, s) s Mt(s))(σ 1) (s)d Bs exp λ 1 R T t f(Xs, s) ds λ 1g(XT ) Xt = x . 
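As a quick numerical illustration of Prop. 1 before it is put to use below, the identity can be checked by crude Monte Carlo in one dimension. The sketch below is ours and not part of the paper's code: it assumes an Ornstein-Uhlenbeck drift $b(x,s) = -x$, unit diffusion $\sigma = 1$, $\lambda = 1$, quadratic costs $f(x,s) = x^2$, $g(x) = x^2$, the constant choice $M_t(s) = 1$ (so $\partial_s M_t(s) = 0$), and $t = 0$; it compares the path-wise estimator against a finite-difference approximation of the left-hand side.

```python
import numpy as np

# Minimal 1D Monte Carlo sanity check of the path-wise reparameterization trick
# (Prop. 1). All concrete choices are illustrative, not taken from the paper:
#   b(x,s) = -x, sigma(s) = 1, lambda = 1, f(x,s) = x^2, g(x) = x^2,
#   M_t(s) = 1 for all s (hence d/ds M_t(s) = 0), and t = 0.
T, lam, K, n_paths = 1.0, 1.0, 200, 200_000
dt = T / K
b,  db = lambda x: -x,     lambda x: -np.ones_like(x)   # drift and d/dx b
f,  df = lambda x: x**2,   lambda x: 2.0 * x            # state cost and d/dx f
g,  dg = lambda x: x**2,   lambda x: 2.0 * x            # terminal cost and d/dx g

def simulate(x0, seed):
    """Simulate the uncontrolled SDE dX = b dt + sqrt(lam) dB and return
    (path-wise gradient estimate, plain estimate of the conditional expectation)."""
    rng = np.random.default_rng(seed)
    X = np.full(n_paths, x0)
    int_f = np.zeros(n_paths); int_df = np.zeros(n_paths); stoch = np.zeros(n_paths)
    for _ in range(K):
        dB = rng.normal(0.0, np.sqrt(dt), n_paths)
        int_f  += f(X) * dt          # \int_0^T f(X_s, s) ds
        int_df += df(X) * dt         # \int_0^T M(s) grad_x f(X_s, s) ds
        stoch  += db(X) * dB         # \int_0^T (M(s) grad_x b - dM/ds) sigma^{-T} dB_s
        X = X + b(X) * dt + np.sqrt(lam) * dB
    expo   = np.exp(-(int_f + g(X)) / lam)
    weight = -int_df / lam - dg(X) / lam + stoch / np.sqrt(lam)
    return (weight * expo).mean(), expo.mean()

x, eps = 0.5, 1e-2
grad_pw, _ = simulate(x, seed=0)
# Finite-difference reference with common random numbers (same seed for both shifts).
fd = (simulate(x + eps, seed=1)[1] - simulate(x - eps, seed=1)[1]) / (2 * eps)
print(f"path-wise estimate: {grad_pw:.3f}   finite-difference reference: {fd:.3f}")
```

The two printed numbers should agree up to Monte Carlo and discretization error; choosing a different $M_t$ changes only the variance of the estimator, not its mean, which is exactly the freedom that SOCM exploits.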
Plugging (23) into the right-hand side of (45), we obtain that T R T 0 u(Xt, t), σ(t) V (Xt, t) dt exp λ 1 R T 0 f(Xt, t) dt λ 1g(XT ) T R T 0 u(Xt, t), σ(t) R T t Mt(s) xf(Xs, s) ds + Mt(T) g(XT ) λ1/2 R T t (Mt(s) xb(Xs, s) s Mt(s))(σ 1) (s)d Bs dt exp λ 1 R T 0 f(Xt, t) dt λ 1g(XT ) . If we plug this into the right-hand side of (44) and complete the squared norm, we get that T R T 0 ( u(Xt, t) w(t, X, B, Mt) 2 w(t, X, B, Mt) 2 + u (Xt, t) 2) dt exp λ 1W(X, 0) where w is defined as: w(t, X, B, Mt) = σ(t) R T t Mt(s) xf(Xs, s) ds Mt(T) g(XT ) + λ1/2 R T t (Mt(s) xb(Xs, s) s Mt(s))(σ 1) (s)d Bs . We also define Φ(u; X, B) as Φ(u; X, B) = 1 T R T 0 ( u(Xt, t) w(t, X, B, Mt) 2) dt. Now, by the Girsanov theorem (Thm. 3), we have that for an arbitrary control v U, E[Φ(u; X, B) exp λ 1W(X, 0) ] =E Φ(u; Xv, Bv) exp λ 1W(Xv, 0) λ 1/2 R T 0 v(Xv t , t), d Bv t + λ 1 2 R T 0 v(Xv t , t) 2 dt =E Φ(u; Xv, Bv) exp λ 1W(Xv, 0) λ 1/2 R T 0 v(Xv t , t), d Bt λ 1 2 R T 0 v(Xv t , t) 2 dt , where Bv t := Bt + λ 1/2 R t 0 v(Xv s , s) ds. Reexpressing Bv in terms of B, we can rewrite Φ(u; Xv, Bv) and w(t, Xv, Bv, Mt) as follows: Φ(u; Xv, Bv) = 1 T R T 0 u(Xv t , t) w(t, Xv, Bv, Mt) 2 dt, w(t, Xv, Bv, Mt) = σ(t) R T t Mt(s) xf(Xv s , s) ds Mt(T) g(Xv T ) + λ1/2 R T t (Mt(s) xb(Xv s , s) s Mt(s))(σ 1) (Xv s , s)d Bs + R T t (Mt(s) xb(Xv s , s) s Mt(s))(σ 1) (Xv s , s)v(Xv s , s)ds . Putting everything together, we obtain that L(u) = LSOCM(u, M) K, where L(u, M) is the loss defined in (19) (note that w(t, v, Xv, B, Mt) := w(t, Xv, Bv, Mt)), and T R T 0 ( w(t, X, B, Mt) 2 u (Xt, t) 2) dt exp λ 1W(X, 0) To complete the proof of equation (24), remark that L(u) can be rewritten as T R T 0 u(Xt, t) u (Xt, t) 2 dt exp λ 1W(X, 0) T R T 0 u(Xt, t) u (Xt, t) 2 dt d Pu d P (X) exp( λ 1V (X0, 0)) T R T 0 u(Xu t , t) u (Xu t , t) 2 dt exp( λ 1V (Xu 0 , 0)) . It only remains to reexpress K. Note that by Prop. 1, we have that u (Xt, t) = E w(t,X,B,Mt) exp λ 1W(X,0) |Ft E exp λ 1W(X,0) |Ft = E w(t,X,B,Mt) d Pu d P (X)|Ft exp( λ 1V (X0,0)) d P (X)|Ft exp( λ 1V (X0,0)) = E w(t,X,B,Mt) d Pu = E w(t, Xu , Bu , Mt)|Xu t = Xt Hence, using the Girsanov theorem (Thm. 3) several times, we have that T R T 0 w(t, Xu , Bu , Mt) 2 E w(t, Xu , Bu , Mt)|Xu t 2 dt exp( λ 1V (Xu 0 , 0)) T R T 0 w(t, Xu , Bu , Mt) E w(t, Xu , Bu , Mt)|Xu t 2 dt exp( λ 1V (Xu 0 , 0)) T R T 0 w(t, X, B, Mt) E[ w(t,X,B,Mt) exp( λ 1W(X,0))|Xt] E[exp( λ 1W(X,0))|Xt] 2 dt exp( λ 1W(X, 0)) T R T 0 w(t, v, Xv, B, Mt) E[w(t,v,Xv,B,Mt)α(v,Xv,B)|Xv t ] E[α(v,Xv,B)|Xv t ] 2 dt α(v, Xv, B) , which concludes the proof, noticing that K = Var(w; M). Remark 1 (The optimal Mt is the same for all v). Looking at equation (48), we observe that Var(w; M) does not depend on the base control v. Since minimizing LSOCM(u, M) with respect to M is equivalent to minimizing Var(w; M), we deduce that the optimal M does not depend on the vector field v. C.2 Proof of the path-wise reparameterization trick (Prop. 1) We prove a more general statement (Prop. 4), and show that Prop. 1 is a particular case of it. Proposition 4 (Path-wise reparameterization trick). Let (Ω, F, P) be a probability space, and B : Ω [0, T] Rd be a Brownian motion. Let X : Ω [0, T] Rd be the uncontrolled process given by (6), and let ψ : Ω Rd [0, T] Rd be an arbitrary random process such that: For all z Rd, the process ψ( , z, ) : Ω [0, T] Rd is adapted to the filtration (Fs)s [0,T ] of the Brownian motion B. 
For all ω Ω, ψ(ω, , ) : Rd [0, T] Rd is a twice-continuously differentiable function such that ψ(ω, z, 0) = z for all z Rd, and ψ(ω, 0, s) = 0 for all s [0, T]. Let F : C([0, T]; Rd) R be a Fréchet-differentiable functional. We use the notation X +ψ(z, ) = (Xs(ω) + ψ(ω, z, s))s [0,T ] to denote the shifted process, and we will omit the dependency of ψ on ω in the proof. Then, x E exp F(X) X0 = x (49) =E z F(X+ψ(z, ))|z=0+λ 1/2 R T 0 ( zψ(0, s) xb(Xs, s) z sψ(0, s))(σ 1) (s)d Bs exp F(X) X0 = x Proof of Prop. 1. Given a family of functions (Mt)t [0,T ] satisfying the conditions in Prop. 1, we can define a family (ψt)t [0,T ] of functions ψt : Rd [t, T] Rd as ψt(z, s) = Mt(s) z. Note that ψt(z, t) = z for all z Rd and ψt(0, s) = 0 for all s [t, T], and that zψt(z, s) = Mt(s). Hence, ψt can be seen as a random process which is constant with respect to ω Ω, and which fulfills the conditions in Prop. 4 up to a trivial time change of variable from [t, T] to [0, T]. We also define the family (Ft)t [0,T ] of functionals Ft : C([t, T]; Rd) R as Ft(X) = λ 1 R T t f(Xs, s) ds + λ 1g(XT ). We have that z Ft(X+ψt(z, )) = z λ 1 R T t f(Xs + ψt(z, s), s) ds + λ 1g(XT + ψt(z, T)) (i)=λ 1 R T t zψt(z, s) f(Xs+ψt(z, s), s) ds+λ 1 zψt(z, T) g(XT +ψt(z, T)) =λ 1 R T t Mt(s) f(Xs+ψt(z, s), s) ds+λ 1Mt(T) g(XT +ψt(z, T)), where equality (i) holds by the Leibniz rule. Using that ψt(0, s) = 0, we obtain that: z Ft(X + ψt(z, )) z=0 = λ 1 R T t zψt(0, s) f(Xs, s) ds + λ 1 zψt(T, 0) g(XT ), Up to a trivial time change of variable from [t, T] to [0, T], Prop. 1 follows from plugging these choices into equation (49). Remark 2. We can use matrices Mt(s) that depend on the process X up to time s, since the resulting processes ψt( , z, ) are adapted to the filtration of the Brownian motion B. More specifically, if we let Mt : Rd [t, T] Rd d be an arbitrary continuously differentiable function matrix-valued function such that Mt(x, t) = Id for all x Rd, and we define the exponential moving average of X as the process X(υ) given by X(υ) t = υ R t 0 e υ(t s)Xs ds, we have that d ds Mt(X(υ) s , s)= Mt(X(υ) s , s), d X(υ) s ds + s Mt(X(υ) s , s)=λ x Mt(X(υ) s , s), Xs X(υ) s + s Mt(X(υ) s , s), and we can write x E exp λ 1 R T t f(Xs, s) ds λ 1g(XT ) Xt = x = E λ 1 R T t Mt(X(υ) s , s) xf(Xs, s) ds λ 1Mt(X(υ) T , T) g(XT ) + λ 1/2 R T t (Mt(X(υ) s , s) xb(Xs, s) d ds Mt(X(υ) s , s))(σ 1) (s)d Bs exp λ 1 R T t f(Xs, s) ds λ 1g(XT ) Xt = x . Plugging this into the proof of Thm. 1, we would obtain a variant of SOCM (Alg. 2) where the matrix-valued neural network Mω takes inputs (t, s, x) instead of (t, s). Since the optimization class is larger, from the bias-variance in Prop. 2 we deduce that this variant would yield a lower variance of the vector field w, and likely an algorithm with lower error. This is at the expense of an increased number of function evaluations (NFE) of Mω; one would need K(K+1)m 2 NFE per batch instead of only K(K+1) 2 , which may be too expensive if the architecture of Mω is large. A way to speed up the computation per batch is to parameterize Mt using cubic splines. Proof of Prop. 4. Recall that d Xs = b(Xs, s) ds + λσ(s) d Bs, X0 p0, is the SDE for the uncontrolled process. For arbitrary x, z Rd, we consider the following SDEs conditioned on the initial points: d X(x+z) s = b(X(x+z) s , s) ds + λσ(s) d Bs, X(x+z) 0 = x + z, (50) d X(x) s = b(X(x) s , s) ds + λσ(s) d Bs, X(x) 0 = x. (51) Suppose that ψ : Rd [0, T] Rd satisfies the properties in the statement of Prop. 4. 
If X(x) is a solution of d X(x) s = (b( X(x) s + ψ(z, s), s) sψ(z, s)) ds + λσ(s) d Bs, X(x) 0 = x, then X(x+z) = X(x) + ψ(z, ) is a solution of (50). This is because X(x+z) 0 = X(x) 0 + ψ(z, 0) = X(x) 0 + z = x + z, and d X(x+z) s = d X(x) s + sψ(z, s) ds = (b( X(x) s + ψ(z, s), s) sψ(z, s)) ds + λσ(s) d Bs + sψ(z, s) ds = b(X(x+z) s , s) ds + λσ(s) d Bs, Note that we may rewrite (51) as d X(x) s = (b(X(x) s + ψ(z, s), s) sψ(z, s)) ds + (b(X(x) s , s) b(X(x) s + ψ(z, s), s) + sψ(z, s)) ds + λσ(s) d Bs, X(x) t p0. Hence, since ψ(z, s) is a random process adapted to the filtration of B, we can apply the Girsanov theorem for SDEs (Corollary 1) on X(x) and X(x), and we have that for any bounded continuous functional Φ, E[Φ( X(x))] = E Φ(X(x)) exp R T 0 λ 1/2σ(s) 1(b(X(x) s + ψ(z, s), s) b(X(x) s , s) sψ(z, s)) d Bs 2 R T 0 λ 1/2σ(s) 1(b(X(x) s + ψ(z, s), s) b(X(x) s , s) sψ(z, s)) 2 ds . (52) We can write E exp F(X) X0 = x + z (i)= E exp F(X(x+z)) (ii) = E exp F( X(x) + ψ(z, )) (iii) = E exp F(X(x) + ψ(z, )) exp R T 0 λ 1/2σ(s) 1(b(X(x) s + ψ(z, s), s) b(X(x) s , s) sψ(z, s)) d Bs 2 R T 0 λ 1/2σ(s) 1(b(X(x) s + ψ(z, s), s) b(X(x) s , s) sψ(z, s)) 2 ds (iv) = E exp F(X+ψ(z, ))+ R T 0 λ 1/2σ(s) 1(b(Xs+ψ(z, s), s) b(Xs, s) sψ(z, s)) d Bs 2 R T 0 λ 1/2σ(s) 1(b(Xs + ψ(z, s), s) b(Xs, s) sψ(z, s)) 2 ds |X0 = x Equality (i) holds by the definition of X(x+z), equality (ii) holds by the fact X(x+z) s = X(x) s +ψ(z, s), equality (iii) holds by equation (52), and equality (iv) holds by the definition of X(x) s . We conclude the proof by differentiating the right-hand side of (53) with respect to z. Namely, x E exp F(X) X0 = x = z E exp F(X) X0 = x + z z=0 (i)= E z F(X + ψ(z, )) + λ 1/2 R T 0 ( zψ(0, s) xb(Xs, s) z sψ(0, s))(σ 1) (s)d Bs exp F(X) X0 = x In equality (i) we used (53), and that: by the Leibniz rule, z R T 0 σ(s) 1(b(Xs + ψ(z, s), s) b(Xs, s) sψ(z, s)) 2 ds z=0 = R T 0 z σ(s) 1(b(Xs + ψ(z, s), s) b(Xs, s) sψ(z, s)) 2 z=0 ds = 0. and by the Leibniz rule for stochastic integrals (see [42]), z R T 0 σ(s) 1(b(Xs + ψ(z, s), s) b(Xs, s) sψ(z, s)) d Bs z=0 = R T 0 ( zψ(0, s) xb(Xs, s) z sψ(0, s))(σ 1) (s) d Bs. C.3 Informal derivation of the path-wise reparameterization trick In this subsection, we provide an informal, intuitive derivation of the path-wise reparameterization trick as stated in Prop. 4. For simplicity, we particularize the functional F to F(X) = λ 1 R T 0 f(Xs, s) ds + λ 1g(XT ). Consider the Euler-Maruyama discretization of the uncontrolled process X defined in (6), with K + 1 time steps (let δ = T/K be the step size). This is a family of random variables ˆX = ( ˆXk)k=0:K defined as ˆX0 p0, ˆXk+1 = ˆXk + δb( ˆXk, kδ) + δλσ(kδ)εk, εk N(0, I). Note that we can approximate E exp λ 1 R T 0 f(Xs, s) ds λ 1g(XT ) X0 = x E exp λ 1δ PK 1 k=0 f( ˆXk, s) λ 1g( ˆXK) ˆX0 = x , and that this is an equality in the limit K , as the interpolation of the Euler-Maruyama discretization ˆX(x) converges to the process X(x). Now, remark that for k {0, . . . , K 1}, ˆXk+1| ˆXk N( ˆXk + δb( ˆXk, kδ), δλ(σσ )(kδ)). Hence, E exp λ 1δ PK 1 k=0 f( ˆXk, s) λ 1g( ˆXK) ˆX0 = x (Rd)K exp λ 1δ PK 1 k=0 f(ˆxk, s) λ 1g(ˆx K) 1 2δλ PK 1 k=1 σ 1(kδ)(ˆxk+1 ˆxk δb(ˆxk, kδ)) 2 1 2δλ σ 1(0)(ˆx1 x δb(x, 0)) 2 dˆx1 dˆx K, where C = q (2πδλ)K QK 1 k=0 det((σσ )(kδ)). Now, let ψ : Rd [0, T] Rd be an arbitrary twice differentiable function such that ψ(z, 0) = z for all z Rd, and ψ(0, s) = 0 for all s [0, T]. 
We can write x E exp λ 1δ PK 1 k=0 f( ˆXk, s) λ 1g( ˆXK) | ˆX0 = x = z E exp λ 1δ PK 1 k=0 f( ˆXk, s) λ 1g( ˆXK) | ˆX0 = x + z |z=0 (Rd)K exp λ 1δ PK 1 k=0 f(ˆxk, s) λ 1g(ˆx K) 1 2δλ PK 1 k=1 σ 1(kδ)(ˆxk+1 ˆxk δb(ˆxk, kδ)) 2 1 2δλ σ 1(0)(ˆx1 (x+z) δb(x+z, 0)) 2 dˆx1 dˆx K |z=0 (Rd)K exp λ 1δ PK 1 k=0 f(ˆxk + ψ(z, kδ), s) λ 1g(ˆx K + ψ(z, Kδ)) 1 2δλ PK 1 k=1 σ 1(kδ)(ˆxk+1 + ψ(z, (k + 1)δ) ˆxk ψ(z, kδ) δb(ˆxk+ψ(z, kδ), kδ)) 2 1 2δλ σ 1(0)(ˆx1+ψ(z, δ) (x+ψ(z, 0)) δb(x+ψ(z, 0), 0)) 2 dˆx1 dˆx K |z=0, (54) In the last equality, we used that for k {1, . . . , K}, the variables ˆxk are integrated over Rd, which means that adding an offset ψ(z, kδ) does not change the value of the integral. We also used that ψ(z, 0) = z. Now, for fixed values of ˆx = (ˆx1, . . . , ˆx K), and letting ˆx0 = x, we define Gˆx(z) = λ 1δ PK 1 k=0 f(ˆxk + ψ(z, kδ), s) + λ 1g(ˆx K + ψ(z, Kδ)) + 1 2δλ PK 1 k=0 σ 1(kδ)(ˆxk+1+ψ(z, (k + 1)δ) ˆxk ψ(z, kδ) δb(ˆxk+ψ(z, kδ), kδ)) 2. Using that ψ(0, s) = 0 for all s [0, T], we have that: Gˆx(0) = λ 1δ PK 1 k=0 f(ˆxk, s) + λ 1g(ˆx K) + 1 2δλ PK 1 k=0 σ 1(kδ)(ˆxk+1 ˆxk δb(ˆxk, kδ)) 2. Gˆx(z)|z=0 = λ 1δ PK 1 k=0 ψ(0, kδ) f(ˆxk, s) + λ 1 ψ(0, Kδ) g(ˆx K) δλ PK 1 k=0 ( zψ(0, (k + 1)δ) zψ(0, kδ) δ ψ(0, kδ) b(ˆxk, kδ)) ((σ 1) σ 1)(kδ)(ˆxk+1 ˆxk δb(ˆxk, kδ)) And we can express the right-hand side of (54) in terms of Gˆx(0) and Gˆx(z)|z=0: (Rd)K exp Gˆx(z) dy1 dy K (Rd)K Gˆx(z)|z=0 exp Gˆx(0) dy1 dy K We define ϵk = 1 δλσ 1(kδ)(ˆxk+1 ˆxk δb(ˆxk, kδ)), and then, we are able to write ˆxk+1 = ˆxk + δb(ˆxk, kδ) + δλσ(kδ)ϵk, ˆx0 = x (55) Gˆx(0) = λ 1δ PK 1 k=0 f(ˆxk, s) + λ 1g(ˆx K) + 1 2 PK 1 k=0 ϵk 2, Gˆx(z)|z=0 = λ 1δ PK 1 k=0 ψ(0, kδ) f(ˆxk, s) + λ 1 ψ(0, Kδ) g(ˆx K) δλ 1 PK 1 k=0 ( s zψ(0, kδ) + O(δ) ψ(0, kδ) b(ˆxk, kδ))(σ 1) (kδ)ϵk. (56) Then, taking the limit K (i.e. δ 0), we recognize (55) as Euler-Maruyama discretization of the uncontrolled process X in equation (6) conditioned on X0 = x, and the last term in (56) as the Euler-Maruyama discretization of the stochastic integral λ 1/2 R T 0 ( s zψ(0, s) ψ(0, s) b(X(x) s , s))(σ 1) (s) d Bs. Thus, lim K x E exp λ 1δ PK 1 k=0 f( ˆXk, s) λ 1g( ˆXK) = E λ 1 R T 0 ψ(0, s) xf(Xs, s) ds λ 1 ψ(0, T) g(XT ) + λ 1/2 R T 0 ( ψ(0, s) xb(Xs, s) s ψ(0, s))(σ 1) (s) d Bs exp λ 1 R T 0 f(Xs, s) ds λ 1g(XT ) X0 = x , which concludes the derivation. C.4 SOCM-Adjoint: replacing the path-wise reparameterization trick with the adjoint method Proposition 5. Let LSOCM Adj : L2(Rd [0, T]; Rd) R be the loss function defined as LSOCM Adj(u) := E 1 T R T 0 u(Xv t , t) + σ(t) a(t, Xv) 2 dt α(v, Xv, B) , where Xv is the process controlled by v (i.e., d Xt = (b(Xt, t) + σ(t)v(Xt, t)) dt + λσ(Xt, t) d Bt and X0 p0), α(v, Xv, B) is the importance weight defined in (20), and a(t, Xv) is the solution of the ODE da(t) dt = xb(Xv t , t)a(t) xf(Xv t , t), a(T) = g(Xv T ), LSOCM Adj has a unique optimum, which is the optimal control u . Proof. The proof follows the same structure as that of Thm. 1. Instead of plugging the path-wise reparameterization trick (Prop. 1) in the right-hand side of (22), we make use of Lemma 6 to evaluate x E exp λ 1 R T 0 f(Xt, t) dt λ 1g(XT ) |X0 = x . Particular cases of the result in Lemma 6 have been used in previous works such as [54, 51]. We present a more general form that covers state costs f, as well as stochastic integrals. We also present a simpler proof of the result based on Lagrange multipliers. Lemma 6 (Adjoint method for SDEs). [54, 51] Let X : Ω [0, T] Rd be the uncontrolled process defined in (6), with initial condition X0 = x. 
We define the random process a : Ω [0, T] Rd such that for all ω Ω, using the short-hand a(t) := a(ω, t), dat(ω) = xb(Xt(ω), t)at(ω) xf(Xt(ω), t) dt xh(Xt(ω), t) d Bt, a T (ω) = xg(XT (ω)), we have that x E R T 0 f(Xt(ω), t) dt + R T 0 h(Xt(ω), t), d Bt + g(XT (ω))|X0(ω) = x = E a0(ω) , x E exp R T 0 f(Xt(ω), t) dt R T 0 h(Xt(ω), t), d Bt g(XT (ω)) |X0(ω) = x = E a0(ω) exp R T 0 f(Xt(ω), t) dt R T 0 h(Xt(ω), t), d Bt g(XT (ω)) |X0(ω) = x . Proof. We will use an approach based on Lagrange multipliers. Define a process a : Ω [0, T] Rd such that for any ω Ω, a(ω, ) is differentiable. For a given ω Ω, we can write R T 0 f(Xt(ω), t) dt + R T 0 h(Xt(ω), t), d Bt + g(XT (ω)) = R T 0 f(Xt(ω), t) dt + R T 0 h(Xt(ω), t), d Bt + g(XT (ω)) R T 0 at(ω), (d Xt(ω) b(Xt(ω), t) dt σ(t) d Bt) . By Lemma 7, we have that R T 0 at(ω), d Xt(ω) = a T (ω), XT (ω) a0(ω), X0(ω) R T 0 Xt(ω), dat dt (ω) dt. Hence, x R T 0 f(Xt(ω), t) dt + R T 0 h(Xt(ω), t), d Bt + g(XT (ω)) = x R T 0 f(Xt(ω), t) dt + R T 0 h(Xt(ω), t), d Bt + g(XT (ω)) a T (ω), XT (ω) + a0(ω), X0(ω) + R T 0 at(ω), b(Xt(ω), t) + dat dt (ω), Xt(ω) dt + R T 0 at(ω), σ(t) d Bt = R T 0 x Xt(ω) xf(Xt(ω), t) dt + R T 0 x Xt(ω) xh(Xt(ω), t) d Bt + x XT (ω) xg(XT (ω)) x XT (ω)a T (ω) + x X0(ω)a0(ω) + R T 0 x Xt(ω) xb(Xt(ω), t)at(ω) + x Xt(ω) dat = R T 0 x Xt(ω) xf(Xt(ω), t) + xb(Xt(ω), t)at(ω) + dat + x XT (ω) xg(XT (ω)) a T (ω) + a0(ω) + R T 0 x Xt(ω) xh(Xt(ω), t) d Bt. In the last line we used that x X0(ω) = xx = I. If choose a such that dat(ω) = xb(Xt(ω), t)at(ω) xf(Xt(ω), t) dt xh(Xt(ω), t) d Bt, a T (ω) = xg(XT (ω)), then we obtain that x R T 0 f(Xt(ω), t) dt + R T 0 h(Xt(ω), t), d Bt + g(XT (ω)) = a0(ω), and by the Leibniz rule, x E R T 0 f(Xt(ω), t) dt + R T 0 h(Xt(ω), t), d Bt + g(XT (ω)) = E x R T 0 f(Xt(ω), t) dt + R T 0 h(Xt(ω), t), d Bt + g(XT (ω)) = E a0(ω) , x E exp R T 0 f(Xt(ω), t) dt R T 0 h(Xt(ω), t), d Bt g(XT (ω)) = E x R T 0 f(Xt(ω), t) dt + R T 0 h(Xt(ω), t), d Bt + g(XT (ω)) exp R T 0 f(Xt(ω), t) dt R T 0 h(Xt(ω), t), d Bt g(XT (ω)) = E a0(ω) exp R T 0 f(Xt(ω), t) dt R T 0 h(Xt(ω), t), d Bt g(XT (ω)) . Lemma 7 (Stochastic integration by parts, [60]). Let d Xt = at dt + bt d W 1 t , d Yt = ft dt + gt d W 2 t . where at, bt, ft, gt are continuous square integrable processes adapted to a filtration (Ft)t [0,T ], and W 1, W 2 are Brownian motions adapted to the same filtration. Then, Xt Yt X0Y0 = R t 0 Xs d Ys + R t 0 Ys d Xs + R t 0 d Xs, d Ys = R t 0 Xs d Ys + R t 0 Ys d Xs + R t 0 bs d W 1 s , gs d W 2 s . Remark 3 (Related work to the path-wise reparameterization trick: sensitivity analysis). As shown above, the adjoint method for SDEs is an alternative to the path-wise reparameterization trick. Prior to [54], an array of works developed methods to compute derivatives of functionals of stochastic processes with respect to generic parameters α that appear either in the drift or diffusion coefficients [52]. This area is known as sensitivity analysis, and has been developed largely with financial applications in mind (more specifically, to compute the "Greeks"). In low dimensions, dynamic programming [3] or finite differences [26, 53] work well, but they scale poorly to high dimensions. In high dimensions, several approaches have been proposed (see the section 1 of [29] for a comprehensive although dated overview): The path-wise method (which we refer to as adjoint method) involves taking the gradient αE[f(Xt)] inside of the expectation as E[ αf(Xt)] and was first described by [82]. 
The likelihood method or score method [27, 69] consists in rewriting αE[f(XT )] as E[f(XT )H], where H is a random variable which is equal to α log p(α, XT ), p(α, ) being the density of the law of XT with respect to the Lebesgue measure. [82] provide explicit weights H, under the restrictions that α appears only in the drift of the SDE (and not in the diffusion coefficient) and that the diffusion coefficient is elliptic, using the Girsanov theorem. [29] provide an expression for H in the case where H also appears in the diffusion coefficient, using Malliavin calculus. The estimator of the path-wise reparameterization trick is formally similar to the likelihood method estimator, but it is different in that α is the initial condition of the process, and does not appear either in the drift nor the diffusion coefficient. C.5 Proof of Lemma 1 Proof. Since the equality (28) holds almost surely for the pair (X, B), it must also hold almost surely for (Xv, Bv), which satisfy the same SDE. That is W(Xv, 0) = V (Xv 0 , 0) + 1 2 R T 0 u (Xv s , s) 2 ds λ R T 0 u (Xv s , s), d Bv s , Thus, we obtain that α(v, Xv, B) = exp λ 1W(Xv, 0) λ 1/2 R T 0 v(Xv t , t), d Bv t + λ 1 2 R T 0 v(Xv t , t) 2 dt = exp λ 1V (Xv 0 , 0) λ 1 2 R T 0 u (Xv s , s) 2 ds + λ 1/2 R T 0 u (Xv s , s), d Bv s λ 1/2 R T 0 v(Xv t , t), d Bv t + λ 1 2 R T 0 v(Xv t , t) 2 dt , and this is equal to exp V (Xv 0 , 0) when v = u . Since we condition on Xv 0 = xinit, we have obtained that the random variable takes constant value exp V (xinit, 0) almost surely, which means that its variance is zero. D Optimal reparameterization matrices Theorem 4 (Optimal reparameterization matrices). Let v be an arbitrary control in U. Define the integral operator Tt : L2([t, T]; Rd d) L2([t, T]; Rd d) as [Tt( Mt)](s) = R T t Mt(s )E χ(s , Xv, B)χ(s, Xv, B) α(v, Xv, B) ds , where χ(t, Xv, B) := R T t xf(Xv s , s) ds + g(Xv T ) + (σ 1 t ) (t)v(Xv t , t) R T t xb(Xv s , s)(σ 1 s ) (s)v(Xv t , t) ds R T t xb(Xv s , s)(σ 1 s ) (s) d Bs. If we define Nt(s) = E g(Xv T ) + R T t xf(Xv s , s ) ds χ(t, Xv, B) α(v, Xv, B) , the optimal M = (M t )t [0,T ] is of the form M t (s) = I + R s t M t (s ) ds , where M t is the unique solution of the following Fredholm equation of the first kind: Tt( Mt) = Nt. The proof of (25) shows that minimizing Var(w; M) is equivalent to minimizing T R T 0 w(t, v, Xv, B, Mt) 2 dt α(v, Xv, B) . (57) To optimize with respect to M, it is convenient to reexpress it in terms of M = ( Mt)t [0,T ] as Mt(s) = I + R s t Mt(s ) ds . By Fubini s theorem, we have that R T t Mt(s) xf(Xv s , s) ds = R T t I + R s t Mt(s ) ds xf(Xv s , s) ds = R T t xf(Xv s , s) ds + R T t Mt(s) R T s xf(Xv s , s ) ds ds, R T t (Mt(s) xb(Xv s , s) Mt(s))(σ 1) (s)v(Xv s , s) ds = R T t Mt(s)(σ 1) (s)v(Xv s , s) ds R T t Mt(s) R T s xb(Xv s , s )(σ 1 s ) (s )v(Xv s , s) ds ds, λ1/2 R T t (Mt(s) xb(Xv s , s) Mt(s))(σ 1) (s) d Bs = λ1/2 R T t Mt(s)(σ 1) (s)v(Xv s , s) ds R T t Mt(s) R T s xb(Xv s , s )(σ 1 s ) (s ) d Bs ds . 
Hence, we can rewrite (57) as T R T 0 σ(t) R T t xf(Xv s , s) ds + g(Xv T ) + R T t Mt(s) R T s xf(Xv s , s ) ds + g(Xv T ) + (σ 1) (s)v(Xv s , s) R T s xb(Xv s , s )(σ 1 s ) (s )v(Xv s , s) ds R T s xb(Xv s , s )(σ 1 s ) (s ) d Bs ds 2 dt α(v, Xv, B) The first variation δG δ M ( M) of G at M is defined as the family Q = (Qt)t [0,T ] of matrix-valued functions such that for any collection of matrix-valued functions P = (Pt)t [0,T ], ϵV( M + ϵP)|ϵ=0 = limϵ 0 V( M+ϵP ) V(M) ϵ = P, Q := R T 0 R T t Pt(s), Qt(s) F ds dt, where M + ϵP := ( Mt + ϵPt)t [0,T ]. Now, note that ϵV( M + ϵP)|ϵ=0 = ϵE 1 T R T 0 σ(t) R T t xf(Xv s , s) ds + g(Xv T ) + R T t ( Mt(s) + ϵPt(s)) R T s xf(Xv s , s ) ds + g(Xv T ) + (σ 1) (s)v(Xv s , s) R T s xb(Xv s , s )(σ 1 s ) (s )v(Xv s , s) ds R T s xb(Xv s , s )(σ 1 s ) (s ) d Bs ds 2 dt α(v, Xv, B) ϵ=0 = E 2 T R T 0 σ(t)σ(t) R T t xf(Xv s , s) ds + g(Xv T ) + R T t Mt(s) R T s xf(Xv s , s ) ds + g(Xv T ) + (σ 1) (s)v(Xv s , s) R T s xb(Xv s , s )(σ 1 s ) (s )v(Xv s , s) ds R T s xb(Xv s , s )(σ 1 s ) (s ) d Bs ds , R T t Pt(s) R T s xf(Xv s , s ) ds + g(Xv T ) + (σ 1) (s)v(Xv s , s) R T s xb(Xv s , s )(σ 1 s ) (s )v(Xv s , s) ds R T s xb(Xv s , s )(σ 1 s ) (s ) d Bs ds dt α(v, Xv, B) . (58) If we define χ(s, Xv, B) := R T s xf(Xv s , s ) ds + g(Xv T ) + (σ 1) (s)v(Xv s , s) R T s xb(Xv s , s )(σ 1 s ) (s )v(Xv s , s) ds R T s xb(Xv s , s )(σ 1 s ) (s ) d Bs , we can rewrite (58) as ϵV( M +ϵP)|ϵ=0 =E 1 T R T 0 σ(t)σ(t) R T t xf(Xv s , s)ds+ g(Xv T )+ R T t Mt(s)χ(s, Xv, B) ds , (59) R T t Pt(s)χ(s, Xv, B)ds ds α(v, Xv, B) Now let us reexpress equation (59) as: T R T 0 σσ (t) g(Xv T ) + R T t xf(Xv s , s) + Mt(s)χ(s, Xv, B) ds , R T t Pt(s)χ(s, Xv, B) ds dt α(v, Xv, B) T R T 0 R s 0 Pt(s)χ(s, Xv, B), σσ (t) g(Xv T ) + R T t ( xf(Xv s , s )+ Mt(s )χ(s , Xv, B)) ds dt ds α(v, Xv, B) T R T 0 R s 0 σσ (t) g(Xv T ) + R T t ( xf(Xv s , s ) + Mt(s )χ(s , Xv, B)) ds χ(Xv, s, B) , F dt ds α(v, Xv, B) = R T 0 R s 0 1 T σσ (t)E g(Xv T )+ R T t ( xf(Xv s , s )+ Mt(s )χ(Xv, s , B)) ds χ(Xv, s, B) α(v, Xv, B) , F dt ds. (60) Here, equality (i) holds by Lemma 8 with the choices α(t, s) = Pt(s)χ(Xv, s, B), γ(t) = σσ (t) g(Xv T ) + R T t xf(Xv s , s) + Mt(s)χ(Xv, s, B) ds. Equality (ii) follows from the fact that for any matrix A and vectors b, c, Ab, c = c Ab = Tr(c Ab) = Tr(Abc ) = B, cb F , where , F denotes the Frobenius inner product. The first-order necessary condition for optimality states that at the optimal M , the first variation δG δ M ( M ) is zero. In other words, ϵV( M + ϵP)|ϵ=0 is zero for any P. Hence, the right-hand side of (60) must be zero for any P, which implies that almost everywhere with respect to t [0, T], s [s, T], E g(Xv T )+ R T t ( xf(Xv s , s )+ Mt(s )χ(Xv, s , B)) ds χ(Xv, s, B) α(v, Xv, B) = 0. To derive this, we also used that σ(t) is invertible by assumption. Define the integral operator Tt : L2([t, T]; Rd d) L2([t, T]; Rd d) as [Tt( Mt)](s) = R T t Mt(s )E χ(Xv, s , B)χ(Xv, s, B) α(v, Xv, B) ds If we define Nt(s) = E g(Xv T ) + R T t xf(Xv s , s ) ds χ(Xv, s, B) α(v, Xv, B) , the problem that we need to solve to find the optimal Mt is Tt( Mt) = Nt. This is a Fredholm equation of the first kind. Lemma 8. If α, β : [0, T] [0, T] Rd, γ : [0, T] Rd, δ : [0, T] Rd d are arbitrary integrable functions, we have that R T 0 R T t α(t, s) ds, γ(t) dt = R T 0 R s 0 α(t, s), γ(t) dt ds, Proof. 
We have that: R T 0 R T t α(t, s), γ(t) ds dt (i)= R T 0 R T t 0 α(t, T s), γ(t) ds dt (ii) = R T 0 R t 0 α(T t, T s), γ(T t) ds dt (iii) = R T 0 R T s α(T t, T s), γ(T t) dt ds (iv) = R T 0 R T T s α(T t, s), γ(T t) dt ds (v)= R T 0 R s 0 α(t, s), γ(t) dt ds Here, in equalities (i), (ii), (iv) and (v) we make changes of variables of the form t 7 T t, s 7 T s, s 7 T s . In equality (iii) we use Fubini s theorem. E Control warm-starting We introduce the Gaussian warm-start, a control warm-start strategy that we adapt from [56], and that we use in our experiments in Figure 7. Their work tackles generalized Schrödinger bridge problems, which are different from the control setting in that the final distribution is known and there is no terminal cost. The following proposition, that provides an analytic expression of the control needed for the density of the process to be Gaussian at all times, is the foundation of our method. Proposition 6. Given Z N(0, I) define the random process Y as Yt = µ(t) + Γ(t)Z, where µ(t) Rd, Γ(t) = tΓ(t) Rd d. (61) Define the control u : Rd [0, T] Rd as u(x, t) = σ(t) 1 tµ(t) + tΓ(t) Γ(t) 1 + I (σσ )(t)(ΣΣ ) 1(t) 2t (x µ(t)) b(x, t) .(62) Then, if Γ0 = σ(0), the controlled process Xu defined in equation (2) has the same marginals as Y . That is, for all t [0, T], Law(Yt) = Law(Xu t ). Proof. Following [56], we have that t Xt = tµt + t Γ(t)Z = tµ(t) + ( t Γ(t)) Γ(t) 1(Xt µ(t)), log pt(x) = Σ(t) 1(x µ(t)), Σ(t) = Γ(t) Γ(t) . Now, pt satisfies the continuity equation equation tpt = (( tµ(t) + ( t Γ(t)) Γ(t) 1(x µ(t)))pt) (63) Let D(t) = 1 2σ(t)σ(t) . We want to reexpress (63) as a Fokker-Planck equation of the form tpt = (v(x, t)pt)+Pd i=1 Pd j=1 i j(Dij(t)pt)= (v(x, t)pt)+Pd i=1 i Pd j=1(Dij(t) jpt) = (v(x, t)pt) + (D(t) pt) = (v(x, t)pt) + (D(t) log pt(x)pt) = ((v(x, t) D(t) log pt(x))pt). Hence, we need that v(x, t) D(t) log pt = tµ(t) + ( t Γ(t)) Γ(t) 1(x µ(t)), = vt(x) = tµ(t) + (( t Γ(t)) Γ(t) 1(x µ(t)) + (σσ )(t) 2 log pt(x) = tµ(t) + ( t Γ(t)) Γ(t) 1(x µ(t)) (σσ )(t) 2 Σ(t) 1(x µ(t)). If we let Γ(t) = Γ(t) t, then Σ(t) = tΓ(t)Γ(t) = tΣ(t) and t Γ(t) = tΓ(t) t . That is, v(x, t) = tµ(t) + tΓ(t) t (x µ(t)) (σσ )(t) = tµ(t) + tΓ(t) Γ(t) 1(x µ(t)) + 1 2t(x µ(t)) (σσ )(t)Σ(t) 1 2t (x µ(t)) For v to be finite at t = 0, we need that (σσ )(0)Σ(0) 1 = I, which holds, for example, if Γ(0) = σ(0). Also, to match the form of (2), we need that v(x, t) = b(x, t) + σ(t)u(x, t), = u(x, t) = σ(t) 1 tµt + tΓ(t) Γ(t) 1 + I (σσ )(t)Σ(t) 1 2t (x µt) b(x, t) . The warm-start control is computed as the solution of a Restricted Gaussian Stochastic Optimal Control problem, where we constrain the space of controls to those that induce Gaussian paths as described in Prop. 6. In practice, we learn a linear spline µ = (µ(b))B b=0, where µ(b) Rd, and a linear spline Γ = (Γ(b))B b=0, where Γ(b) Rd d. These linear splines take the role of µ(t) and Σ(t) in (61). Given splines µ and Γ, we obtain the warm-start control using (62); for a given t [0, T), if we let b = Bt/T , b+ = b + 1, = T/B, we have that bµ(t) = (t b )µ(b+)+(b+ t)µ(b ) , d tµ(t) = µ(b+) µ(b ) bΓ(t) = (t b )Γ(b+)+(b+ t)Γ(b ) , d tΓ(t) = Γ(b+) Γ(b ) ˆu(x, t) = σ(t) 1 d tµ(t) + d tΓ(t)bΓ(t) 1 + I (σσ )(t)(bΣbΣ ) 1(t) 2t (x bµ(t)) b(x, t) .(66) Algorithm 3 provides a method to learn the splines µ, Γ. It is a stochastic optimization algorithms in which the spline parameters are updated by sampling Yt in (61) at different times, computing the control cost relying on (66), and taking its gradient. 
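Algorithm 3 below relies on the linear-spline interpolation (64)-(65). For concreteness, here is a minimal sketch of that interpolation step under our own naming conventions (it is not the paper's implementation); the returned values are what gets plugged into the warm-start control formula (66).

```python
import numpy as np

def spline_eval(knots, t, T):
    """Linear-spline value and time derivative at time t, for knots of shape (B+1, ...).

    This is only the interpolation step (64)-(65); plugging the returned value and
    slope for both the mean and the noise splines into (66) gives the warm-start control.
    """
    B = knots.shape[0] - 1
    delta = T / B
    b_lo = min(int(B * t / T), B - 1)      # index b_-; clamped so that b_+ stays in range
    b_hi = b_lo + 1                        # index b_+
    w = (t - b_lo * delta) / delta         # interpolation weight in [0, 1]
    value = (1.0 - w) * knots[b_lo] + w * knots[b_hi]
    slope = (knots[b_hi] - knots[b_lo]) / delta
    return value, slope

# Toy usage with B = 20 knots in dimension d = 3 (illustrative values only).
B, d, T = 20, 3, 1.0
mu_knots = np.random.randn(B + 1, d)
mu_t, dmu_t = spline_eval(mu_knots, t=0.37, T=T)
Gamma_knots = np.stack([np.eye(d) for _ in range(B + 1)])
Gamma_t, dGamma_t = spline_eval(Gamma_knots, t=0.37, T=T)
```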
Algorithm 3 Restricted Gaussian Stochastic Optimal Control
Input: State cost $f(x,t)$, terminal cost $g(x)$, diffusion coefficient $\sigma(t)$, base drift $b(x,t)$, noise level $\lambda$, number of iterations $N$, batch size $m$, number of time steps $K$, number of spline knots $B$, initial mean spline knots $\mu_0 = (\mu_0^{(b)})_{b=0}^{B}$, initial noise spline knots $\Gamma_0 = (\Gamma_0^{(b)})_{b=0}^{B}$.
1: for $n = 0$ to $N-1$ do
2:   Sample $m$ i.i.d. variables $(Z_i)_{i=1}^{m} \sim N(0, I)$ and $m$ times $(t_i)_{i=1}^{m} \sim \mathrm{Unif}([0, T])$.
3:   for $j = 0$ to $K$ do
4:     Set $t_j = jT/K$, and compute $\hat{\mu}_n(t_j)$, $\frac{d}{dt}\hat{\mu}_n(t_j)$, $\hat{\Gamma}_n(t_j)$, $\frac{d}{dt}\hat{\Gamma}_n(t_j)$ according to (64), (65) using $\mu_n, \Gamma_n$.
5:   for $i = 1$ to $m$ do compute $Y_{ij} = \hat{\mu}_n(t_j) + \sqrt{t_j}\,\hat{\Gamma}_n(t_j) Z_i$ and $\hat{u}_n(Y_{ij}, t_j)$ using (66).
6:   Compute $\hat{L}_{\mathrm{RGSOC}}(\mu_n, \Gamma_n) = \frac{1}{m}\sum_{i=1}^{m}\big[\frac{T}{K}\sum_{j=0}^{K-1}\big(\frac{1}{2}\|\hat{u}_n(Y_{ij}, t_j)\|^2 + f(Y_{ij}, t_j)\big) + g(Y_{iK})\big]$.
7:   Compute the gradient of $\hat{L}_{\mathrm{RGSOC}}(\mu_n, \Gamma_n)$ with respect to the spline parameters $(\mu_n, \Gamma_n)$.
8:   Obtain $\mu_{n+1}, \Gamma_{n+1}$ via an Adam update on $\mu_n, \Gamma_n$, respectively (or another stochastic optimization algorithm).
Output: Learned splines $\mu_N, \Gamma_N$, control $\hat{u}_N$.

Once we have access to the restricted control $\hat{u}_N$, we can warm-start the control in Algorithms 1 and 2 by introducing $\hat{u}_N$ as an offset. That is, we parameterize the control as $u_\theta = \hat{u}_N + \bar{u}_\theta$, where $\bar{u}_\theta$ is the learnable network.

F Experimental details and additional plots

F.1 Experimental details

The control $L^2$ error curves show the following quantity:
$\mathbb{E}_{t, \mathbf{P}^{u^*}}\big[\|u^*(X^{u^*}_t, t) - u(X^{u^*}_t, t)\|^2 \, e^{-\lambda^{-1} V(X^{u^*}_0, 0)}\big] \,/\, \mathbb{E}_{t, \mathbf{P}^{u^*}}\big[e^{-\lambda^{-1} V(X^{u^*}_0, 0)}\big].$
That is, we sample trajectories using the optimal control, and compute the error using a Monte Carlo estimate. In all our experiments, the distribution of $X^{u^*}_0$ is a delta, which means that we do not need to compute $V(X^{u^*}_0, 0)$. We keep an exponential moving average (EMA) estimate of the control $L^2$ error, which we show in the plots. To compute it, we sample ten batches of optimally controlled trajectories every 10 training iterations, and we update the quantity with the average of the ten batches, using EMA coefficient 0.02. All other quantities shown in the plots are also smoothed using an EMA with coefficient 0.01, except for the control objective values, which are computed as the average of 65536 samples every 5000 training steps.

For all losses and all settings, we train the control using Adam with learning rate $1 \times 10^{-4}$. For SOCM, we train the reparameterization matrices using Adam with learning rate $1 \times 10^{-2}$. We use batch size $m = 128$ unless otherwise specified. When used, we run the warm-start algorithm (Algorithm 3) with $B = 20$ knots, $K = 200$ time steps, and batch size $m = 512$, and we use Adam with learning rate $3 \times 10^{-4}$ for $N = 60000$ iterations.

QUADRATIC ORNSTEIN-UHLENBECK. The choices for the functions of the control problem are:
$b(x, t) = Ax$, $f(x, t) = x^\top P x$, $g(x) = x^\top Q x$, $\sigma(t) = \sigma_0$,
where $Q$ is a positive definite matrix. Control problems of this form are better known as the linear quadratic regulator (LQR) and they admit a closed-form solution [78, Thm. 6.5.1]. The optimal control is given by $u^*_t(x) = -2\sigma_0^\top F_t x$, where $F_t$ is the solution of the Riccati equation $\frac{dF_t}{dt} + A^\top F_t + F_t A - 2 F_t \sigma_0 \sigma_0^\top F_t + P = 0$ with the final condition $F_T = Q$. Within the QUADRATIC OU class, we consider two settings:
Easy: We set $d = 20$, $A = 0.2 I$, $P = 0.2 I$, $Q = 0.1 I$, $\sigma_0 = I$, $\lambda = 1$, $T = 1$, $x_{\mathrm{init}} = 0.5\,N(0, I)$. We do not use warm-start for any algorithm. We take $K = 50$ time discretization steps, and we use random seed 0.
Hard: We set $d = 20$, $A = I$, $P = I$, $Q = 0.5 I$, $\sigma_0 = I$, $\lambda = 1$, $T = 1$, $x_{\mathrm{init}} = 0.5\,N(0, I)$. We use the Gaussian warm-start (App. E). We take batch size $m = 64$ and $K = 150$ time discretization steps, and we use random seed 0.
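Since the QUADRATIC OU ground truth is defined through the Riccati equation above, here is a minimal sketch of how $F_t$ (and hence $u^*$) can be obtained numerically: integrate the matrix ODE backward from $F_T = Q$ with a standard RK4 scheme and plug the result into $u^*_t(x) = -2\sigma_0^\top F_t x$. The solver and the helper names are ours, not the paper's implementation, and the identity-scaled matrices are placeholders in the spirit of the easy setting (exact values should be taken from the paragraph above).

```python
import numpy as np

# Backward integration of the Riccati equation
#   dF_t/dt + A^T F_t + F_t A - 2 F_t sigma0 sigma0^T F_t + P = 0,  F_T = Q,
# followed by evaluation of the LQR ground-truth control u*(x, t) = -2 sigma0^T F_t x.
d, T, K = 20, 1.0, 1000
dt = T / K
A, P, Q, sigma0 = 0.2 * np.eye(d), 0.2 * np.eye(d), 0.1 * np.eye(d), np.eye(d)

def dF_dt(F):
    # Rearranged Riccati equation: dF/dt = -(A^T F + F A - 2 F sigma0 sigma0^T F + P)
    return -(A.T @ F + F @ A - 2.0 * F @ sigma0 @ sigma0.T @ F + P)

F = Q.copy()
Fs = [F]                                   # F at times T, T - dt, ..., 0 (to be reversed)
for _ in range(K):                         # RK4 steps backward in time
    k1 = dF_dt(F)
    k2 = dF_dt(F - 0.5 * dt * k1)
    k3 = dF_dt(F - 0.5 * dt * k2)
    k4 = dF_dt(F - dt * k3)
    F = F - dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    Fs.append(F)
Fs = Fs[::-1]                              # now Fs[j] approximates F at time j * dt

def u_star(x, t):
    """Ground-truth LQR control at state x and time t (nearest grid point in time)."""
    return -2.0 * sigma0.T @ Fs[min(int(round(t / dt)), K)] @ x

print(u_star(np.full(d, 0.5), t=0.3)[:3])  # first three coordinates, as a smoke test
```

A fine backward grid (here $K = 1000$) keeps the integration accurate; a reference of this kind is what the control $L^2$ error curves are computed against in this setting.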
LINEAR ORNSTEIN-UHLENBECK. The functions of the control problem are chosen as follows:
$b(x, t) = Ax$, $f(x, t) = 0$, $g(x) = \langle \gamma, x \rangle$, $\sigma(t) = \sigma_0$.
The optimal control for this class of problems is given by [59, Sec. A.4]: $u^*_t(x) = -\sigma_0^\top e^{A^\top (T-t)} \gamma$. We use exactly the same functions as [59]: we sample $(\xi_{ij})_{1 \le i,j \le d}$ once at the beginning of the simulation, and set: $d = 10$, $A = I + (\xi_{ij})_{1 \le i,j \le d}$, $\gamma = \mathbf{1}$, $\sigma_0 = I + (\xi_{ij})_{1 \le i,j \le d}$, $T = 1$, $\lambda = 1$, $x_{\mathrm{init}} = 0.5\,N(0, I)$. We take $K = 100$ time discretization steps, and we use random seed 0.

DOUBLE WELL. We also use exactly the same functions as [59], which are the following:
$b(x, t) = -\nabla \Psi(x)$, $\Psi(x) = \sum_{i=1}^d \kappa_i (x_i^2 - 1)^2$, $f(x) = 0$, $g(x) = \sum_{i=1}^d \nu_i (x_i^2 - 1)^2$, $\sigma_0 = I$,
where $d = 10$, and $\kappa_i = 5$, $\nu_i = 3$ for $i \in \{1, 2, 3\}$ and $\kappa_i = 1$, $\nu_i = 1$ for $i \in \{4, \dots, 10\}$. We set $T = 1$, $\lambda = 1$ and $x_{\mathrm{init}} = 0$. We take $K = 200$ time discretization steps, and we use random seed 0. The Double Well problem is actually highly non-trivial, and is multimodal. The only reason we can produce a "ground truth" control to compare to in this setting is that we use significant knowledge of the problem; we analytically reduce it to 1D problems by decoupling each dimension and apply numerical methods to solve the Hamilton-Jacobi-Bellman equation for these 1D problems. It is not a problem where we actually have the ground-truth control in closed form.

PATH INTEGRAL SAMPLER ON MIXTURE OF GAUSSIANS. We set $b(x, t) = 0$, $f(x, t) = 0$, $g(x) = \log(\mu_0(x)/\mu(x)) = -\tfrac{1}{2}\|x\|^2 - \tfrac{d}{2}\log(2\pi) - \log \mu(x)$, where $T = 1$, and $\mu$ is the density of a mixture of two Gaussians with means $\pm e_1$, where $e_1 = (1, 0, \dots, 0)$, and variance $I_d$. Note that we take $\mu$ to be normalized, i.e. $\int \mu(x)\,dx = 1$, or equivalently, $\log Z = \log \int \mu(x)\,dx = 0$. In Figure 1, we use the following Monte Carlo estimator of the control objective at the control $u$:
$\hat{S}_u(X) = \int_0^T \big(\tfrac{1}{2}\|u(X^u_t, t)\|^2 + f(X^u_t, t)\big)\,dt + g(X^u_T) + \int_0^T \langle u(X^u_t, t), dB_t \rangle.$
Note that this estimator is unbiased because $\mathbb{E}[\int_0^T \langle u(X^u_t, t), dB_t \rangle] = 0$. This is known as the Sticking the Landing estimator, as it has zero variance when $u$ is the optimal control [72]. The fact that $\mathbb{E}[\hat{S}_u(X)] \ge \log Z = 0$, with equality when $u = u^*$, is stated as [84, Thm. 4].

F.2 Model architectures

As a general guideline, the control function can be thought of as the analog of the score function in diffusion models; hence, a natural choice for the architecture can be U-Nets or diffusion transformers if the control task is on images, audio or video. Other domains may require different architectures. In the experiments we report, we used the architecture implemented in the class FullyConnectedUNet within the file SOC_matching/models.py. It is a simplified version of the U-Net architecture where both the down-sampling and up-sampling layers are fully connected with ReLU activations, and the horizontal layers are linear transformations. We use three down-sampling and up-sampling steps, with widths 256, 128 and 64 (hence, the first down-sampling step is actually an up-sampling, because the data dimensions in our experiments range from 10 to 20).

The reparameterization matrices have an unusual trait, which is that their input dimension is small (two) while their output dimension is large ($d^2$). Hence, the functions that they need to learn are low-dimensional and therefore easy.
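Because of this, a very small network suffices for $M_t(s)$; the main design constraint is the boundary condition $M_t(t) = \mathrm{Id}$. The following sketch shows one way to enforce it by gating a small ReLU MLP against the identity matrix; the layer sizes and the gating function are our own illustrative choices, while the paper's SigmoidMLP class, described next, is built around the same idea.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TinyMatrixNet:
    """Illustrative parameterization of M_t(s) in R^{d x d} from the pair (t, s).

    A small ReLU MLP produces a candidate matrix; the output is a convex combination
    of that matrix with the identity, with the mixing weight forced to 0 at s = t so
    that M_t(t) = I holds exactly. This is a sketch, not the paper's SigmoidMLP.
    """
    def __init__(self, d, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        self.d = d
        self.W1 = rng.normal(0, 0.3, (hidden, 2));      self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 0.3, (hidden, hidden)); self.b2 = np.zeros(hidden)
        self.W3 = rng.normal(0, 0.3, (d * d, hidden));  self.b3 = np.zeros(d * d)

    def __call__(self, t, s):
        h = np.maximum(self.W1 @ np.array([t, s]) + self.b1, 0.0)   # ReLU layer 1
        h = np.maximum(self.W2 @ h + self.b2, 0.0)                  # ReLU layer 2
        raw = (self.W3 @ h + self.b3).reshape(self.d, self.d)
        gate = sigmoid(4.0 * (s - t)) * 2.0 - 1.0   # equals 0 at s = t, grows toward 1
        return (1.0 - gate) * np.eye(self.d) + gate * raw

M = TinyMatrixNet(d=10)
print(np.allclose(M(0.3, 0.3), np.eye(10)))   # boundary condition M_t(t) = I holds
```

Training such a network amounts to minimizing the variance term $\mathrm{Var}(w; M)$ in the bias-variance decomposition of Prop. 2.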
In our case, we used the architecture implemented in the class SigmoidMLP within the file SOC_matching/models.py, which is essentially a three-layer multilayer perceptron with ReLU activations and output dimension $d^2$, whose output is averaged with the identity matrix using sigmoid weights, in order to enforce that $M_t(t)$ be the identity matrix.

F.3 Additional tables and plots

Table 1 shows the average times per iteration for each algorithm. Each algorithm was run using a 16GB V100 GPU.

Table 1: Time per iteration (exponential moving average) for various algorithms, in seconds per iteration, for the QUADRATIC OU (EASY) experiments (Figure 2).
  SOCM                0.222
  SOCM ($M_t = I$)    0.090
  SOCM adj.           0.099
  Adj.                0.169
  Cross entropy       0.086
  Log-variance        0.117
  Moment              0.087
  Variance            0.086

Figure 4 shows the control objective (1) for the four settings. The error bars for the control objective plots show one standard error of the mean, computed via a Monte Carlo estimate using 65536 trajectories per data point. As expected, SOCM also obtains the lowest values for the control objective, up to the estimation error.

Figure 5 shows the normalized standard deviation of the importance weight for the learned control $u$: $\sqrt{\mathrm{Var}[\alpha(u, X^u, B)]}/\mathbb{E}[\alpha(u, X^u, B)]$. By Lemma 1, when $X^u_0 = x_{\mathrm{init}}$ for an arbitrary $x_{\mathrm{init}}$ (which is the case for all our experiments), this quantity is zero for the optimal control $u^*$. Hence, the normalized standard deviation of $\alpha$ is an alternative metric to measure the optimality of the learned control.

Figure 6 shows an exponential moving average of the norm squared of the gradient for LINEAR OU and DOUBLE WELL. For LINEAR OU, the minimum gradient norm is achieved by the adjoint method, while for DOUBLE WELL it is achieved by the cross-entropy loss. The training instabilities of the adjoint method become apparent as well. Interestingly, in both settings the algorithms with the smallest gradients are not SOCM, which is the algorithm with the smallest error as shown in Figure 3. Understanding this phenomenon is outside the scope of this paper.

Figure 7 shows plots of the control $L^2$ error, the norm squared of the gradient, and the control objective for the QUADRATIC OU (HARD) setting, using a warm-start strategy detailed in App. E. Figure 7 shows that SOCM is once again the algorithm that achieves the lowest error and the smallest gradients. Remark that the warm-start control is a reasonable approximation of the optimal control, as the initial control $L^2$ error is much lower than in the other figures.

Figure 8 shows the value of the training loss for SOCM and its two ablations: SOCM with constant $M_t = I$, and SOCM-Adjoint. For all of these algorithms, the training loss is the sum of the $L^2$ error of the learned control $u$ and the expected conditional variance of the matching vector field $w$. Thus, the difference between the training loss plots and the $L^2$ error plots is the expected conditional variance of $w$. We observe that the expected conditional variance in the QUADRATIC OU setting is orders of magnitude smaller for SOCM than for its two ablations. For LINEAR OU, SOCM and SOCM-Adjoint have similar expected conditional variance; a possible explanation is that the LINEAR OU setting is very simple. In the DOUBLE WELL setting, the SOCM-Adjoint training loss curve has spikes that are probably caused by instabilities of the adjoint method.
These spikes can be attributed mostly to the expected conditional variance term, since the corresponding $L^2$ error curve in Figure 3 does not present them. Figure 9 shows that the instabilities of the adjoint method are inherent to the loss, because they also appear at small learning rates: $3 \times 10^{-5}$ is smaller than the learning rates typically used for Adam, which range from $1 \times 10^{-4}$ to $1 \times 10^{-3}$.

Figure 4: Plots of the control objective for the four settings.
Figure 5: Plots of the normalized standard deviation of the importance weights: $\sqrt{\mathrm{Var}[\alpha(u, X^u, B)]}/\mathbb{E}[\alpha(u, X^u, B)]$.
Figure 6: Plots of the norm squared of the gradient for the LINEAR ORNSTEIN-UHLENBECK and DOUBLE WELL settings.
Figure 7: Plots of the control $L^2$ error, the norm squared of the gradient, and the control objective for the QUADRATIC ORNSTEIN-UHLENBECK (HARD) setting, without using warm-start.
Figure 8: Plots of the training loss for SOCM and its two ablations: SOCM with constant $M_t = I$, and SOCM-Adjoint.
Figure 9: Plots of the control $L^2$ error and the norm squared of the gradient for the adjoint method on DOUBLE WELL, for two different values of the Adam learning rate. The instabilities of the adjoint method persist for small learning rates, signaling an inherent issue with the loss.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: In the abstract and introduction we state that we introduce SOCM, a novel algorithm to solve stochastic optimal control problems, and we claim that it outperforms all existing algorithms in three out of four settings. All of these claims are supported by our paper.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: Yes, in the second-to-last paragraph of Sec. 5 we discuss the main limitation of SOCM (our algorithm), which is that the variance of the importance weight $\alpha$ is too large in certain settings. Regarding the computational efficiency of our algorithm, Table 1 shows a comparison between our algorithm and existing ones.
3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: All the theoretical results stated in the paper are numbered and have a corresponding detailed proof in the appendices. In the main text, we indicate the precise location of the proofs. We provide a proof sketch for Theorem 1 in the main paper, and a full proof in the appendix.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: We provide a description of experimental details in Subsec. F.1, and of model architectures in Subsec. F.2. We also provide a link to the GitHub repository that contains the code for this paper.
5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: We also provide a link to the GitHub repository that contains the code for this paper.

6.
Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: We provide a description of experimental details in Subsec. F.1, and of model architectures in Subsec. F.2.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: Most of the plots in the paper do not contain error bars, because the lines show exponential moving averages, which means that the error is already smoothed out and very small. The error bars for the control objective plots do show the confidence intervals for one standard deviation, computed via a Monte Carlo estimate using 65536 trajectories per data point. They show the standard error of the mean. We state this in Subsec. F.3.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: We explain the type and amount of GPUs we use in Subsec. F.3.
• The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).
9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: We reviewed the NeurIPS Code of Ethics and our research conforms to it. We made sure to preserve our anonymity.
Guidelines:
• The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
• If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
• The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).
10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [No]
Justification: The research presented is foundational, and the code released does not have direct societal impact. Yet, it may serve as the basis to develop algorithms that improve the quality of generative models. We state this in Sec. 5.
Guidelines:
• The answer NA means that there is no societal impact of the work performed.
• If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
• Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
• The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
• The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
• If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).
11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [No]
Justification: The paper poses no such risks.
Guidelines:
• The answer NA means that the paper poses no such risks.
• Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
• Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
• We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.
12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: We did not use existing assets.
Guidelines:
• The answer NA means that the paper does not use existing assets.
• The authors should cite the original paper that produced the code package or dataset.
• The authors should state which version of the asset is used and, if possible, include a URL.
• The name of the license (e.g., CC-BY 4.0) should be included for each asset.
• For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
• If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
• For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
• If this information is not available online, the authors are encouraged to reach out to the asset's creators.
13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [Yes]
Justification: We present code implementing our algorithm. The code folder contains a README which explains how to reproduce the experiments. We describe our algorithm throughout the paper, and provide details in App. F.
Guidelines:
• The answer NA means that the paper does not release new assets.
• Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
• The paper should discuss whether and how consent was obtained from people whose asset is used.
• At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [No]
Justification: The paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
• The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
• Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
• According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [No]
Justification: The paper does not involve crowdsourcing nor research with human subjects.
Guidelines:
• The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
• Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
• We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
• For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.