Deep Generalized Schrödinger Bridge

Guan-Horng Liu¹, Tianrong Chen¹, Oswin So², Evangelos A. Theodorou¹
¹Georgia Institute of Technology, USA   ²Massachusetts Institute of Technology, USA
{ghliu, tianrong.chen, evangelos.theodorou}@gatech.edu   oswinso@mit.edu
(These authors contributed equally. Work was done while Oswin was at Georgia Tech.)

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract

Mean-Field Game (MFG) serves as a crucial mathematical framework for modeling the collective behavior of individual agents interacting stochastically with a large population. In this work, we aim at solving a challenging class of MFGs in which the interacting preferences need not be differentiable, hence their gradients may be unavailable to the solver, and in which the population is urged to converge exactly to some desired distribution. These setups are, despite being well motivated for practical purposes, complicated enough to paralyze most (deep) numerical solvers. Nevertheless, we show that the Schrödinger Bridge, as an entropy-regularized optimal transport model, can be generalized to accept mean-field structures, hence solving these MFGs. This is achieved via the theory of Forward-Backward Stochastic Differential Equations, which, intriguingly, leads to a computational framework with a structure similar to Temporal Difference learning. As such, it opens up novel algorithmic connections to Deep Reinforcement Learning that we leverage to facilitate practical training. We show that our proposed objective function provides necessary and sufficient conditions for the mean-field problem. Our method, named Deep Generalized Schrödinger Bridge (DeepGSB), not only outperforms prior methods in solving classical population navigation MFGs, but is also capable of solving 1000-dimensional opinion depolarization, setting a new state-of-the-art numerical solver for high-dimensional MFGs. Our code is available at https://github.com/ghliu/DeepGSB.

1 Introduction

Figure 1: DeepGSB paves a new algorithmic connection between Schrödinger Bridge (SB) and model-based Deep RL for solving high-dimensional MFGs (panel: solving a 1000-dim opinion MFG).

On a scorching morning, you navigated through the crowds toward the office. As you walked through a crosswalk, you were pondering the growing public opinion on a new policy over the past week, and were suddenly interrupted by honking as the traffic started moving...

From navigation in crowds to propagation of opinions and traffic movement, examples of individual agents interacting with a large population are widespread in daily life and, due to their prevalence, appear as an important subject in multidisciplinary scientific areas, including economics [1, 2], opinion modeling [3-5], robotics [6, 7], and, more recently, machine learning [8-10]. Mathematically, the decision-making processes under these scenarios can be characterized by the Mean-Field Game [11-13] (MFG), which models a noncooperative differential game on a finite horizon between a continuum population of rational agents.

Table 1: Comparison to existing methods w.r.t. various desired features in Mean-Field Games (MFGs): continuous state space, stochastic MF dynamics (2), convergence to the exact ρtarget, discontinuous MF interaction F, and the highest dimension demonstrated. Our DeepGSB is capable of solving a much wider class of MFGs in higher-dimensional state spaces.

  Method                 Highest dimension
  Ruthotto et al. [14]   100
  Lin et al. [15]        100
  Chen [16]¹             2
  DeepGSB (ours)         1000

¹Precisely, Chen [16] considered a discontinuous yet non-MF interaction, F := F(x), on a discrete state space.
Let u(x, t) be the value function, also known as the optimal cost-to-go, that governs the agents' policies at each state x ∈ R^d and time t ∈ [0, T], and denote the resulting population density by ρ(·, t) ∈ P(R^d), where P(R^d) is the set of probability measures on R^d. At the Nash equilibrium, where no agent has an incentive to change his/her decision, the MFG, in its most general form, solves the following partial differential equations (PDEs):

  −∂u(x,t)/∂t + H(x, ∇u, ρ) − ½σ²Δu = F(x, ρ),       u(x, T) = G(x, ρ(·, T)),
  ∂ρ(x,t)/∂t − ∇·(ρ ∇_p H(x, ∇u, ρ)) − ½σ²Δρ = 0,    ρ(x, 0) = ρ0(x),              (1)

where ∇, ∇·, and Δ are respectively the gradient, divergence, and Laplacian operators.² These two PDEs are respectively known as the Hamilton-Jacobi-Bellman (HJB) and Fokker-Planck (FP) equations, which characterize the evolution of u(x, t) and ρ(x, t). They are coupled with each other through the Hamiltonian H(x, p, ρ) : R^d × R^d × P(R^d) → R, which describes the dynamics of the game, and the mean-field interaction F(x, ρ) : R^d × P(R^d) → R, which quantifies the agent's preference when interacting with the population. The terminal condition G typically penalizes deviations from some desired target distribution ρtarget, e.g., G = DKL(ρ(·, T)‖ρtarget(·)). Given a solution (u, ρ) to (1), each agent acts accordingly and follows a stochastic differential equation (SDE)

  dXt = −∇_p H(Xt, ∇u(Xt, t), ρ(·, t)) dt + σ dWt,   X0 ∼ ρ0,                      (2)

where Wt ∈ R^d is the Wiener process and σ ∈ R is some diffusion scalar. At the mean-field limit, i.e., when the number of agents goes to infinity, the collective behavior of (2) yields the density ρ(·, t).

²These operators are taken w.r.t. x unless otherwise noted. See Appendix A.1 for the notational summary.
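For intuition, the agent dynamics (2) can be simulated for a finite population with the Euler-Maruyama scheme. The sketch below is a minimal illustration (not the solver developed in this paper), assuming a generic callable drift(x, t) that returns −∇_p H evaluated on a batch of particles; the empirical distribution of the particles then approximates ρ(·, t) at the mean-field limit.

```python
import torch

def simulate_population(drift, x0, sigma, T=1.0, n_steps=100):
    """Euler-Maruyama rollout of the agent SDE (2) for N particles.

    drift(x, t) is assumed to return -grad_p H(x, grad u(x, t), rho(., t))
    for a batch x of shape (N, d).
    """
    dt = T / n_steps
    x = x0.clone()                      # (N, d) particles sampled from rho_0
    snapshots = [x.clone()]
    for k in range(n_steps):
        t = k * dt
        noise = torch.randn_like(x) * dt ** 0.5
        x = x + drift(x, t) * dt + sigma * noise
        snapshots.append(x.clone())
    return snapshots                    # snapshots[k] approximates rho(., k * dt)
```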
Numerical methods for solving (1) have advanced rapidly with the aid of machine learning. Seminal works such as Ruthotto et al. [14] and Lin et al. [15] approximated (u, ρ) with deep neural networks (DNNs) and directly penalized violations of the PDEs. Despite showing preliminary successes, the underlying dynamics (2) were either degenerate (e.g., σ := 0) [14], or completely discarded by instead regressing network outputs over the entire state space [15], which can scale unfavorably as the dimension d grows. An alternative that avoids both limitations, i.e., that keeps the full stochastic dynamics in (2) while remaining computationally scalable, is to recast these PDEs into a set of forward-backward SDEs (FBSDEs) by applying the nonlinear Feynman-Kac lemma [17-19]. FBSDE analysis appears extensively in the theoretical study of MFGs [20-23], yet the development of scalable FBSDE-based solvers has remained, surprisingly, limited. Our work contributes to this direction.

Since ρtarget is known a priori, in many cases there is direct interest in seeking an optimal policy that guides the agents from an initial distribution ρ0 to the exact ρtarget, while respecting the structure of the MFG, particularly the MF interaction F(x, ρ). Lifting (1) to this setup, however, is highly nontrivial. Indeed, replacing the soft penalty u(x, T) = DKL(ρ‖ρtarget) with a hard distributional constraint ρ(x, T) = ρtarget yields an HJB equation whose boundary condition can only be defined implicitly through the FP equation, which now carries two distributional constraints and resembles an optimal transport problem. As such, despite being well motivated, most prior methods struggle to extend to this setup. In this work, we show that the Schrödinger Bridge (SB), as an entropy-regularized optimal transport problem [24-28], provides an elegant recipe for solving this challenging class of MFGs with distributional boundary constraints (ρ0, ρtarget).

Although SB is traditionally set up with F := 0 [29-31], we show that SB-FBSDE [27], an FBSDE-based method for solving SB, can be generalized to accept nontrivial F, hence solving the MFG. Interestingly, the new FBSDE system admits a computational structure similar to temporal difference (TD) learning, leading to a framework that narrows the gap between SB and Deep Reinforcement Learning (Deep RL); see Fig. 1. This connection enables our method to take advantage of Deep RL techniques, such as target networks, replay buffers, and actor-critic architectures, and, more importantly, to handle a wide class of MF interactions that need be neither continuous nor differentiable. This is in contrast to most existing works, which require a differentiable [14, 15] or quadratic [32] structure on F, or discretize the state space [16]. We validate our method, called Deep Generalized Schrödinger Bridge (DeepGSB), on various challenging MFGs from crowd navigation to high-dimensional opinion depolarization (where d = 1000), setting a state-of-the-art record in the area of numerical MFG solvers.

In summary, we present the following contributions.
- We present a novel numerical method, rooted in the Schrödinger Bridge (SB), for solving a challenging class of Mean-Field Games in which the population needs to converge exactly to the target distribution.
- The resulting method, DeepGSB, generalizes prior SB results to accept flexible (e.g., non-differentiable) mean-field interactions and enjoys modern training techniques from Deep RL.
- DeepGSB achieves promising empirical results in navigating crowd motion and depolarizing 1000-dimensional opinion dynamics, setting a new state of the art among numerical MFG solvers.

2 Preliminary on Schrödinger Bridge (SB)

Figure 2: Simulation of the forward (3a) and backward (3b) SDEs in SB, which are the minimum-energy solutions when (Ψ, Ψ̂) obey the PDEs in (4). (Axes: time coordinates t and s.)

The SB problem was originally introduced in the 1930s for quantum mechanics [29, 33] and later drew broader interest through its connections to optimal transport and control [34-37]. Given a pair of boundary distributions (ρ0, ρT), SB seeks an optimal pair of stochastic processes of the forms:

  dXt = [f(Xt, t) + σ²∇log Ψ(Xt, t)] dt + σ dWt,      X0 ∼ ρ0,        (3a)
  dX̄s = [−f(X̄s, s) + σ²∇log Ψ̂(X̄s, s)] ds + σ dW̄s,    X̄0 ∼ ρT.        (3b)

While Xt is a standard stochastic process starting from ρ0, X̄s evolves along the reversed time coordinate s := T − t from ρT. The base drift f and diffusion σ are typically known a priori and related to the Hamiltonian H. Suppose Ψ, Ψ̂ ∈ C^{2,1}(R^d, [0, T]) solve the following coupled PDEs,

  ∂Ψ(x,t)/∂t = −∇Ψᵀf − ½σ²ΔΨ
  ∂Ψ̂(x,t)/∂t = −∇·(Ψ̂f) + ½σ²ΔΨ̂        s.t.  Ψ(·, 0)Ψ̂(·, 0) = ρ0,  Ψ(·, T)Ψ̂(·, T) = ρT,     (4)

then the theory of SB suggests that the SDEs in (3) are the optimal solution to an entropy-regularized (i.e., minimum-control) optimization problem. Furthermore, the path-wise measure induced by (3a) along t ∈ [0, T] equals, almost surely, the path-wise measure induced by (3b) along s := T − t. In other words, the two SDEs in (3) can be thought of as reversed processes of each other; hence we also have XT ∼ ρT and X̄T ∼ ρ0 (see Fig. 2). Due to the coupling constraints at the boundaries, solving (4) is no easier than solving (1).
Fortunately, recent advances [27, 28] have demonstrated a computationally scalable numerical method via the application of the nonlinear Feynman-Kac (FK) lemma, a mathematical tool that recasts certain classes of PDEs into sets of forward-backward SDEs (FBSDEs) via some transformation. These nonlinear FK transformations are parametrized in SB-FBSDE [27] by DNNs with parameters θ and φ, i.e.,

  Zθ(·, ·) ≈ σ∇log Ψ(·, ·)   and   Ẑφ(·, ·) ≈ σ∇log Ψ̂(·, ·),        (5)

and the FBSDEs resulting from (4) and (5) yield the following objectives (see Appendix A.2):

  LIPF(θ) = ∫₀ᵀ E[ ½‖Zθ(X̄s, s)‖²₂ + Zθ(X̄s, s)ᵀ Ẑφ(X̄s, s) + ∇·(σZθ(X̄s, s) + f) ] ds,    (6a)
  LIPF(φ) = ∫₀ᵀ E[ ½‖Ẑφ(Xt, t)‖²₂ + Ẑφ(Xt, t)ᵀ Zθ(Xt, t) + ∇·(σẐφ(Xt, t) − f) ] dt.     (6b)

The following lemma, a direct consequence of Vargas [38], suggests that these objectives can be interpreted as KL divergences between the parametrized path measures.

Lemma 1. Let qθ and qφ be the path-wise densities of the parametrized forward and backward SDEs

  dXθ_t = [f(Xθ_t, t) + σZθ(Xθ_t, t)] dt + σ dWt,    dX̄φ_s = [−f(X̄φ_s, s) + σẐφ(X̄φ_s, s)] ds + σ dW̄s.

Then DKL(qθ‖qφ) and DKL(qφ‖qθ) equal LIPF(φ) and LIPF(θ), respectively, up to additive terms independent of the optimized parameter.

Proof. See Appendix A.3.2.

Lemma 1 suggests that alternating minimization between LIPF(φ) and LIPF(θ) is equivalent to performing iterative KL projection [39], and is hence equivalent to applying the Iterative Proportional Fitting (IPF) algorithm [40] to solve parametrized SBs [24, 25].

3 Deep Generalized Schrödinger Bridge (DeepGSB)

3.1 Connection between the coupled PDEs in MFG and SB

Figure 3: Connection between the different coupled PDEs appearing in MFG, SB, and DeepGSB: imposing the distributional constraint turns the MFG (1) into (7), adding the MF interaction turns the SB (4) into (9), and the Hopf-Cole transform links (7) and (9), forming the DeepGSB framework.

We begin by stating our problem of interest, an MFG with hard distributional constraints (ρ0, ρtarget), in its mathematical form. Similar to prior works [14, 15], we adopt the control-affine Hamiltonian, H(x, ∇u, ρ) := ½‖σ∇u‖² − ∇uᵀf(x, ρ), given some base drift f and diffusion scalar σ. Substituting this control-affine Hamiltonian into the PDEs in (1) yields

  −∂u(x,t)/∂t + ½‖σ∇u‖² − ∇uᵀf − ½σ²Δu = F(x, ρ),
  ∂ρ(x,t)/∂t − ∇·(ρ(σ²∇u − f)) − ½σ²Δρ = 0,     ρ(x, 0) = ρ0(x),  ρ(x, T) = ρtarget(x),     (7)

which, as briefly discussed in Sec. 1, differs from (1) in that the boundary condition of the HJB, u(x, T), is now absorbed into the FP equation and defined implicitly through ρ(x, T) = ρtarget(x). Since an analytic conversion between the boundary conditions of (1) and (7) exists only for highly degenerate³ cases [41], this seemingly innocuous change suffices to paralyze most prior methods.⁴ Nevertheless, as (7) now describes a transformation between two distributions (from the FP equation) while obeying some optimality (from the HJB equation), it suggests a deeper connection to optimal transport, and hence to SB.

³Zhang and Chen [41] considered F := 0, f := f(x), and ρ0 a degenerate Dirac delta distribution.
⁴For completeness, we note that when the base drift is independent of the density, f := f(x), and the mean-field preference F(ρ), whose functional derivative gives F(x, ρ) = δF/δρ, is convex in ρ, the variational optimization inherited in (7) remains convex. In these cases, the discretized problems converge to the global solution [16, 32]. However, for generic mean-field dynamics, such as the polarized f(x, ρ) in our (18), the problem is in general non-convex; hence only local convergence can be established (see, e.g., Remark 1 in [16]).

To bridge these new MFG PDEs (7) to the PDEs appearing in SB (4), we follow standard treatment [30] and apply the Hopf-Cole transform [42, 43]:

  Ψ(x, t) := exp(−u(x, t)),   Ψ̂(x, t) := ρ(x, t) exp(u(x, t)),        (8)

which, after some algebra (see Appendix A.4.1 for details), yields the following PDEs:

  ∂Ψ(x,t)/∂t = −∇Ψᵀf − ½σ²ΔΨ + FΨ
  ∂Ψ̂(x,t)/∂t = −∇·(Ψ̂f) + ½σ²ΔΨ̂ − FΨ̂        s.t.  Ψ(·, 0)Ψ̂(·, 0) = ρ0,  Ψ(·, T)Ψ̂(·, T) = ρtarget.     (9)

It can be seen that (9) generalizes (4) by introducing the MF interaction F. Let (Ψ, Ψ̂) be the solution to these new MF-extended PDEs in (9), and recall the Hamiltonian adopted in (7); one finds that −∇_p H(Xt, ∇u, ρ) = f − σ²∇u = f + σ²∇log Ψ. That is, the agent's dynamics (2) coincide with the forward SDE (3a) in SB. Hence, we have connected the MFG (1) and SB (4) frameworks through the PDEs in (7) and (9); see Fig. 3.
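To make the transform (8) concrete, the minimal sketch below (PyTorch-style, pointwise evaluation) maps the MFG variables (u, ρ) to the SB potentials (Ψ, Ψ̂) and back; by construction, the product of the potentials recovers the density, ΨΨ̂ = ρ.

```python
import torch

def hopf_cole(u, rho):
    """Hopf-Cole transform (8): (u, rho) -> (Psi, Psi_hat), evaluated pointwise."""
    psi = torch.exp(-u)
    psi_hat = rho * torch.exp(u)
    return psi, psi_hat

def inverse_hopf_cole(psi, psi_hat):
    """Inverse map: u = -log Psi and rho = Psi * Psi_hat."""
    return -torch.log(psi), psi * psi_hat
```

In practice Ψ can underflow for large u, which is one reason the potentials are handled in log-space in the next section (Y = log Ψ, Ŷ = log Ψ̂).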
3.2 Generalized SB-FBSDEs with mean-field interaction

With (9), we are ready to present our result that generalizes the prior FBSDE theory for SB to MF interactions.

Theorem 2 (Generalized SB-FBSDEs). Suppose Ψ, Ψ̂ ∈ C^{2,1} and let f, F satisfy the usual growth and Lipschitz conditions [44, 45]. Consider the following nonlinear FK transformations applied to (9):

  Yt ≡ Y(Xt, t) = log Ψ(Xt, t),     Zt ≡ Z(Xt, t) = σ∇log Ψ(Xt, t),
  Ŷt ≡ Ŷ(Xt, t) = log Ψ̂(Xt, t),     Ẑt ≡ Ẑ(Xt, t) = σ∇log Ψ̂(Xt, t),        (10)

where Xt follows (3a) with X0 ∼ ρ0. Then, the resulting FBSDE system takes the form:

  FBSDEs w.r.t. (3a):
  dXt = (ft + σZt) dt + σ dWt,                                           (11a)
  dYt = (½‖Zt‖² + Ft) dt + Ztᵀ dWt,                                      (11b)
  dŶt = (½‖Ẑt‖² + ∇·(σẐt − ft) + Ẑtᵀ Zt − Ft) dt + Ẑtᵀ dWt.               (11c)

Now, consider a similar transformation of (9) but w.r.t. the reversed SDE X̄s (3b) with X̄0 ∼ ρtarget, i.e., Ys ≡ Y(X̄s, s) = log Ψ(X̄s, s), etc. The resulting FBSDE system reads:

  FBSDEs w.r.t. (3b):
  dX̄s = (−fs + σẐs) ds + σ dW̄s,                                          (12a)
  dYs = (½‖Zs‖² + ∇·(σZs + fs) + Zsᵀ Ẑs − Fs) ds + Zsᵀ dW̄s,               (12b)
  dŶs = (½‖Ẑs‖² + Fs) ds + Ẑsᵀ dW̄s.                                       (12c)

Since Yt + Ŷt = log ρ(Xt, t) by construction, the functions ft and Ft in (11) take the arguments ft := f(Xt, exp(Yt + Ŷt)) and Ft := F(Xt, exp(Yt + Ŷt)). Similarly, we have fs := f(X̄s, exp(Ys + Ŷs)) and Fs := F(X̄s, exp(Ys + Ŷs)) in (12).

Proof. See Appendix A.3.3.

Just as (9) generalizes (4), our results in Theorem 2 generalize those appearing in the vanilla SB-FBSDE [27] (see (22) in Appendix A.2) by introducing a nontrivial MF interaction F. Despite looking more complex than the original PDEs (9), these FBSDE systems, namely (11) and (12), stand as the foundation for developing scalable numerical methods, as they describe precisely how the values of Y ≡ log Ψ and Ŷ ≡ log Ψ̂ shall change along the optimal SDEs (notice, e.g., that both Yt and Zt are functions of Xt from (10)). Essentially, the nonlinear FK lemma provides a stochastic representation (in terms of Y and Ŷ) of the PDEs in (9) by expanding them w.r.t. the optimal SDEs in (3) using Itô's formula [46]. Consequently, rather than solving the PDEs (9) over the entire function space as in prior work [15], it suffices to solve them locally around the high-probability regions characterized by (3), which leads to computationally scalable methods.

3.3 Design of the computational framework

In view of Theorem 2, it suffices to approximate Yθ ≈ Y and Ŷφ ≈ Ŷ with some parametrized functions (we use DNNs), since one may infer Zθ ≈ σ∇Yθ and Ẑφ ≈ σ∇Ŷφ, as suggested by (10), and then solve for (Xt, X̄s) via (11a, 12a). Below, we explore options for designing training objectives for (θ, φ), with the aim of encouraging (Yθ, Ŷφ) to satisfy the FBSDE systems in (11, 12).
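As a concrete illustration of this parametrization, the gradient Zθ = σ∇Yθ can be obtained from a scalar-output network by automatic differentiation. The sketch below is a minimal PyTorch example; the architecture is a placeholder rather than the one used in our experiments (which additionally adopt sinusoidal time embeddings, cf. Sec. 4).

```python
import torch
import torch.nn as nn

class CriticY(nn.Module):
    """Scalar potential Y_theta(x, t) approximating log Psi."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, t):
        # append the (scalar) time as an extra input feature
        t_col = torch.full((x.shape[0], 1), float(t), device=x.device)
        return self.net(torch.cat([x, t_col], dim=-1)).squeeze(-1)

def z_from_y(critic, x, t, sigma):
    """Infer Z_theta(x, t) = sigma * grad_x Y_theta(x, t) via autograd, as in (10)."""
    x = x.detach().requires_grad_(True)
    y = critic(x, t).sum()
    (grad_x,) = torch.autograd.grad(y, x, create_graph=True)
    return sigma * grad_x
```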
Option 1: LIPF. Given how Theorem 2 generalizes the result in [27] (see (22) in Appendix A.2), it is natural to wonder whether adopting the computation used to derive (6), e.g., LIPF(φ) := ∫ E[dYθ_t + dŶφ_t],⁵ suffices to reach the FBSDE (11). This is, unfortunately, not the case, as one can verify that

  L^(11)_IPF(φ) := ∫ E[dYθ_t + dŶφ_t] = ∫ E[ ½‖Ẑφ_t + Zθ_t‖² + ∇·(σẐφ_t − f) ] dt = L^(6b)_IPF(φ),

up to terms independent of φ. Although (11) differs from (22) by the extra terms +Ft in (11b) and −Ft in (11c), the two terms cancel in the sum dYθ_t + dŶφ_t, thereby yielding the same objective, which does not depend on F. This implies that naively optimizing LIPF from [27] is insufficient for solving FBSDE systems with nontrivial F. We must seek additional objectives, if any, in order to respect the MF structure.

⁵Additionally, we have LIPF(θ) := ∫ E[dYθ_s + dŶφ_s]; see (24) in Appendix A.2 for the derivation.

Option 2: LIPF + temporal difference objective LTD. Let us revisit the relation between the FBSDEs (11, 12) and their PDE counterparts, but this time through the HJB in (7). Take (Xt, Yt) for example: the fact that Yt = log Ψ(Xt, t) = −u(Xt, t) suggests an alternative interpretation of Yt as the stochastic representation of the HJB equation, which, crucially, can be seen as the continuous-time analogue of the Bellman equation [47]. Indeed, discretizing (11b) with some fixed step size δt yields

  Yθ_{t+δt} = Yθ_t + (½‖Zθ_t‖² + Ft) δt + Zθ_tᵀ δWt,   δWt ∼ N(0, δt I),        (13)

which resembles a (non-discounted) temporal difference (TD) [48, 49], except that, in addition to the standard rewards (in terms of control and state costs), we also have a stochastic term. This stochastic term, which vanishes in the vanilla Bellman equation upon taking expectations, plays a crucial role in characterizing the inherited stochasticity of the value function Yt. With this interpretation in mind, we can construct suitable TD targets for our FBSDE systems, as shown below.

Proposition 3 (TD objectives LTD for (11, 12)). The single-step TD targets take the forms:

  T̂D^single_{t+δt} := Ŷφ_t + (½‖Ẑφ_t‖² + ∇·(σẐφ_t − ft) + Ẑφ_tᵀ Zθ_t − Ft) δt + Ẑφ_tᵀ δWt,   (14a)
  TD^single_{s+δs} := Yθ_s + (½‖Zθ_s‖² + ∇·(σZθ_s + fs) + Zθ_sᵀ Ẑφ_s − Fs) δs + Zθ_sᵀ δWs,    (14b)

with T̂D_0 := log ρ0 − Yθ_0 and TD_0 := log ρtarget − Ŷφ_0, and the multi-step TD targets take the forms:

  T̂D^multi_{t+δt} := T̂D_0 + Σ_{τ=0}^{t} δŶτ,   TD^multi_{s+δs} := TD_0 + Σ_{τ=0}^{s} δYτ,     (15)

where δŶt := T̂D^single_{t+δt} − Ŷt and δYs := TD^single_{s+δs} − Ys. Given these TD targets, we can construct

  LTD(θ) = Σ_{s=0}^{T} E‖Yθ(X̄s, s) − TDs‖ δs,   LTD(φ) = Σ_{t=0}^{T} E‖Ŷφ(Xt, t) − T̂Dt‖ δt.   (16)

Proof. See Appendix A.3.4.

It can be readily seen that the single-step TD targets in (14) obey a structure similar to (13), except that they derive from the other SDEs (11c, 12b). Doing so reduces the computational overhead, as the related objectives for each parameter, e.g., LIPF(θ) and LTD(θ), can be evaluated from the same expectation. In practice, we find that the multi-step objectives often yield better performance, as consistently observed in the Deep RL literature [50-52]. Additionally, common practices such as computing T̂Dt and TDs with exponential moving averages (i.e., target values) and replay buffers also help stabilize training. Finally, the fact that the TD targets in (16) appear as the regressands implies that, from a computational standpoint, the MF interaction F need not be continuous or differentiable.
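To illustrate, the single-step target (14a) and the corresponding regression loss in (16) can be assembled from quantities recorded along one simulated forward trajectory. The sketch below is a simplified rendering that assumes precomputed per-step tensors (including a hypothetical div_term for ∇·(σẐφ − f)) and omits target networks, replay buffers, and the multi-step variant.

```python
import torch

def td_target_single(Y_hat, Z_hat, Z, div_term, F_t, dW, dt):
    """Single-step TD target (14a) for the backward potential hat{Y}_phi.

    Y_hat, div_term, F_t: (N,) values at time t;  Z_hat, Z, dW: (N, d) tensors.
    """
    drift = 0.5 * (Z_hat ** 2).sum(-1) + div_term + (Z_hat * Z).sum(-1) - F_t
    return Y_hat + drift * dt + (Z_hat * dW).sum(-1)

def loss_td_phi(Y_hat_pred, td_target):
    """TD regression loss in (16): match hat{Y}_phi(X_t, t) to its (detached) target."""
    return ((Y_hat_pred - td_target.detach()) ** 2).mean()
```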
Necessity and sufficiency of LIPF + LTD. It remains unclear whether appending LTD to the objective suffices for (Yθ, Ŷφ) to satisfy the FBSDEs (11, 12). Below, we provide a positive result.

Proposition 4. The functions (Yθ, Zθ, Ŷφ, Ẑφ) satisfy the FBSDEs (11, 12) in Theorem 2 if and only if they are the minimizers of the combined loss L(θ, φ) := LIPF(φ) + LTD(φ) + LIPF(θ) + LTD(θ).

Proof. See Appendix A.3.5.

Algorithm 1 Deep Generalized Schrödinger Bridge (DeepGSB)
  Input: (Yθ, Ŷφ, σ∇Yθ, σ∇Ŷφ) for the critic, or (Yθ, Ŷφ, Zθ, Ẑφ) for the actor-critic parametrization.
  repeat
    Sample Xθ ≡ {Xθ_t, Zθ_t, δWt}_{t∈[0,T]} from the forward SDE (11a); add Xθ to the replay buffer B.
    for k = 1 to K do
      Sample on-policy Xθ_on and off-policy Xθ_off respectively from Xθ and B.
      Compute L(φ) = LIPF(φ; Xθ_on) + LTD(φ; Xθ_on) + LTD(φ; Xθ_off) + LFK(φ; Xθ_on).
      Update φ with the gradient ∇φ L(φ).
    end for
    Sample X̄φ ≡ {X̄φ_s, Ẑφ_s, δWs}_{s∈[0,T]} from the backward SDE (12a); add X̄φ to the replay buffer B.
    for k = 1 to K do
      Sample on-policy X̄φ_on and off-policy X̄φ_off respectively from X̄φ and B.
      Compute L(θ) = LIPF(θ; X̄φ_on) + LTD(θ; X̄φ_on) + LTD(θ; X̄φ_off) + LFK(θ; X̄φ_on).
      Update θ with the gradient ∇θ L(θ).
    end for
  until converged

Proposition 4 asserts the validity of the combined objectives LIPF + LTD in solving the generalized SB-FBSDEs in Theorem 2, and hence the MFG problem in (7). It should be interpreted as follows: the minimizer of LIPF, as implied by Lemma 1, always establishes a valid bridge transporting between the boundary distributions ρ0 and ρtarget; yet, without further conditions, this bridge need not be a Schrödinger bridge. While generic IPF and Sinkhorn iterations [16, 32], upon proper initialization or discretization, provide one way to ensure convergence toward the SB, our Proposition 4 suggests an alternative by introducing the TD objectives LTD. This gives us the flexibility to handle generalized SBs in MFGs where F becomes nontrivial or non-convex. Further, it naturally handles non-differentiable F, which can offer extra benefits in many cases.

Option 3: LIPF + LTD + FK objective LFK. Though it seems sufficient to parametrize (Yθ, Ŷφ) and then infer Zθ := σ∇Yθ and Ẑφ := σ∇Ŷφ, as suggested in the previous options, in practice we find that parametrizing (Zθ, Ẑφ) with two additional DNNs and imposing the following FK objective, i.e.,

  LFK(θ) = Σ_{s=0}^{T} E‖σ∇Yθ(X̄s, s) − Zθ(X̄s, s)‖ δs,   LFK(φ) = Σ_{t=0}^{T} E‖σ∇Ŷφ(Xt, t) − Ẑφ(Xt, t)‖ δt,

often offers extra robustness. These objectives aim to ensure that the nonlinear FK transformations (10) hold. Our DeepGSB is summarized in Alg. 1. Hereafter, we refer to Options 2 and 3 respectively as DeepGSB critic and DeepGSB actor-critic, as Yθ and Zθ play roles similar to those of critic and actor networks [53, 54].

Remarks on convergence. Despite Alg. 1 sharing a similar alternating structure with IPF [16, 24, 27], the combined objective, e.g., L(φ) ∝ DKL(ρθ‖ρφ) + Eρθ[LTD(φ)] ≠ DKL(ρφ‖ρθ), is not equivalent to the (reversed) KL divergence appearing in IPF. Instead, DeepGSB may be closer to trust-region optimization [55], as both iteratively update the policy using samples from the previous stage while being subject to some KL penalty: π^(i+1) = argmin_π DKL(π^(i)‖π) + Eπ^(i)[L(π)]. Hence, one can expect DeepGSB to admit similar monotonic-improvement and local convergence properties. We leave more discussion to Appendix A.4.2.
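For readers more familiar with Deep RL code, the alternating stages of Alg. 1 can be rendered compactly as below. This is a schematic sketch rather than the released implementation: sample_forward, sample_backward, loss_phi, and loss_theta are placeholder callables wrapping (11a)/(12a) and the combined objectives, and target networks, multi-step TD, and LFK bookkeeping are omitted.

```python
import random

def train_deepgsb(phi_opt, theta_opt, sample_forward, sample_backward,
                  loss_phi, loss_theta, n_stages=100, K=10):
    """Alternating (IPF-like) training loop of Alg. 1 with a simple replay buffer."""
    buffer = []                                # replay buffer B of past trajectories
    for _ in range(n_stages):
        # Stage 1: update phi on forward trajectories generated with theta (11a).
        traj_f = sample_forward()              # {X_t, Z_t, dW_t}_{t in [0, T]}
        buffer.append(traj_f)
        for _ in range(K):
            off = random.choice(buffer)        # off-policy samples from B
            phi_opt.zero_grad()
            loss_phi(on=traj_f, off=off).backward()
            phi_opt.step()
        # Stage 2: update theta on backward trajectories generated with phi (12a).
        traj_b = sample_backward()             # {X_s, Z_hat_s, dW_s}_{s in [0, T]}
        buffer.append(traj_b)
        for _ in range(K):
            off = random.choice(buffer)
            theta_opt.zero_grad()
            loss_theta(on=traj_b, off=off).backward()
            theta_opt.step()
```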
4 Experiment

Instantiation of MFGs. We validate our DeepGSB on two classes of MFGs: classical crowd navigation (d=2) and high-dimensional (d=1000) opinion depolarization. For crowd navigation, we consider three MFGs appearing in prior methods [14, 15], including (i) asymmetric obstacle avoidance, (ii) entropy interaction with a V-shaped bottleneck, and (iii) congestion interaction in an S-shaped tunnel. We refer to them respectively as GMM, V-neck, and S-tunnel. The obstacles and the initial/target Gaussian distributions (ρ0, ρtarget) are shown in Fig. 4. For opinion depolarization, we set ρ0 and ρtarget to two zero-mean Gaussians with different variances, representing the initially polarized and the desired moderated opinion distributions, respectively. Finally, we consider zero and constant base drifts f respectively for GMM and V-neck/S-tunnel, and adopt the polarized MF dynamics [4] for the opinion MFG; see Sec. 4.2 for a detailed discussion.

Figure 4: Crowd navigation MFGs.

Table 2: MF interactions for the three crowd navigation MFGs and the high-dimensional opinion MFG.

  MFG                MF interaction F
  GMM (d=2)          Fobstacle
  V-neck (d=2)       Fobstacle + Fentropy
  S-tunnel (d=2)     Fobstacle + Fcongestion
  Opinion (d=1000)   Fentropy

MF interactions F. We follow standard treatments from MFG theory [11] by noting that, given a functional F(ρ) that quantifies the MF cost w.r.t. the population ρ, e.g., Fentropy(ρ) := Eρ[log ρ] or Fcongestion(ρ) := E_{x,y∼ρ}[1/(‖x−y‖²+1)], one can derive its associated MF interaction function F(x, ρ) by taking the functional derivative, i.e., δF(ρ)/δρ(x) = F(x, ρ). Hence, the entropy and congestion MF interactions, together with the obstacle cost, follow (see Appendix A.4.3 for the derivation):

  Fentropy := log ρ(x, t) + 1,   Fcongestion := E_{y∼ρ}[2/(‖x−y‖²+1)],   Fobstacle := 1500·1obs(x),     (17)

where 1obs(·) is the (discontinuous) indicator of the problem-dependent obstacle set. We summarize the MF interactions in Table 2.
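Because the targets in (16) only require evaluating F on samples, the interactions in (17) can be implemented directly on particle batches. The sketch below is illustrative: log ρ is assumed to be supplied by some estimator (e.g., Yθ + Ŷφ or a kernel density estimate), and in_obstacle is a hypothetical indicator of the obstacle set.

```python
import torch

def F_entropy(x, log_rho):
    """Entropy interaction in (17): log rho(x, t) + 1, given an estimate of log rho."""
    return log_rho(x) + 1.0

def F_congestion(x, population):
    """Congestion interaction in (17): E_{y~rho}[2 / (||x - y||^2 + 1)], via samples."""
    d2 = torch.cdist(x, population) ** 2        # (N, M) pairwise squared distances
    return (2.0 / (d2 + 1.0)).mean(dim=1)

def F_obstacle(x, in_obstacle, weight=1500.0):
    """Obstacle cost in (17): a (discontinuous) indicator scaled by a large constant."""
    return weight * in_obstacle(x).float()
```

Note that F_obstacle is neither continuous nor differentiable in x; this is precisely the setting that the TD regression in (16) tolerates.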
Architecture & hyperparameters. We parametrize the functions with fully-connected DNNs for crowd navigation and with deep residual networks for the high-dimensional opinion MFG. All networks adopt sinusoidal time embeddings and are trained with AdamW [56]. All SDEs in (11, 12) are solved with the Euler-Maruyama method. Due to space constraints, we focus mostly on the results of the actor-critic parametrization, DeepGSB-ac, and leave the discussion of the critic parametrization, DeepGSB-c, along with additional experimental details, to Appendix A.5.

4.1 Two-dimensional crowd navigation

Figure 5 shows the simulation results of our DeepGSB-ac on the three crowd navigation MFGs. We also report existing numerical methods [14-16] that are best-tuned on each MFG (see Appendix A.5.1 for details), but note that in practice they either require softening F to be differentiable [14, 15] in order to yield reasonable results, or discretize the state space [16], which can lead to prohibitive complexity.⁶ We first compare to Chen [16] on GMM (see Fig. 5a), as their method only applies to non-MF interactions, i.e., F := F(x). While DeepGSB-ac guides the population to smoothly avoid all obstacles (notice the sharp contours of Y around them), [16] struggles to escape due to the discretization of the state space (and hence of the policy). To better examine the effect of ρ in F(x, ρ), we next simulate the dynamics on V-neck (see Fig. 5b) with and without the MF interaction. It is clear that our DeepGSB-ac encourages the population to spread out once the entropy interaction Fentropy is enabled, whereas a similar effect is barely observed for [14]. We observed difficulties in balancing the MF interaction F against the terminal penalty DKL(ρT‖ρtarget) for [14], yet this problem is alleviated in DeepGSB by construction. Lastly, we validate the robustness of our method on S-tunnel (see Fig. 5c) w.r.t. varying diffusions σ ∈ {0.5, 1, 2}. Again, our DeepGSB-ac reaches the same ρtarget despite being subject to different levels of stochasticity. This is in contrast to [15], which, due to discarding the SDE dynamics, necessitates solving the PDEs over the entire state space and can be sensitive to hyperparameters. In short, our DeepGSB outperforms prior methods [14-16] by better respecting obstacles and MF interactions without losing convergence to ρtarget, and its performance remains robust across different MFGs.

⁶As stated in [16], the complexity scales quadratically w.r.t. the number of discretized grid points.

Figure 5: Simulation of the three crowd navigation MFGs, including (a) GMM, (b) V-neck, and (c) S-tunnel, from t = 0 to T. The first five columns show the population snapshots, each with a different color, guided by our DeepGSB-ac, whereas the sixth (rightmost) column overlays the same population snapshots generated by existing methods: Chen [16] (which needs a discrete state space) for (a), Ruthotto et al. [14] (which needs a differentiable Fobs) for (b), and Lin et al. [15] (which needs a differentiable Fobs) for (c). The time-varying contours represent Yθ ≈ log Ψ, whose gradient relates to the policy via Z = σ∇Y. This figure is best viewed in color.

4.2 High-dimensional opinion depolarization

Next, we showcase our DeepGSB in solving high-dimensional MFGs in the application of opinion dynamics [3-5], where each agent now possesses a d-dimensional opinion x ∈ R^d that evolves through interactions with the population. In light of increasing recent attention, we consider a particular class of opinion dynamics known to yield strong polarization [4], i.e., the agents' opinions tend to partition into groups holding diametric views. Take the party model [4] for instance: given random information ξ ∈ R^d sampled from some distribution independent of ρ, each agent updates its opinion following a normalized polarize dynamic f̄polarize = fpolarize/‖fpolarize‖^{1/2}, where

  fpolarize(x, ρ; ξ) := E_{y∼ρ}[a(x, y; ξ) ȳ],   a(x, y; ξ) := { 1 if sign(⟨x, ξ⟩) = sign(⟨y, ξ⟩);  −1 otherwise },     (18)

and ȳ = y/‖y‖^{1/2}. The agreement function a(x, y; ξ) indicates whether the two opinions x and y agree on the information ξ. Intuitively, the dynamics in (18) suggest that agents tend to be receptive to opinions they agree with and antagonistic toward opinions they disagree with. As shown in Fig. 6a, this behavioral assumption, also known as biased assimilation [57, 58], can easily lead to polarization.

Figure 6: (a) Visualization of the polarized dynamics fpolarize in 2- and 1000-dimensional opinion spaces, where the directional similarity [3] counts the histogram of cosine angles between pairwise opinions at the terminal distribution ρT. (b) DeepGSB-ac guides ρT toward moderated distributions, hence depolarizing the opinion dynamics. We use the first two principal components to visualize d=1000.
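For reference, the polarize drift in (18) admits a direct Monte-Carlo implementation over a batch of opinions; the sketch below is vectorized over the population, with details that (18) does not specify (e.g., the small eps guarding the square-root normalization) chosen for illustration only.

```python
import torch

def f_polarize(x, population, xi, eps=1e-8):
    """Normalized polarize dynamics (18) for a batch of opinions x: (N, d).

    population: (M, d) samples approximating rho;  xi: (d,) random information.
    """
    y_bar = population / (population.norm(dim=-1, keepdim=True).sqrt() + eps)
    agree = torch.sign(x @ xi).unsqueeze(1) * torch.sign(population @ xi).unsqueeze(0)
    a = torch.where(agree > 0, torch.ones_like(agree), -torch.ones_like(agree))
    f = (a.unsqueeze(-1) * y_bar.unsqueeze(0)).mean(1)   # E_{y~rho}[a(x, y; xi) * y_bar]
    return f / (f.norm(dim=-1, keepdim=True).sqrt() + eps)
```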
We can apply our MFG framework (7) to this polarized base drift (18): starting from some weakly polarized ρ0, we seek a policy that compensates for the polarization tendency and helps guide the opinions toward a moderated distribution ρtarget (assumed to be Gaussian for simplicity). We consider the entropy MF interaction Fentropy, as it encourages opinion diversity before reaching consensus. As shown in Fig. 6b, in both the lower-dimensional (d=2) and higher-dimensional (d=1000) settings, our DeepGSB-ac successfully guides the opinions toward the desired distribution centered symmetrically at 0 ∈ R^d, thereby mitigating the polarization. Results of DeepGSB-c remain similar, despite being more sensitive to hyperparameters; see Appendix A.5.2. We highlight these state-of-the-art results on a challenging class of MFGs that, compared to existing methods [14, 15], considers a more difficult mean-field dynamic (f(x, ρ) vs. f(x)) in an order-of-magnitude higher dimension (d=1000 vs. d=100).

4.3 Discussion

Table 3: Comparison of DeepGSB-ac vs. DeepGSB-c w.r.t. the Wasserstein distance (W2) to ρtarget and the FBSDE violation, in terms of the TD errors LTD(φ), LTD(θ) and the nonlinear FK residual LFK(θ), averaged over 3 runs (mean ± std).

  MFG        DeepGSB   W2          LTD(φ)       LTD(θ)       LFK(θ)
  GMM        -ac       .27 ± .16    9.5 ± 2.5    7.1 ± 0.6   5.2 ± 1.1
             -c        .61 ± .91    7.0 ± 1.3   10.1 ± 1.6   0.0 ± 0.0
  V-neck     -ac       .00 ± .00    4.9 ± 1.5    4.1 ± 0.5   0.6 ± 0.2
             -c        .01 ± .00    8.2 ± 0.8    8.7 ± 1.6   0.0 ± 0.0
  S-tunnel   -ac       .01 ± .00   25.5 ± 2.3   28.6 ± 3.6   2.1 ± 0.1
             -c        .03 ± .01   30.9 ± 6.9   26.4 ± 5.5   0.0 ± 0.0

DeepGSB-ac vs. DeepGSB-c. Table 3 compares the actor-critic and critic parametrizations on the crowd navigation MFGs. While DeepGSB-ac typically achieves lower Wasserstein and TD errors, it seldom closes the consistency gap LFK(θ), as opposed to DeepGSB-c. In practice, the results of DeepGSB-c are visually indistinguishable from those of DeepGSB-ac, despite the different contours of Y; see Appendix A.5.2 for more discussion.

DeepGSB works with intractable ρtarget. While the availability of the target density ρtarget is a common assumption adopted in prior works [14, 15], in which ρtarget is involved in computing the boundary loss, in most real-world applications ρtarget is seldom available. Here, we show that DeepGSB works well without knowing ρtarget (and ρ0), so long as we can sample X0 ∼ ρ0 and X̄0 ∼ ρtarget. This is similar to the setup of generative modeling [27]. In Fig. 9 (see Appendix A.5.2), we show that DeepGSB trained without the initial and terminal densities converges equally well. Crucially, this is because DeepGSB relies on a variety of other mechanisms (e.g., self-consistency in the single-step TD objectives and KL matching in the IPF objective) to generate equally informative gradients. This is in contrast to [14, 15], where the training signals are mostly obtained by differentiating through DKL(ρ‖ρtarget); consequently, those methods fail to converge in the absence of ρtarget.

5 Conclusion, Limitations, and Broader Impact

We present DeepGSB, a new numerical method for solving a challenging class of MFGs with distributional boundary constraints. By generalizing prior FBSDE theory for the Schrödinger Bridge to accept mean-field interactions, we show that practical training can be achieved via an intriguing algorithmic connection to Deep RL. Our DeepGSB outperforms prior methods on crowd navigation MFGs and sets a new state-of-the-art record in depolarizing 1000-dimensional opinion MFGs. DeepGSB is mainly developed for MFGs in unconstrained state spaces such as R^d.
Yet, it may be necessary to adopt domain-specific structures, e.g., constrained state spaces. Additionally, the divergence in the IPF objectives may scale unfavorably as the dimension grows. This may be mitigated by adopting a simpler regression from De Bortoli et al. [24]. Study of Mean-Field Games (MFGs) possesses its own societal influence. Thus, as a MFG solver, Deep GSB may pose a potential impact in offering solutions to previously unsolvable MFGs under more practical settings. Acknowledgments and Disclosure of Funding The authors would like to thank Yu-ting Chiang, Augustinos and Molei for their helpful supports. The authors would also like to thank the anonymous Reviewer A4H3 for his/her initially harsh yet constructive comments on Open Review, which led to substantial improvements of the theoretical results. This research was supported by Do D Basic Research Office Award HQ00342110002. [1] Yves Achdou, Francisco J Buera, Jean-Michel Lasry, Pierre-Louis Lions, and Benjamin Moll. Partial differential equation models in macroeconomics. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 372(2028):20130397, 2014. [2] Yves Achdou, Jiequn Han, Jean-Michel Lasry, Pierre-Louis Lions, and Benjamin Moll. Income and wealth distribution in macroeconomics: A continuous-time approach. The review of economic studies, 89(1):45 86, 2022. [3] Simon Schweighofer, David Garcia, and Frank Schweitzer. An agent-based model of multidimensional opinion dynamics and opinion alignment. Chaos: An Interdisciplinary Journal of Nonlinear Science, 30(9):093139, 2020. [4] Jason Gaitonde, Jon Kleinberg, and Éva Tardos. Polarization in geometric opinion dynamics. In Proceedings of the 22nd ACM Conference on Economics and Computation, pages 499 519, 2021. [5] Jan H azła, Yan Jin, Elchanan Mossel, and Govind Ramnarayan. A geometric model of opinion polarization. ar Xiv preprint ar Xiv:1910.05274, 2019. [6] Zhiyu Liu, Bo Wu, and Hai Lin. A mean field game approach to swarming robots control. In 2018 Annual American Control Conference (ACC), pages 4293 4298. IEEE, 2018. [7] Karthik Elamvazhuthi and Spring Berman. Mean-field models in swarm robotics: A survey. Bioinspiration & Biomimetics, 15(1):015001, 2019. [8] Yiping Lu, Chao Ma, Yulong Lu, Jianfeng Lu, and Lexing Ying. A mean-field analysis of deep resnet and beyond: Towards provable optimization via overparameterization from depth. ar Xiv preprint ar Xiv:2003.05508, 2020. [9] Kaitong Hu, Anna Kazeykina, and Zhenjie Ren. Mean-field langevin system, optimal control and deep neural networks. ar Xiv preprint ar Xiv:1909.07278, 2019. [10] E Weinan, Jiequn Han, and Qianxiao Li. A mean-field optimal control formulation of deep learning. ar Xiv preprint ar Xiv:1807.01083, 2018. [11] Jean-Michel Lasry and Pierre-Louis Lions. Mean field games. Japanese journal of mathematics, 2(1):229 260, 2007. [12] Olivier Guéant, Jean-Michel Lasry, and Pierre-Louis Lions. Mean field games and applications. In Paris-Princeton lectures on mathematical finance 2010, pages 205 266. Springer, 2011. [13] Alain Bensoussan, Jens Frehse, Phillip Yam, et al. Mean field games and mean field type control theory, volume 101. Springer, 2013. [14] Lars Ruthotto, Stanley J Osher, Wuchen Li, Levon Nurbekyan, and Samy Wu Fung. A machine learning framework for solving high-dimensional mean field game and mean field control problems. Proceedings of the National Academy of Sciences, 117(17):9183 9193, 2020. 
[15] Alex Tong Lin, Samy Wu Fung, Wuchen Li, Levon Nurbekyan, and Stanley J Osher. Alternating the population and control neural networks to solve high-dimensional stochastic mean-field games. Proceedings of the National Academy of Sciences, 118(31), 2021. [16] Yongxin Chen. Density control of interacting agent systems. ar Xiv preprint ar Xiv:2108.07342, 2021. [17] Jiequn Han, Arnulf Jentzen, and E Weinan. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34): 8505 8510, 2018. [18] Ioannis Exarchos and Evangelos A Theodorou. Stochastic optimal control via forward and backward stochastic differential equations and importance sampling. Automatica, 87:159 165, 2018. [19] Marcus Pereira, Ziyi Wang, Ioannis Exarchos, and Evangelos A Theodorou. Neural network architectures for stochastic control using the nonlinear feynman-kac lemma. ar Xiv preprint ar Xiv:1902.03986, 2019. [20] René Carmona, François Delarue, and Aimé Lachapelle. Control of mckean vlasov dynamics versus mean field games. Mathematics and Financial Economics, 7(2):131 166, 2013. [21] René Carmona and François Delarue. Mean field forward-backward stochastic differential equations. Electronic Communications in Probability, 18:1 15, 2013. [22] René Carmona and Mathieu Laurière. Convergence analysis of machine learning algorithms for the numerical solution of mean field control and games: Ii the finite horizon case. ar Xiv preprint ar Xiv:1908.01613, 2019. [23] René Carmona and Mathieu Laurière. Convergence analysis of machine learning algorithms for the numerical solution of mean field control and games i: the ergodic case. SIAM Journal on Numerical Analysis, 59(3):1455 1485, 2021. [24] Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion schrödinger bridge with applications to score-based generative modeling. ar Xiv preprint ar Xiv:2106.01357, 2021. [25] Francisco Vargas, Pierre Thodoroff, Neil D Lawrence, and Austen Lamacraft. Solving schrödinger bridges via maximum likelihood. ar Xiv preprint ar Xiv:2106.02081, 2021. [26] Gefei Wang, Yuling Jiao, Qian Xu, Yang Wang, and Can Yang. Deep generative learning via schrödinger bridge. ar Xiv preprint ar Xiv:2106.10410, 2021. [27] Tianrong Chen, Guan-Horng Liu, and Evangelos A Theodorou. Likelihood training of schrödinger bridge using forward-backward sdes theory. ar Xiv preprint ar Xiv:2110.11291, 2021. [28] Charlotte Bunne, Ya-Ping Hsieh, Marco Cuturi, and Andreas Krause. Recovering stochastic dynamics via gaussian schr\" odinger bridges. ar Xiv preprint ar Xiv:2202.05722, 2022. [29] Erwin Schrödinger. Sur la théorie relativiste de l électron et l interprétation de la mécanique quantique. In Annales de l institut Henri Poincaré, volume 2, pages 269 310, 1932. [30] Kenneth Caluya and Abhishek Halder. Wasserstein proximal algorithms for the schrödinger bridge problem: Density control with nonlinear drift. IEEE Transactions on Automatic Control, 2021. [31] Julio Backhoff, Giovanni Conforti, Ivan Gentil, and Christian Léonard. The mean field schrödinger problem: ergodic behavior, entropy estimates and functional inequalities. Probability Theory and Related Fields, 178(1):475 530, 2020. [32] Yongxin Chen, Tryphon Georgiou, and Michele Pavon. Optimal steering of inertial particles diffusing anisotropically with losses. In 2015 American Control Conference (ACC), pages 1252 1257. IEEE, 2015. [33] Erwin Schrödinger. Über die umkehrung der naturgesetze. 
Verlag der Akademie der Wissenschaften in Kommission bei Walter De Gruyter u ..., 1931. [34] Christian Léonard. From the schrödinger problem to the monge kantorovich problem. Journal of Functional Analysis, 262(4):1879 1920, 2012. [35] Christian Léonard. A survey of the schr\" odinger problem and some of its connections with optimal transport. ar Xiv preprint ar Xiv:1308.0215, 2013. [36] Michele Pavon and Anton Wakolbinger. On free energy, stochastic control, and schrödinger processes. In Modeling, Estimation and Control of Systems with Uncertainty, pages 334 348. Springer, 1991. [37] Paolo Dai Pra. A stochastic control approach to reciprocal diffusion processes. Applied mathematics and Optimization, 23(1):313 329, 1991. [38] Francisco Vargas. Machine-learning approaches for the empirical schrödinger bridge problem. Technical report, University of Cambridge, Computer Laboratory, 2021. [39] Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. Iterative bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111 A1138, 2015. [40] Solomon Kullback. Probability densities with given marginals. The Annals of Mathematical Statistics, 39(4):1236 1243, 1968. [41] Qinsheng Zhang and Yongxin Chen. Path integral sampler: a stochastic control approach for sampling. ar Xiv preprint ar Xiv:2111.15141, 2021. [42] Eberhard Hopf. The partial differential equation ut+ uux= µxx. Communications on Pure and Applied mathematics, 3(3):201 230, 1950. [43] Julian D Cole. On a quasi-linear parabolic equation occurring in aerodynamics. Quarterly of applied mathematics, 9(3):225 236, 1951. [44] Jiongmin Yong and Xun Yu Zhou. Stochastic controls: Hamiltonian systems and HJB equations, volume 43. Springer Science & Business Media, 1999. [45] Magdalena Kobylanski. Backward stochastic differential equations and partial differential equations with quadratic growth. Annals of probability, pages 558 602, 2000. [46] Kiyosi Itô. On stochastic differential equations, volume 4. American Mathematical Soc., 1951. [47] Richard Bellman. The theory of dynamic programming. Technical report, Rand corp santa monica ca, 1954. [48] Emanuel Todorov. Efficient computation of optimal actions. Proceedings of the national academy of sciences, 106(28):11478 11483, 2009. [49] Michael Lutter, Shie Mannor, Jan Peters, Dieter Fox, and Animesh Garg. Value iteration in continuous actions, states and time. ar Xiv preprint ar Xiv:2105.04682, 2021. [50] Lingheng Meng, Rob Gorbet, and Dana Kuli c. The effect of multi-step methods on overestimation in deep reinforcement learning. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 347 353. IEEE, 2021. [51] Harm van Seijen. Effective multi-step temporal-difference learning for non-linear function approximation. ar Xiv preprint ar Xiv:1608.05151, 2016. [52] Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-second AAAI conference on artificial intelligence, 2018. [53] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. ar Xiv preprint ar Xiv:1312.5602, 2013. [54] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 
Continuous control with deep reinforcement learning. ar Xiv preprint ar Xiv:1509.02971, 2015. [55] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889 1897, 2015. [56] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. ar Xiv preprint ar Xiv:1711.05101, 2017. [57] Charles G Lord, Lee Ross, and Mark R Lepper. Biased assimilation and attitude polarization: The effects of prior theories on subsequently considered evidence. Journal of personality and social psychology, 37(11):2098, 1979. [58] Pranav Dandekar, Ashish Goel, and David T Lee. Biased assimilation, homophily, and the dynamics of polarization. Proceedings of the National Academy of Sciences, 110(15):5791 5796, 2013. [59] Etienne Pardoux and Shige Peng. Backward stochastic differential equations and quasilinear parabolic partial differential equations. In Stochastic partial differential equations and their applications, pages 200 217. Springer, 1992. [60] Balint Negyesi, Kristoffer Andersson, and Cornelis W Oosterlee. The one step malliavin scheme: new discretization of bsdes implemented with deep learning regressions. ar Xiv preprint ar Xiv:2110.05421, 2021. [61] Qianxiao Li and Shuji Hao. An optimal control approach to deep learning and applications to discrete-weight neural networks. In International Conference on Machine Learning, pages 2985 2994. PMLR, 2018. [62] Guan-Horng Liu, Tianrong Chen, and Evangelos A Theodorou. Ddpnopt: Differential dynamic programming neural optimizer. In International Conference on Learning Representations, 2021. [63] Guan-Horng Liu, Tianrong Chen, and Evangelos A Theodorou. Second-order neural ode optimizer. In Advances in Neural Information Processing Systems, 2021. [64] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. ar Xiv preprint ar Xiv:2011.13456, 2020. [65] Edward Nelson. Dynamical theories of Brownian motion, volume 106. Princeton university press, 2020. [66] Brian DO Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313 326, 1982. [67] Grigorios A Pavliotis. Stochastic processes and applications: diffusion processes, the Fokker Planck and Langevin equations, volume 60. Springer, 2014. [68] Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. ar Xiv e-prints, pages ar Xiv 2101, 2021. [69] Chin-Wei Huang, Jae Hyun Lim, and Aaron Courville. A variational perspective on diffusionbased generative models and score matching. ar Xiv preprint ar Xiv:2106.02808, 2021. [70] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3 11, 2018. [71] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary De Vito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017. 1. For all authors... (a) Do the main claims made in the abstract and introduction accurately reflect the paper s contributions and scope? [Yes] (b) Did you describe the limitations of your work? [Yes] See Sec.5. (c) Did you discuss any potential negative societal impacts of your work? [Yes] See Sec.5. 
(d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] 2. If you are including theoretical results... (a) Did you state the full set of assumptions of all theoretical results? [Yes] (b) Did you include complete proofs of all theoretical results? [Yes] See Appendix A.3. 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We provide the pseudocode in Alg. 1 and detail the hyperparameter in Table 6. (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Appendix A.5.1. (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See Table 3. (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix A.5.1. 4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets... (a) If your work uses existing assets, did you cite the creators? [Yes] (b) Did you mention the license of the assets? [Yes] See Appendix A.5.1. (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] (d) Did you discuss whether and how consent was obtained from people whose data you re using/curating? [N/A] Not applicable. (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A] Not applicable. 5. If you used crowdsourcing or conducted research with human subjects... (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] Not applicable. (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] Not applicable. (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A] Not applicable.