# Large-Scale Multi-Agent Deep FBSDEs

**Tianrong Chen¹, Ziyi Wang², Ioannis Exarchos³, Evangelos A. Theodorou²⁴**

¹School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, USA. ²Center for Machine Learning, Georgia Institute of Technology, Atlanta, USA. ³Department of Computer Science, Stanford University, Stanford, USA. ⁴School of Aerospace Engineering, Georgia Institute of Technology, Atlanta, USA. Correspondence to: Tianrong Chen.

*Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).*

**Abstract.** In this paper we present a scalable deep learning framework for finding Markovian Nash equilibria in multi-agent stochastic games using fictitious play. The approach is motivated by a theoretical analysis of Forward-Backward Stochastic Differential Equations (FBSDEs) and their implementation in a deep learning setting, which is the source of our algorithm's improved sample efficiency. By taking advantage of the permutation-invariance of agents in symmetric games, scalability and performance are further enhanced significantly. We showcase superior performance of our framework over the state-of-the-art deep fictitious play algorithm on an inter-bank lending/borrowing problem across multiple metrics. More importantly, our approach scales up to 3,000 agents in simulation, a scale which, to the best of our knowledge, represents a new state of the art. We also demonstrate the applicability of our framework in robotics on a belief-space autonomous racing problem.

## 1. Introduction

Stochastic Differential Games (SDGs) provide a framework for investigating scenarios in which multiple players make decisions in a stochastic environment. The theory of differential games dates back to the seminal work of Isaacs (1965) on two-player zero-sum dynamic games, with the stochastic extension first appearing in Kushner & Chamberlain (1969). A key step in the study of games is to obtain the Nash equilibrium among players (Osborne & Rubinstein, 1994). A Nash equilibrium represents the solution of a non-cooperative game involving two or more players: at the equilibrium, no player can gain any benefit by modifying their own strategy while the opponents' strategies remain fixed. In the context of adversarial multi-objective games, the Nash equilibrium can be characterized by a system of coupled Hamilton-Jacobi-Bellman (HJB) equations when the system satisfies the Markovian property. Analytical solutions exist only for a few special cases. Therefore, the Nash equilibrium is usually obtained numerically, and this can become challenging as the number of states/agents increases. Despite extensive theoretical work (Buckdahn & Li, 2008; Ramachandran & Tsokos, 2012), the algorithmic side has received less attention and is limited to special cases of differential games (e.g., Duncan & Pasik-Duncan (2015)), or suffers from the curse of dimensionality (Kushner, 2002). Nevertheless, SDGs have a variety of applications in robotics, autonomy, economics, and management. Relevant examples include Mataramvura & Øksendal (2008), which formulates portfolio management as an SDG in order to obtain a market portfolio that minimizes the convex risk measure of a terminal wealth index value, as well as Prasad & Sethi (2004), which investigates optimal advertising spending in duopolistic settings via SDGs.
Researchers have also approached differential games via Reinforcement Learning (RL) (Harmon et al., 1995). Kamalapurkar et al. (2016) combines differential game theory and RL to solve cooperative control problems in which agents are coupled through graphs. Multi-agent RL (MARL) extends RL and is a more complex task due to the internal interactions between agents and their external interactions with the environment. Independent learning (Tan, 1993) assumes the other agents are part of the environment, and suffers from unstable learning. Training agents by augmenting their states and actions is known as centralized training and execution (Yang et al., 2019), whose scalability is limited. Another approach is centralized training with decentralized execution (CTDE); the challenge there lies in how to decompose the value function in the execution phase for value-based MARL. Sunehag et al. (2018); Zhou et al. (2019); Rashid et al. (2018); Son et al. (2019); Mahajan et al. (2019) achieve remarkable success in this vein. However, RL usually adopts a model-free learning scheme in discrete time. Here we leverage the HJB PDE, which contains information about the dynamics, to develop a data-efficient algorithm in continuous time. The decomposition of the value function is characterized by the decoupled structure of the HJB PDEs obtained by adapting the nonlinear Feynman-Kac lemma (Karatzas & Shreve, 1991) to the multi-agent setting.

Another vein of related research marries the mathematical formulation of a differential game with nonlinear PDEs. This motivates algorithmic developments for differential games that combine elements of PDE theory with deep learning. Recent encouraging results in solving nonlinear PDEs with deep learning (Han et al., 2018; Raissi, 2018) illustrate the scalability and numerical efficiency of neural networks. The transition from a PDE formulation to a trainable neural network is done via a system of Forward-Backward Stochastic Differential Equations (FBSDEs). Specifically, certain PDE solutions are linked to solutions of FBSDEs, and the latter can be approximated using a suitably defined neural network architecture; this is known as the Deep FBSDE approach. Han et al. (2018); Pereira et al. (2019); Wang et al. (2019b) use various deep network architectures to solve such stochastic systems, but these algorithms address single-agent dynamical systems. Two-player zero-sum games using FBSDEs were first developed in Exarchos et al. (2019) and transferred to a deep learning setting in Wang et al. (2019a). Recently, Hu (2019) brought deep learning into fictitious play to solve multi-agent non-zero-sum games, Han & Hu (2019) introduced Deep FBSDEs to the multi-agent scenario together with the concept of fictitious play, and Han & Long (2020) provided a convergence analysis of that framework. Our work focuses on stochastic differential games with homogeneous agents, which can also be viewed as mean field games with a finite number of agents. Under mild assumptions, the approximation error of the mean field game is of order $N^{-1/(d+4)}$ (Carmona & Delarue, 2013), where $N$ is the number of agents and $d$ is the state dimension of an individual agent. Our algorithm can mitigate this approximation error when the number of agents is moderate. In this work, we first extend the theoretical analysis of Han & Long (2020) to the scenario where the forward process includes a drift term.
Guided by this theoretical result, we integrate importance sampling (Bender & Moseler, 2010) into the existing framework. Furthermore, by leveraging the symmetry of the game, we introduce an invariant feature representation that reduces complexity and significantly increases the number of agents the framework can handle. The main contributions of our work are threefold:

1. We theoretically analyze the FBSDE with a drift term under the fictitious play setup, and explain the efficiency of importance sampling both analytically and numerically.
2. We introduce an efficient Deep FBSDE framework for solving stochastic multi-agent games via fictitious play that outperforms the current state of the art in performance and runtime/memory efficiency on an inter-bank lending/borrowing example. We demonstrate that our approach scales to a much larger number of agents (up to 3,000, compared to 100 in existing works), which to the best of our knowledge represents a new state of the art.
3. We showcase the applicability of our framework to robotics on a belief-space autonomous racing problem with larger individual control and state spaces. The experiments demonstrate that the BSDEs decouple across agents, enabling applications to competitive scenarios, and that the model can extract informative features from partially observed input data.

The rest of the paper is organized as follows: in Section 2 we present the notation and mathematical preliminaries. In Section 3 we introduce the Scaled Deep Fictitious Play FBSDE (SDFP-FBSDE) algorithm, with simulation results following in Section 4. We conclude the paper in Section 5.

## 2. Notation and Preliminaries

Throughout the paper the notation is as follows:

1. Bold letters denote quantities for all players.¹
2. A (sup/sub)script $i$ indicates quantities for player $i$.
3. Bold letters with (sup/sub)script $-i$ indicate quantities for all players except $i$.
4. Lower-case letters denote realizations of the corresponding upper-case random variables.

Table 1 contains examples of this notation. The baseline used in this paper refers to Han & Hu (2019).

Table 1. Mathematical notation.

| Characters | Definitions |
| --- | --- |
| $\mathbf{X}$ | quantities for all agents |
| $X_i$ / $X^i$ | quantities for the $i$-th agent |
| $\mathbf{X}_{-i}$ | quantities for all agents except the $i$-th |
| $\mathbf{x}$ | realization of $\mathbf{X}$ |

¹ "Agent" and "player" are used interchangeably in this paper.

### 2.1. HJB PDE and FBSDE

Fictitious play is a learning rule (Brown, 1951) in which each player presumes the other players' strategies to be fixed. An $N$-player game can then be decoupled into $N$ individual decision-making problems which are solved iteratively over $M$ stages. When each agent converges to a stationary strategy at stage $m-1$, this strategy becomes the fixed strategy of the other players at stage $m$. We consider an $N$-player non-cooperative SDG with dynamics

$$\mathrm{d}\mathbf{X}_t = \big(f(\mathbf{X}_t, t) + G(\mathbf{X}_t, t)\,\mathbf{U}(\mathbf{X}_t)\big)\,\mathrm{d}t + \Sigma(\mathbf{X}_t, t)\,\mathrm{d}\mathbf{W}_t, \qquad \mathbf{X}_{t_0} = \mathbf{x}_{t_0}, \tag{1}$$

where $\mathbf{X} = (X_1, X_2, \ldots, X_N)$ is the vector of state processes of all agents generated by their controls $\mathbf{U} = (U_1, U_2, \ldots, U_N)$, defined on the spaces $\mathcal{X}$ and $\mathcal{U}$ with $X_i \in \mathcal{X}_i \subseteq \mathbb{R}^{n_x}$ and $U_i \in \mathcal{U}_i \subseteq \mathbb{R}^{n_u}$. Here $f : \mathcal{X} \times [t_0, T] \to \mathcal{X}$ represents the drift dynamics, $G : \mathcal{X} \times [t_0, T] \to \mathcal{X} \times \mathcal{U}$ represents the actuator dynamics, and $\Sigma : \mathcal{X} \times [t_0, T] \to \mathcal{X} \times \mathbb{R}^{n_w}$ represents the diffusion term. The state process is also driven by $n_w$-dimensional independent Brownian motions $W_i$, collectively denoted $\mathbf{W} = (W_1, W_2, \ldots, W_N)$.
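To make the continuous-time setup concrete, the sketch below shows one way the joint controlled dynamics (1) could be discretized with an Euler-Maruyama step. It is an illustration only; the function names and shapes (`drift_f`, `actuation_G`, `diffusion_Sigma`) are placeholders, not the paper's implementation.

```python
import torch

def euler_maruyama_step(x, u, t, dt, drift_f, actuation_G, diffusion_Sigma):
    """One Euler-Maruyama step of the joint controlled dynamics (1).

    Shapes (batch-first, placeholder convention):
      x: (B, Nx)  joint state of all agents     u: (B, Nu)  joint control
      drift_f(x, t)         -> (B, Nx)
      actuation_G(x, t)     -> (B, Nx, Nu)
      diffusion_Sigma(x, t) -> (B, Nx, Nw)
    """
    sigma = diffusion_Sigma(x, t)
    dw = torch.randn(x.shape[0], sigma.shape[-1], device=x.device) * dt ** 0.5
    drift = drift_f(x, t) + torch.einsum("bij,bj->bi", actuation_G(x, t), u)
    return x + drift * dt + torch.einsum("bij,bj->bi", sigma, dw)
```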
Given the other agents' strategies, the stochastic optimal control problem for agent $i$ under the fictitious play assumption is defined as minimizing the expected cumulative cost functional $J^i_t$,

$$J^i_t(\mathbf{X}, U_{i,m}; \mathbf{U}_{-i,m-1}) = \mathbb{E}\left[ g(\mathbf{X}_T) + \int_t^T C_i\big(\mathbf{X}_\tau, U_{i,m}(\mathbf{X}_\tau); \mathbf{U}_{-i,m-1}\big)\,\mathrm{d}\tau \right], \tag{2}$$

where $g : \mathcal{X} \to \mathbb{R}^+$ is the terminal cost and $C_i : \mathcal{X} \times \mathcal{U} \to \mathbb{R}^+$ is the running cost of the $i$-th player. In this paper we assume the running cost has the form $C_i(\mathbf{X}, U_i) = q_i(\mathbf{X}) + \tfrac{1}{2} U_i^{\mathrm{T}} R U_i + \mathbf{X}^{\mathrm{T}} Q U_i$. The double subscript $U_{i,m}$ denotes the control of agent $i$ at stage $m$. The value function of each player is

$$V^i_t(\mathbf{X}_t) = \inf_{U_{i,m} \in \mathcal{U}_i} J^i_t(\mathbf{X}, U_{i,m}; \mathbf{U}_{-i,m-1}), \qquad V^i_T(\mathbf{X}_T) = g(\mathbf{X}_T). \tag{3}$$

Assume that $V^i_t$ in eq. (3) is once differentiable w.r.t. $t$ and twice differentiable w.r.t. $\mathbf{X}$ (we drop the functional dependencies for brevity). Standard stochastic optimal control theory then leads to the HJB PDE

$$V^i_t + \inf_{U_i \in \mathcal{U}_i}\left[ V^{i\,\mathrm{T}}_{\mathbf{x}}\big(f + G\mathbf{U}\big) + C_i \right] + \tfrac{1}{2}\operatorname{tr}\!\big(V^i_{\mathbf{xx}} \Sigma \Sigma^{\mathrm{T}}\big) = 0. \tag{4}$$

The function inside the infimum is known as the Hamiltonian $H(t, \mathbf{X}, \mathbf{U}, Z_i)$, where $Z_i$ is the adjoint state, defined in the literature as $Z_i = \Sigma^{\mathrm{T}} V^i_{\mathbf{x}}$ (Yong & Zhou, 1999). The optimal control is $U^*_i = -R^{-1}\big(G_i^{\mathrm{T}} V^i_{\mathbf{x}} + Q_i^{\mathrm{T}} \mathbf{X}\big)$; plugging it back into eq. (4) gives

$$V^i_t + h_i + V^{i\,\mathrm{T}}_{\mathbf{x}}\big(f + G\mathbf{U}^{0,*}\big) + \tfrac{1}{2}\operatorname{tr}\!\big(V^i_{\mathbf{xx}} \Sigma \Sigma^{\mathrm{T}}\big) = 0, \tag{5}$$

where $h_i = C_i + V^{i\,\mathrm{T}}_{\mathbf{x}} G \mathbf{U}^{*,0}$. The first and second superscripts of $\mathbf{U}$ refer to the control taken by the $i$-th agent and by the remaining agents, respectively: the star denotes the optimal control $U^*_i$ above, and the zero denotes the zero control $U_i = 0$. For instance, $\mathbf{U}^{0,*}$ denotes the augmentation of $U_i = 0$ and $\mathbf{U}_{-i} = \mathbf{U}^*_{-i}$ as $(U^*_1, \ldots, U^*_{i-1}, 0, U^*_{i+1}, \ldots, U^*_N)$, and $\mathbf{U}^{*,0} = (0, \ldots, 0, U^*_i, 0, \ldots, 0)$. The derivation is included in Appendix A for completeness.

The value function $V^i_t$ in the HJB PDE can be related to a set of FBSDEs by the nonlinear Feynman-Kac lemma (Karatzas & Shreve, 1991):

$$\begin{aligned} \mathrm{d}\mathbf{X}_t &= \big(f + G\mathbf{U}^{0,*}\big)\mathrm{d}t + \Sigma\,\mathrm{d}\mathbf{W}_t, \qquad \mathbf{X}_{t_0} = \mathbf{x}_{t_0} &&\text{(FSDE)},\\ \mathrm{d}Y^i_t &= -h^i_t\,\mathrm{d}t + Z^{i\,\mathrm{T}}_t\,\mathrm{d}\mathbf{W}_t, \qquad Y^i_T = g(\mathbf{X}_T) &&\text{(BSDE)}, \end{aligned} \tag{6}$$

where the backward process corresponds to the value function and $t \in [t_0, T]$. The derivation can be found in Appendix B. We denote the value function $V$ by $Y$ to align with the classic FBSDE literature.

**Remark 1.** The problem of solving the HJB PDE (5) is transformed into solving the FBSDE system (6). The $i$-th player contributes zero control to the drift of the forward process, while the remaining agents execute the optimal policy according to their value functions.

The HJB PDE (5) could also be solved by numerical PDE solvers based on finite element and finite difference methods (Greif, 2017). However, these methods do not scale beyond very few dimensions, as they rely on discretization.

### 2.2. Deep Fictitious Play FBSDE Controller

Figure 1. FBSDE network for a single agent. The dashed arrows indicate hidden-state propagation when an LSTM is chosen as the backbone; they disappear when an FC backbone is chosen.

In Han & Hu (2019), the FBSDE system (6) is solved by a neural network in which each agent has a Deep Fictitious Play FBSDE controller (DFP-FBSDE) generating its policy. The network architecture is depicted in Fig. 1, where the backbone at each timestep is a distinct fully connected network followed by a Batch Norm module (Ioffe & Szegedy, 2015) (hereafter we shorthand this backbone as FC).
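To illustrate how such a network is typically wired, the sketch below rolls the discretized FSDE and BSDE of (6) forward in time, with a per-timestep module predicting $Z^i_t$ and a separate small network predicting the initial value $Y^i_{t_0}$. This is a simplified illustration, not the authors' code: the module names, layer sizes, and the `dynamics_step` / `running_cost_h` callables are assumptions.

```python
import torch
import torch.nn as nn

class ZNet(nn.Module):
    """Per-timestep module: features -> backbone -> Z (adjoint state) prediction."""
    def __init__(self, state_dim, z_dim, hidden=64):
        super().__init__()
        self.feature = nn.Identity()          # baseline choice: identity feature extractor
        self.backbone = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, z_dim))
    def forward(self, x):
        return self.backbone(self.feature(x))

def fbsde_rollout(x0, z_nets, y0_net, dynamics_step, running_cost_h, dt, num_steps):
    """Propagate the FSDE forward, and the BSDE forward in time from the predicted Y_0.

    dynamics_step(x, dt) -> (x_next, dw)  : one Euler step of the FSDE and its noise
    running_cost_h(x)    -> h_i(x)        : the h_i term in the BSDE drift
    """
    x = x0
    y = y0_net(x0)                            # predicted initial value Y_{t_0}
    for k in range(num_steps):
        z = z_nets[k](x)                      # Z prediction at this timestep
        x_next, dw = dynamics_step(x, dt)
        y = y - running_cost_h(x) * dt + (z * dw).sum(-1, keepdim=True)   # BSDE step
        x = x_next
    return x, y                               # terminal state and propagated value
```

The training loss would then compare the propagated value `y` with the terminal value $g(\mathbf{X}_T)$, as described next.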
Since Han & Hu (2019) do not apply feature extraction, their setup is equivalent to choosing the identity mapping as the feature extractor: $\hat{\mathbf{F}}_t = \mathbb{I}\,\mathbf{X}_t$. The output of the network is the $Z$ component in (6). Parameterizing $Z$ with a neural network allows the BSDE to be propagated forward (red path) from an initial condition predicted by another FC network, alongside the FSDE (green path). The training set of DFP-FBSDE consists of randomly selected initial states $\mathbf{x}_0$, and the training labels are the value functions at terminal time, $Y^i_T$ in (6), which depend on the random variable $\mathbf{X}_T$. The training loss (24) is the mean squared error between the predicted terminal value $\hat{Y}^i_T$ propagated through the BSDE and the true terminal value. The metrics used on training/test data are summarized in Appendix C. Collecting the FBSDEs of all agents yields the final DFP-FBSDE framework, whose architecture can be found in Appendix H.4. Furthermore, Pereira et al. (2019) extends Deep FBSDE to robotics applications, and Han & Hu (2019) applies Deep FBSDE to searching for Nash equilibria, with a convergence proof in Han & Long (2020).

## 3. SDFP-FBSDE

In this section we propose the novel SDFP-FBSDE algorithm. Motivated by the convergence analysis of DFP-FBSDE, we introduce an importance sampling term to facilitate exploration and accelerate convergence. Scalability is further increased by invariant layer embeddings, leading to an order of magnitude lower time and memory complexity.

### 3.1. Motivation of Importance Sampling (IS)

We first analyze the FBSDE (6) with a drift term, extending previous work (Han & Long, 2020) where the drift term is ignored. The analysis focuses on one representative agent, so the agent index is omitted in this subsection.

**Assumption 1.** There exist measurable functions $\phi : [0, T] \times \mathcal{X} \to \mathcal{X}$ and $\Gamma : [0, T] \times \mathcal{U} \to \mathcal{X}$ such that $\Sigma(t, \mathbf{X})\phi(t, \mathbf{X}) = f(t, \mathbf{X})$ and $\Sigma(t, \mathbf{X})\Gamma(t, \mathbf{X}) = G(t, \mathbf{X})$.

**Assumption 2.** Consider a general FBSDE system

$$\begin{aligned} X^{t,x}_T &= x + \int_t^T \mu_s\,\mathrm{d}s + \int_t^T \Sigma_s\,\mathrm{d}W_s &&\text{(FSDE)},\\ Y^{t,x}_t &= g\big(X^{t,x}_T\big) + \int_t^T H_s\,\mathrm{d}s - \int_t^T Z_s\,\mathrm{d}W_s &&\text{(BSDE)}. \end{aligned} \tag{7}$$

The drift functions $\mu_s$, $H_s$ and diffusion function $\Sigma$ in (7), the terminal objective $g(\cdot)$, and the control function $U(\cdot)$ are Lipschitz continuous with Lipschitz constants $\mu_x, \mu_u, \Sigma_x, H_x, H_z, g_x, u_x$, respectively. The detailed formal statement can be found in Appendix D.1.

**Lemma 1.** Let $(X^{t,x}_s, Y^{t,x}_s, Z^{t,x}_s)_{t \le s \le T}$ be the solution of the FBSDE system (7) satisfying Assumptions 1 and 2. Denote the differences of the processes started at two different states $x_1$ and $x_2$ by

$$\delta X_t = X^{t_0,x_1}_t - X^{t_0,x_2}_t, \quad \delta Y_t = Y^{t_0,x_1}_t - Y^{t_0,x_2}_t, \quad \delta Z_t = Z^{t_0,x_1}_t - Z^{t_0,x_2}_t. \tag{8}$$

Then

$$|\delta Y_T|^2 \le L_1 |x_1 - x_2|^2, \qquad |\delta Y_{t_0}|^2 \le L_2 |x_1 - x_2|^2, \tag{9}$$

where $L_1$ and $L_2$ are constants determined by the Lipschitz constants of Assumption 2; in particular,

$$L_2 = e^{H_z(T - t_0)}\Big[g_x\,e^{\xi(T - t_0)} + H_x\big(e^{\xi(T - t_0)} - 1\big)\,\Sigma_x^{-1} H_z^{-1}\Big], \qquad \xi = 1 + \mu_x + \mu_u u_x + \Sigma_x. \tag{10}$$

Following the arguments in Ma et al. (2002), one further has

$$\|Z_t\|^2_S \le \|\Sigma\|^2_S\,\|\partial_x Y_t\|^2_S \le M_\Sigma L_2, \tag{11}$$

where $\mu_x, \mu_u, u_x, \Sigma_x, H_x, H_z, g_x$ are the Lipschitz constants defined in Assumption 2 and $M_\Sigma$ is the upper bound of $\Sigma$. The proof can be found in Appendix D.2.

Lemma 1 links the initial states to their corresponding value functions. Note that the training labels $Y_T$ are not fixed, since they are functions of $\mathbf{X}_T$, which is driven by the stochastic diffusion $W_t$ and therefore differs at each training iteration.
Furthermore, Lemma 1 suggests that the spread of the training dataset $[\mathbf{x}_0, Y^i_T]$ and of the initial value functions $Y_{t_0}$ can be controlled through the Lipschitz constants $\mu_x, \mu_u, H_x, H_z, u_x$, since the constants $\Sigma_x$ and $g_x$ are fixed once the stochastic dynamics and terminal cost are defined. Increasing data complexity is a crucial element for improving model performance in both RL and supervised learning. In RL, agents are encouraged to explore more states by adding an entropy term to the cost function (Haarnoja et al., 2018) or by injecting artificial stochasticity into observed states (Fortunato et al., 2017), which yields richer training data for learning the action-value function. In supervised learning, the connection between the generalization of deep neural networks and the complexity of the dataset has attracted considerable attention as a principled tool to explain deep learning (DL) (Schmidt et al., 2018; Rozen et al., 2019). In practice, data augmentation is an efficient approach for improving the generalization performance of deep learning models (Cubuk et al., 2018; Fawzi et al., 2016). Motivated by these successes in the DL and RL communities, we modify the FBSDE system (6) to increase the complexity of the training labels $\delta Y_T$ and $\delta Y_{t_0}$ in (9) by encouraging exploration of the agents, while still solving the same HJB PDE (5).

**Theorem 1** (Bender & Moseler, 2010). Let $(X^{i,t,x}_s, Y^{i,t,x}_s, Z^{i,t,x}_s)$ be the solution of the FBSDE system (6) for the $i$-th agent, and let $K_s : [0, T] \times \Omega \to \mathbb{R}^{n_x}$ be any bounded, square-integrable process for the $i$-th agent. Consider the forward process whose drift term is modified by $K_s$,

$$\mathrm{d}\tilde{\mathbf{X}}_s = \big[\mu_s + \Sigma K_s\big]\mathrm{d}s + \Sigma\,\mathrm{d}\mathbf{W}_s, \qquad \tilde{\mathbf{X}}_t = \mathbf{x}_t, \tag{12}$$

along with the corresponding BSDE

$$\mathrm{d}\tilde{Y}^i_s = \big[-h^i_s + \tilde{Z}^{i\,\mathrm{T}}_s K_s\big]\mathrm{d}s + \tilde{Z}^{i\,\mathrm{T}}_s\,\mathrm{d}\mathbf{W}_s, \qquad \tilde{Y}^i_T = g(\tilde{\mathbf{X}}_T). \tag{13}$$

Denote by $(\tilde{X}^{i,t,x}_s, \tilde{Y}^{i,t,x}_s, \tilde{Z}^{i,t,x}_s)$ the solution of the modified FBSDE system (12)-(13). For all $s \in [t, T]$, $(\tilde{X}^{i,t,x}_s, \tilde{Y}^{i,t,x}_s, \tilde{Z}^{i,t,x}_s) = (X^{i,t,x}_s, Y^{i,t,x}_s, Z^{i,t,x}_s)$ a.s. If $(\tilde{Y}^{i,t,x}_s, \tilde{Z}^{i,t,x}_s)$ are defined as $(\tilde{V}^i, \Sigma^{\mathrm{T}} \tilde{V}^i_{\mathbf{x}})$ with $\tilde{V}^i$ being the solution of (5), then $\tilde{V}^i = V^i$ a.e.

Theorem 1 is also known as importance sampling (IS) in the literature (Exarchos & Theodorou, 2018). By modifying the FSDE and BSDE simultaneously, the same HJB PDE is solved almost everywhere. In Lemma 2 we further bound $|\delta Y_T|^2$, $|\delta Y_{t_0}|^2$, and $\|Z_t\|^2_S$ with constants $\tilde{L}_1$ and $\tilde{L}_2$ when IS is applied to the FBSDE system; the detailed statement and proof can be found in Appendix D.3.

**Theorem 2.** Let $(\tilde{X}^{t,x}_s, \tilde{Y}^{t,x}_s, \tilde{Z}^{t,x}_s)_{t \le s \le T}$ be the solution of the FBSDE system with IS (12)-(13) and $(X^{t,x}_s, Y^{t,x}_s, Z^{t,x}_s)_{t \le s \le T}$ the solution of the FBSDE system (7), both satisfying Assumptions 1 and 2. Then, given identical training data $\mathbf{X}_0$ for the FBSDE with and without IS, combining Lemma 1 and Lemma 2 yields

$$\max |\delta Y_T|^2 \le \max |\delta \tilde{Y}_T|^2, \qquad \max |\delta Y_0|^2 \le \max |\delta \tilde{Y}_0|^2. \tag{14}$$

Theorem 2 hints that, given identical training data, the model equipped with IS obtains a more diverse and complex training dataset, so its generalization performance can potentially be enhanced. Fig. 2 shows the performance difference between models with and without IS on an identical, limited-size dataset for the inter-bank problem (Section 4.1). The metrics are the evaluation loss (24) and the Relative Squared Error (RSE) loss (23); IS improves performance for both backbones. In order to isolate the influence of the IS terms $\Sigma K_s$ and $Z^{\mathrm{T}} K_s$, we detached (stopped) their gradients.
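As a concrete illustration of eqs. (12)-(13) and of the gradient stopping described above, the sketch below applies one discretized step of the IS-modified forward and backward processes, with the drift-shift term $K_s$ detached from the computation graph. The helper names and shapes are illustrative assumptions, not the paper's implementation.

```python
import torch

def is_modified_step(x, y, z, k, mu, sigma, h_i, dt):
    """One Euler step of the IS-modified FBSDE (12)-(13).

    x: (B, nx) state      y: (B, 1) value      z: (B, nw) adjoint prediction
    k: (B, nw) importance-sampling drift shift K_s
    mu: (B, nx) original drift   sigma: (B, nx, nw) diffusion   h_i: (B, 1) BSDE drift term
    """
    k = k.detach()                                    # IS term treated as a constant: no extra gradient path
    dw = torch.randn_like(k) * dt ** 0.5
    x_next = x + (mu + torch.einsum("bij,bj->bi", sigma, k)) * dt \
               + torch.einsum("bij,bj->bi", sigma, dw)             # FSDE (12)
    y_next = y + (-h_i + (z * k).sum(-1, keepdim=True)) * dt \
               + (z * dw).sum(-1, keepdim=True)                    # BSDE (13)
    return x_next, y_next
```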
In other words, the IS term is treated as a constant during the training phase, so the rapid convergence is not the result of the additional gradient path introduced by IS.

As a value-based optimal decision algorithm, a natural question is how the policy performs based on the learnt value function. Proposition 1 gives the convergence rate of Deep FBSDE policies under fictitious play.

**Proposition 1** (Han & Long, 2020). Denote by $\delta u^m_t = u^m_t - u^*_t$ the difference between the policy obtained at the $m$-th stage of fictitious play at time step $t$ and the optimal policy. Then

$$\int_0^T \mathbb{E}\,|\delta u^{m+1}_t|^2\,\mathrm{d}t \le \eta(\lambda) \int_0^T \mathbb{E}\,|\delta u^m_t|^2\,\mathrm{d}t, \tag{15}$$

where $\eta(\lambda)$ is the convergence rate.

Proposition 1 characterizes the convergence of fictitious play for the FBSDE system. However, the choice of the constant $\lambda$ is delicate if one is to obtain strict convergence, $0 \le \eta(\lambda) < 1$. We report the numerical value of $1 - \eta(\lambda)$ for the Deep FBSDE model with and without IS, with the FC backbone (Han et al., 2018) and the LSTM backbone (Pereira et al., 2019), in Fig. 3. For the same backbone, IS accelerates the convergence of the policy significantly.

Figure 2. Performance difference of DFP-FBSDE with and without importance sampling over a limited training dataset (evaluation loss and RSE vs. number of initial-state samples in the training dataset), on the 100-agent inter-bank game.

Figure 3. Numerical value of the convergence rate $1 - \eta(\lambda)$ and the model performance evaluated by RSE over fictitious play stages, on the 100-agent inter-bank game.

### 3.2. Mitigating the Curse of Many Agents

Scalability is a crucial criterion for RL. In DFP-FBSDE, each agent is equipped with a policy computed by a distinct neural network. As the number of agents increases, the number of network copies increases correspondingly; meanwhile, each network must be enlarged to have enough capacity to represent many agents, leading to the infamous curse of many agents (Wang et al., 2020). This limits the scalability of prior works. However, one can mitigate the curse of many agents here by taking advantage of the symmetric game setup, whose merits we summarize as:

1. Since all agents have the same dynamics and cost function, only one copy of the network is needed. The strategies of the other agents can be inferred by applying the same network.
2. Thanks to the symmetry, we can apply an IL embedding (Zaheer et al., 2017) to extract invariant features.

**Sharing one network.** Here we introduce some techniques to improve the time and memory complexity. When sharing the same network, we can query the policies of the other agents simply by running inference on our own frozen network from stage $m-1$.

Figure 4. Comparison between DFP-FBSDE with and without IL (evaluation loss vs. stages, for 100 and 500 agents). The backbone is LSTM; the simulation is the inter-bank game.

It is important to note that querying the policies of the other agents should not introduce additional gradient paths, which significantly reduces the memory complexity. When querying the other agents' strategies, one can either iterate through each agent or feed all agents' states to the network in one batch.
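For example, the batched query could evaluate the frozen stage-$(m-1)$ network for all agents in a single forward pass with gradients disabled. The sketch below is illustrative only; the `policy_net` interface and the rolled state layout are assumptions, not the paper's implementation.

```python
import torch

@torch.no_grad()                      # querying opponents must not add gradient paths
def query_opponents_batch(policy_net, joint_state, num_agents):
    """Batch scheme: one parallel forward pass for all agents' policies.

    joint_state: (B, N*nx). Each agent sees the joint state rolled so that its
    own block comes first, so the stacked input is (B*N, N*nx): O(N^2) memory.
    """
    nx = joint_state.shape[-1] // num_agents
    views = [torch.roll(joint_state, shifts=-i * nx, dims=-1) for i in range(num_agents)]
    stacked = torch.stack(views, dim=1)                      # (B, N, N*nx)
    controls = policy_net(stacked.flatten(0, 1))             # (B*N, nu)
    return controls.view(joint_state.shape[0], num_agents, -1)

@torch.no_grad()
def query_opponents_iterative(policy_net, joint_state, num_agents):
    """Iterative scheme: loop over agents; slower, but O(N) memory."""
    nx = joint_state.shape[-1] // num_agents
    return torch.stack([policy_net(torch.roll(joint_state, -i * nx, dims=-1))
                        for i in range(num_agents)], dim=1)
```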
We denote these as the iterative scheme and the batch scheme, respectively. The batch scheme reduces the time complexity by exploiting the parallelism of modern GPUs, but requires $O(N^2)$ memory rather than the $O(N)$ of the iterative scheme.

**Invariant layers (IL).** A comprehensive introduction to invariant networks can be found in Zaheer et al. (2017); later applications in robotics (Shi et al., 2020) and RL (Liu et al., 2020; Wang et al., 2020) verify the success of this technique. In our work the memory complexity can be further reduced with an IL embedding (Zaheer et al., 2017). The IL applies an averaging function to the features of elements in the same set (known as feature averaging) to render the network invariant to permutations of the agents. We apply the IL to $\mathbf{X}_{-i}$, which is insensitive to permutation, and concatenate the resulting features with the features extracted from $X_i$ (a minimal sketch is given at the end of this section). However, a vanilla IL embedding does not by itself reduce the memory complexity, because the feature space is typically high-dimensional even though the averaged feature is small. Thanks to the symmetric problem setup, one can apply a technique that reduces the IL memory complexity from $O(N^2)$ to $O(N)$. A detailed introduction to the IL and our implementation techniques can be found in Appendix E. Furthermore, the invariant representation, especially feature averaging, can theoretically improve the performance of a deep learning model (Lyle et al., 2020) when the features are indeed invariant. Learning the invariant representation of $\mathbf{X}_{-i}$ becomes harder as the number of agents increases; this is illustrated in Fig. 4, where the DFP-FBSDE model equipped with an IL handles the complex invariant representation, and the performance gap widens as the number of agents grows.

### 3.3. Algorithm

Following the analysis in Sections 3.1 and 3.2, we introduce the SDFP-FBSDE algorithm, which scales up to 3,000 agents with manageable time and memory complexity by integrating IS and IL. We empirically verify the time and memory complexity in Fig. 5. The neural network structure is the same as in Fig. 1, except that the feature extractor module (the blue boxes) is replaced by an IL, whose details can be found in Appendix E.2. Meanwhile, the forward and backward processes of $\mathbf{X}$ and $V^i_t$ are modified according to Theorem 1. Inspired by previous work (Theodorou et al., 2010), the IS term $K_s = \Gamma U^*_i$ is chosen as the control computed in the previous run of the algorithm. Based on experimental results, the LSTM is chosen as the backbone in our algorithm. Notably, our algorithm also improves the performance of the framework with the FC backbone used in Han & Hu (2019); the comparison is elaborated in the simulation section. The full algorithm can be found in Appendix K.

Figure 5. Time and memory complexity comparison between the batch, iterative, and IL+batch implementations (memory in GB and per-iteration time vs. number of agents). Time complexity is measured by per-iteration time; IL stands for invariant layer.
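The sketch below illustrates the kind of permutation-invariant, feature-averaging embedding described in Section 3.2: the opponents' states $\mathbf{X}_{-i}$ are encoded element-wise, averaged, and concatenated with the ego agent's features. It is a simplified illustration under assumed layer sizes, not the implementation in Appendix E.2.

```python
import torch
import torch.nn as nn

class InvariantFeature(nn.Module):
    """Permutation-invariant embedding of opponents' states via feature averaging."""
    def __init__(self, agent_dim, ego_feat=32, opp_feat=32):
        super().__init__()
        self.ego_enc = nn.Sequential(nn.Linear(agent_dim, ego_feat), nn.ReLU())
        self.opp_enc = nn.Sequential(nn.Linear(agent_dim, opp_feat), nn.ReLU())

    def forward(self, x_i, x_others):
        """x_i: (B, agent_dim) ego state; x_others: (B, N-1, agent_dim) opponents."""
        ego = self.ego_enc(x_i)
        opp = self.opp_enc(x_others).mean(dim=1)   # average over opponents -> permutation invariant
        return torch.cat([ego, opp], dim=-1)        # concatenated feature fed to the backbone
```

Because the opponents enter only through their averaged encoding, permuting them leaves the output unchanged, and the embedding size is independent of the number of agents.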
## 4. Simulation

In this section we demonstrate the capability of SDFP-FBSDE on two systems in simulation. We first apply the framework to an inter-bank lending/borrowing problem, a classical multi-player non-cooperative game with an analytical solution, and compare against both the analytical solution and prior work (Han & Hu, 2019). We also apply the framework to a variation of the problem for which no analytical solution exists. Finally, we showcase the general applicability of our framework on an autonomous racing problem in belief space and discuss how the BSDE plays an important role in an SDG. All experiment configurations can be found in Appendix F. We plot the results of 3 repeated runs with different seeds, with the line and shaded region showing the mean and the mean ± standard deviation, respectively. The hyperparameters and dynamics coefficients used in the inter-bank experiments are the same as in Han & Hu (2019) unless otherwise noted. Technical differences between the baseline (Han & Hu, 2019) and SDFP-FBSDE are shown in Table 2. The hardware used to run all simulations is listed in Appendix J.

Table 2. Technical differences between the baseline (Han & Hu, 2019) and SDFP. "Batch" stands for the batch scheme.

| Algo | Backbone | Batch | IL | IS |
| --- | --- | --- | --- | --- |
| Baseline | FC+BN | – | – | – |
| SDFP | LSTM | ✓ | ✓ | ✓ |

### 4.1. Inter-bank lending/borrowing problem

We first consider an inter-bank lending/borrowing model (Carmona et al., 2013) in which the dynamics of the log-monetary reserves of $N$ banks are

$$\mathrm{d}X^i_t = \big[a(\bar{X}_t - X^i_t) + U^i_t\big]\mathrm{d}t + \sigma\big(\rho\,\mathrm{d}W^0_t + \sqrt{1-\rho^2}\,\mathrm{d}W^i_t\big), \qquad \bar{X}_t = \frac{1}{N}\sum_{i=1}^N X^i_t, \quad i \in \mathcal{I}. \tag{16}$$

The state $X^i_t \in \mathbb{R}$ denotes the log-monetary reserve of bank $i$ at time $t > 0$. The control $U^i_t$ denotes the cash flow to/from a central bank, whereas $a(\bar{X}_t - X^i_t)$ denotes the lending/borrowing rate of bank $i$ with all other banks. The system is driven by $N$ independent standard Brownian motions $W^i_t$, representing idiosyncratic noise, and a common noise $W^0_t$. The cost function has the form

$$C^i_t(\mathbf{X}, U_i; \mathbf{U}_{-i}) = \frac{1}{2}U_i^2 - q\,U_i(\bar{X} - X_i) + \frac{\epsilon}{2}(\bar{X} - X_i)^2. \tag{17}$$

The constants $q$ and $\epsilon$ can be found in Appendix F. The derivation of the FBSDEs and the analytical solution are in Appendix E.

We compare the results of our algorithm on a 10-agent problem against the analytical solution and against the baseline with the same hyperparameters. Fig. 6 shows that the state and control trajectories produced by SDFP-FBSDE align closely with the analytical solution. Table 3 shows the numerical performance compared with Han & Hu (2019): our method outperforms in RSE (23) and wall-clock training time.

Figure 6. Comparison of SDFP and the analytical solution for the inter-bank problem. Both the state (left) and control (right) trajectories align with the analytical solution (represented by dots).

Table 3. Comparison with Han & Hu (2019) on the 10-agent inter-bank problem.

| Framework | RSE | Time (hr) |
| --- | --- | --- |
| Baseline | 0.0423 | 2.23 |
| SDFP | 0.0105 | 1.55 |

**Ablation experiments.** To verify the effect of combining the IL and IS introduced in Sections 3.1 and 3.2, we perform ablation experiments on the 500-agent inter-bank problem. Fig. 7 shows that the model equipped with IS and IL obtains a superior convergence rate and better results on the evaluation set regardless of the choice of backbone.

Figure 7. Ablation experiments on the LSTM and FC backbones (evaluation loss vs. stages).
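For reference, the sketch below shows how the inter-bank dynamics (16) with common noise and the running cost (17) used throughout these experiments could be rolled out in discrete time. The coefficient values and function names are placeholders, not the configuration in Appendix F.

```python
import torch

def interbank_step(x, u, dt, a=0.1, sigma=0.2, rho=0.2):
    """One Euler step of the inter-bank dynamics (16) with common noise.

    x, u: (B, N) log-monetary reserves and controls of N banks.
    Coefficient values here are placeholders (see Appendix F for the paper's values).
    """
    x_bar = x.mean(dim=-1, keepdim=True)
    dw0 = torch.randn(x.shape[0], 1, device=x.device) * dt ** 0.5      # common noise
    dwi = torch.randn_like(x) * dt ** 0.5                              # idiosyncratic noise
    drift = a * (x_bar - x) + u
    return x + drift * dt + sigma * (rho * dw0 + (1 - rho ** 2) ** 0.5 * dwi)

def interbank_running_cost(x, u, q=1.0, eps=10.0):
    """Running cost (17) for each bank (placeholder q, eps)."""
    x_bar = x.mean(dim=-1, keepdim=True)
    return 0.5 * u ** 2 - q * u * (x_bar - x) + 0.5 * eps * (x_bar - x) ** 2
```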
**High-dimension experiments.** Convinced by the ablation experiments, we compare the baseline and our proposed algorithm across different numbers of agents (up to 3,000). Because of the graphics-memory limits of the hardware, the maximum number of agents for the baseline is 700. The results are shown in Fig. 8: SDFP-FBSDE outperforms the baseline in cumulative total cost (25) and RSE (23). As shown in the FBSDE theory (6), the augmented control is characterized by $\mathbf{U}^{0,*}$; more agents therefore mean more actuation and better exploration, so the RSE loss decreases while the number of agents is below 300. As the number of agents keeps growing, the network struggles to learn a representation of many agents, and the RSE increases. Our method mitigates the curse of many agents and postpones this RSE turning point to 1,000 agents. Notably, SDFP-FBSDE at 3,000 agents has only a marginally larger RSE loss than the baseline at 300 agents. The left subplot of Fig. 8 shows that our algorithm finds more reasonable control sequences than the baseline. The margin between SDFP-FBSDE and the baseline is even larger in terms of evaluation loss (24); since that plot is uninformative due to the large difference in scale, we defer it to Appendix H.1.

Figure 8. Comparison of SDFP-FBSDE and the baseline on the inter-bank problem for different numbers of agents, evaluated by cumulative loss (25) and RSE (23).

**Superlinear experiments.** We also consider a variant of the dynamics in equation (16):

$$\mathrm{d}X^i_t = \big[a(\bar{X}_t - X^i_t)^3 + U^i_t\big]\mathrm{d}t + \sigma\big(\rho\,\mathrm{d}W^0_t + \sqrt{1-\rho^2}\,\mathrm{d}W^i_t\big), \qquad \bar{X}_t = \frac{1}{N}\sum_{i=1}^N X^i_t, \quad i \in \mathcal{I}. \tag{18}$$

Due to the nonlinearity in the drift term, no analytical solution or simple numerical representation of the Nash equilibrium exists (Han & Hu, 2019). The drift rate $a$ is set to 1.0 to compensate for the vanishing drift caused by the super-linearity. Heuristically, the distributions of control and state should be more concentrated than those of the linear dynamics. We compare the state and control of agent $i$ at terminal time against the analytical solution and the Deep FBSDE solution of the linear dynamics with the same coefficients. Fig. 9 is generated by evaluating the trained Deep FBSDE model with a batch size of 50,000, and shows that the solution of the superlinear dynamics is indeed more concentrated. The terminal control distribution confirms that the superlinear drift pushes the state back to the average faster than the linear dynamics and thus requires less control effort. Since a numerical solution is not available, we compare the cumulative loss (25) and evaluation loss (24) between the baseline and our algorithm (see Appendix H.2).

Figure 9. Terminal-time state $X_i$ and control $U_i$ distributions of the $i$-th agent for the linear and superlinear dynamics. The analytic and predicted distributions for the linear case are very similar.

### 4.2. Extension to a Partially Observed Game in Robotics

In this section, we demonstrate the general applicability of our framework on an autonomous racing example in belief space, and show experimentally how the BSDE influences the game when competition is triggered. We consider an autonomous racing problem with race-car dynamics

$$\dot{X}_i = \big[v\cos\theta,\;\; v\sin\theta,\;\; u_{\mathrm{acc}} - c_{\mathrm{drag}}v,\;\; u_{\mathrm{steer}}\,v/L\big]^{\mathrm{T}}, \tag{19}$$

where $X_i = [x, y, v, \theta]^{\mathrm{T}}$ contains the $x, y$ position, forward velocity, and heading of the $i$-th vehicle. Here we assume $x, y, v, u_{\mathrm{acc}} \in \mathbb{R}$ and $u_{\mathrm{steer}} \in [-1, 1]$. The goal of each player is to drive faster than the opponent, stay on the track, and avoid collision.
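A discretized version of the single-car dynamics (19) might look like the sketch below; the drag coefficient, wheelbase, and time step are placeholder values rather than the configuration in Appendix I.

```python
import torch

def race_car_step(x, u_acc, u_steer, dt=0.05, c_drag=0.1, wheelbase=2.5):
    """One Euler step of the race-car dynamics (19).

    x: (B, 4) state [x, y, v, theta]; u_steer is expected in [-1, 1].
    Coefficients are placeholders, not the paper's configuration.
    """
    px, py, v, theta = x.unbind(dim=-1)
    dx = torch.stack([v * torch.cos(theta),
                      v * torch.sin(theta),
                      u_acc - c_drag * v,
                      u_steer * v / wheelbase], dim=-1)
    return x + dx * dt
```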
The running cost $C_i$ is defined in Appendix I.2. During the game, players have access to partially observed global augmented states estimated by an Extended Kalman Filter (EKF) and to the opponent's controller from past games. Additionally, we assume that stochasticity enters the system through the control channels, and we use a continuous-time noisy observation model with full-state observation. The FBSDE derivation for belief-space stochastic dynamics is included in Appendix I. We focus on the 2-car racing scenario for interpretability of the results, since no analytical solution exists for this problem. With the 2-car setup, interesting behaviors arise when the cost function of each car is altered. Since all trials complete one lap, we only show the first 8 seconds of each run for neatness.

Figure 10. 64 trials of racing. Panels: both cars without competition loss (top left); both cars with competition loss (top right); only Car 1 with competition loss (bottom left); only Car 2 with competition loss (bottom right). The performance varies with respect to the competition loss.

**Non-cooperative and non-competitive case.** We first test the capability of our framework on a partially observed learning problem where all states are estimated under observation noise and additive noise in the system. Both cars stay on the track as expected, and since there is no competition between them, they demonstrate similar behaviors. The trajectories of 64 trials are shown in the top left of Fig. 10. The EKF posterior estimate is not drawn in Fig. 10 for neatness; single-trial plots with the EKF posterior can be found in Appendix H.3.

**Competitive case.** When we add a competition loss (see Appendix I.2) to the cars, the propagation of the BSDE (6) is modified accordingly, and the racing game exhibits interesting properties. When the competition loss is applied to both cars, both try to cut the corner in order to occupy the leading position, as shown in the top right of Fig. 10, and both travel a longer distance than the cars without competition loss. When the competition loss is present in only one of the two cars, the car with the competition loss dominates the game, as shown in the bottom subplots of Fig. 10.

## 5. Conclusion

In this paper, we first extend the theoretical analysis of Han & Long (2020) and introduce importance sampling to improve sample efficiency and convergence rate. To handle a larger number of agents with manageable time and memory complexity, we propose a batch query scheme and an invariant layer implementation. The scalability of our algorithm, along with a detailed sensitivity analysis, is demonstrated on an inter-bank borrowing/lending example; our framework achieves better performance across multiple metrics and scales to significantly higher dimensions than the state of the art. The general applicability of our framework is showcased on a belief-space racing problem in the partially observed setting.

## Acknowledgement

This research is supported under ARO W911NF2010151. We would also like to thank Professor Matthieu Bloch and Guan-Horng Liu for helpful discussions.

## References

Bender, C. and Moseler, T. Importance sampling for backward SDEs. Stochastic Analysis and Applications, 28(2):226-253, 2010.

Brown, G. W. Iterative solution of games by fictitious play. Activity Analysis of Production and Allocation, 13(1):374-376, 1951.

Buckdahn, R. and Li, J. Stochastic differential games and viscosity solutions of Hamilton-Jacobi-Bellman-Isaacs equations. SIAM Journal on Control and Optimization, 47(1):444-475, 2008.
Carmona, R. and Delarue, F. Probabilistic analysis of mean-field games. SIAM Journal on Control and Optimization, 51(4):2705-2734, 2013.

Carmona, R., Fouque, J.-P., and Sun, L.-H. Mean field games and systemic risk. Available at SSRN 2307814, 2013.

Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le, Q. V. AutoAugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.

Duncan, T. and Pasik-Duncan, B. Some stochastic differential games with state dependent noise. In 54th IEEE Conference on Decision and Control, Osaka, Japan, December 15-18, 2015.

Exarchos, I. and Theodorou, E. A. Stochastic optimal control via forward and backward stochastic differential equations and importance sampling. Automatica, 87:159-165, 2018.

Exarchos, I., Theodorou, E., and Tsiotras, P. Stochastic differential games: A sampling approach via FBSDEs. Dynamic Games and Applications, 9(2):486-505, 2019.

Fawzi, A., Samulowitz, H., Turaga, D., and Frossard, P. Adaptive data augmentation for image classification. In 2016 IEEE International Conference on Image Processing (ICIP), pp. 3688-3692. IEEE, 2016.

Fortunato, M., Azar, M. G., Piot, B., Menick, J., Osband, I., Graves, A., Mnih, V., Munos, R., Hassabis, D., Pietquin, O., et al. Noisy networks for exploration. arXiv preprint arXiv:1706.10295, 2017.

Greif, C. Numerical methods for Hamilton-Jacobi-Bellman equations. 2017.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861-1870. PMLR, 2018.

Han, J. and Hu, R. Deep fictitious play for finding Markovian Nash equilibrium in multi-agent games. arXiv preprint arXiv:1912.01809, 2019.

Han, J. and Long, J. Convergence of the deep BSDE method for coupled FBSDEs. Probability, Uncertainty and Quantitative Risk, 5(1):1-33, 2020.

Han, J., Jentzen, A., and Weinan, E. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505-8510, 2018.

Harmon, M. E., Baird III, L. C., and Klopf, A. H. Reinforcement learning applied to a differential game. Adaptive Behavior, 4(1):3-28, 1995.

Hu, R. Deep fictitious play for stochastic differential games. arXiv preprint arXiv:1903.09376, 2019.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448-456. PMLR, 2015.

Isaacs, R. Differential Games: A Mathematical Theory with Applications to Warfare and Pursuit, Control and Optimization. New York: Wiley, 1965.