# Dynamic Game Theoretic Neural Optimizer

**Guan-Horng Liu**¹ ² **Tianrong Chen**³ **Evangelos A. Theodorou**¹ ²

¹Center for Machine Learning, ²School of Aerospace Engineering, ³School of Electrical and Computer Engineering, Georgia Institute of Technology, USA. Correspondence to: Guan-Horng Liu.

*Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).*

**Abstract.** The connection between training deep neural networks (DNNs) and optimal control theory (OCT) has attracted considerable attention as a principled tool for algorithmic design. Although a few attempts have been made, they have been limited to architectures whose layer propagation resembles a Markovian dynamical system. This casts doubt on their flexibility for modern networks that rely heavily on non-Markovian dependencies between layers (e.g. skip connections in residual networks). In this work, we propose a novel dynamic game perspective by viewing each layer as a player in a dynamic game characterized by the DNN itself. Through this lens, different classes of optimizers can be seen as matching different types of Nash equilibria, depending on the implicit information structure of each (p)layer. The resulting method, called Dynamic Game Theoretic Neural Optimizer (DGNOpt), not only generalizes OCT-inspired optimizers to a richer class of networks; it also motivates a new training principle by solving a multi-player cooperative game. DGNOpt shows convergence improvements over existing methods on image classification datasets with residual and inception networks. Our work marries strengths from both OCT and game theory, paving the way for new algorithmic opportunities from robust optimal control and bandit-based optimization.

*Figure 1. Dynamic game perspective of the generic DNN training process, where we treat layer modules as players in a dynamic game and solve for the related Nash equilibria (best viewed in color).*

## 1. Introduction

Attempts from different disciplines to provide a fundamental understanding of deep learning have advanced rapidly in recent years. Among them, the interpretation of DNNs as discrete-time nonlinear dynamical systems has received tremendous focus. By viewing each layer as a distinct time step, it motivates principled analysis ranging from numerical differential equations (Weinan, 2017; Lu et al., 2017) to physics (Greydanus et al., 2019). For instance, casting residual networks (He et al., 2016) as a discretization of ordinary differential equations enables fundamental reasoning about the loss landscape (Lu et al., 2020) and inspires new architectures with numerical stability or a continuous limit (Chang et al., 2018; Chen et al., 2018).

This dynamical-system viewpoint also motivates control-theoretic analysis, which further recasts the network weights as controls. With that, the training process can be viewed as an optimal control problem, as both methodologies aim to optimize some variables (weights vs. controls) subject to a chain structure (network vs. dynamical system). This connection has led to theoretical characterizations of the learning process (Weinan et al., 2018; Hu et al., 2019; Liu & Theodorou, 2019) and practical methods for hyperparameter adaptation (Li et al., 2017b) or computational acceleration (Gunther et al., 2020; Zhang et al., 2019). Algorithmic progress, however, remains relatively limited.
This is because OCT-inspired training methods are, by construction, restricted to the class of networks that resemble Markovian state-space models (Liu et al., 2021; Li & Hao, 2018; Li et al., 2017a). This raises questions about their flexibility and scalability for training modern architectures composed of complex dependencies between layers. It is unclear whether the dynamical-system and optimal-control interpretation remains suitable, or how it should be adapted, in those cases.

In this work, we address the aforementioned issues using dynamic game theory, a discipline of interactive decision making (Yeung & Petrosjan, 2006) built upon optimal control and game theory. Specifically, we propose to treat each layer as a player in a dynamic game connected through the network propagation. The additional dimension gained from the multi-player setting allows us to generalize OCT-inspired methods to a much richer class of networks. Further, introducing game-theoretic analysis, e.g. the information structure, provides a novel algorithmic connection between different classes of training methods from a Nash-equilibria standpoint (Fig. 1).

Unlike prior game-related works, which typically cast the whole network as a player competing over training iterations (Goodfellow et al., 2014; Balduzzi et al., 2018), the (p)layers in our dynamic game interact along the network propagation. This naturally leads to a coalition game, since all players share the same objective. The resulting cooperative training scheme urges the network to achieve group optimality, or Pareto efficiency (Pardalos et al., 2008). As we will show through experiments, this improves convergence when training modern architectures, as richer information flows between layers to compute the updates. We name our method Dynamic Game Theoretic Neural Optimizer (DGNOpt). Notably, casting the network as the realization of a game has previously appeared in analyzing the convergence of Back-propagation (Balduzzi, 2016) or the contribution of neurons (Stier et al., 2018; Ghorbani & Zou, 2020). Our work instead focuses on developing game-theoretic training methods and on how they connect to, or generalize, existing optimizers.

In summary, we present the following contributions.

- We draw a novel algorithmic characterization from the Nash-equilibria perspective by framing the training process as solving a multi-player dynamic game.
- We propose DGNOpt, a game-theoretic optimizer that generalizes OCT-inspired methods to a richer class of networks and encourages cooperative updates among layers with an enlarged information structure.
- Our method achieves competitive results on image classification with residual and inception nets, enabling rich applications from robust control and bandit analysis.

## 2. Preliminaries

**Notation:** Given a real-valued function $F_s$ indexed by $s \in \mathcal{S}$, we shorthand its derivatives evaluated at $(x_s, \theta_s)$ as $\nabla_{x_s} F_s \equiv F^s_x$, $\nabla^2_{x_s} F_s \equiv F^s_{xx}$, $\nabla_{x_s}\nabla_{\theta_s} F_s \equiv F^s_{x\theta}$, etc. Throughout this work, we reserve $n \in \{1, \dots, N\}$ as the player index and $t \in \{0, 1, \dots, T-1\}$ as the propagation order along the network, or equivalently the stage sequence of the game (see Fig. 1). We abbreviate them as $n \in [N]$ and $t \in [T]$ for brevity. Composition of functions is denoted by $f(g(\cdot)) \equiv (f \circ g)(\cdot)$. We use $\dagger$, $\odot$, and $\otimes$ to denote pseudo-inversion, the Hadamard product, and the Kronecker product. A complete notation table can be found in Appendix A.

### 2.1. Training Feedforward Nets with Optimal Control
Let the layer propagation rule in feedforward networks (e.g. fully-connected and convolutional networks) with depth $T$ be

$$z_{t+1} = f_t(z_t, \theta_t), \quad t \in [T], \tag{1}$$

where $z_t$ and $\theta_t$ represent the vectorized hidden state and parameter at each layer $t$. For instance, $\theta_t := \mathrm{vec}([W_t, b_t])$ for a fully-connected layer $f_t(z_t, \theta_t) := \sigma(W_t z_t + b_t)$ with nonlinear activation $\sigma(\cdot)$. Equation (1) can be interpreted as a discrete-time Markovian model propagating the state $z_t$ with the tunable variable $\theta_t$. With that, the training process, i.e. finding optimal parameters $\{\theta_t : t \in [T]\}$ for all layers, can be described by Optimal Control Programming (OCP),

$$\min_{\theta_t : t \in [T]} \; L := \phi(z_T) + \sum_{t=0}^{T-1} \ell_t(\theta_t) \quad \text{s.t. (1)}. \tag{2}$$

The objective $L$ consists of a loss $\phi$ incurred by the network prediction $z_T$ (e.g. cross-entropy in classification) and the layer-wise regularization $\ell_t$ (e.g. weight decay). Although (2) considers only one data point $z_0$, it can easily be modified to accommodate batch training (Weinan et al., 2018). Hence, minimizing $L$ sufficiently describes the training process.

Equation (2) provides an OCP characterization of training feedforward networks. First, the optimality principles for OCP, according to standard optimal control theory, typically involve solving some time-dependent objectives recursively from the terminal stage $T$. Previous works have shown that these backward processes relate closely to the computation of Back-propagation (Li et al., 2017a; Liu et al., 2021). Further, the parameter update of each layer, $\theta_t \leftarrow \theta_t - \delta\theta_t$, can be seen as solving these layer-wise OCP objectives with certain approximations. To ease the notational burden, we leave a thorough discussion to Appendix B. We stress that this intriguing connection is, however, limited to the particular network class described by (1).

### 2.2. Multi-Player Dynamic Game (MPDG)

Following the terminology of Yeung & Petrosjan (2006), in a discrete-time $N$-player dynamic game, Player $n$ commits to the action $\theta_{t,n}$ at each stage $t$ and seeks to minimize

$$L_n(\bar\theta_n; \bar\theta_{-n}) := \phi_n(x_T) + \sum_{t=0}^{T-1} \ell_{t,n}(\theta_{t,1}, \dots, \theta_{t,N}) \quad \text{s.t.} \quad x_{t+1} = F_t(x_t, \theta_{t,1}, \dots, \theta_{t,N}), \quad \theta_{t,n} \equiv \theta_{t,n}(\eta_{t,n}), \tag{3}$$

where $\bar\theta_n := \{\theta_{t,n} : t \in [T]\}$ denotes the action sequence of Player $n$ throughout the game. The set $-n := \{i \in [N] : i \neq n\}$ includes all players except Player $n$. The key components that characterize an MPDG (3) are detailed as follows.

- **Shared dynamics $F_t$.** The stage-wise propagation rule for $x_t$, affected by the actions of all players $\theta_{t,n}$, $\forall n$.
- **Payoff/Cost $L_n$.** The objective for each player, accumulating the costs $(\phi_n, \ell_{t,n})$ incurred at each stage.
- **Information structure $\eta_{t,n}$.** The set of information available to Player $n$ at stage $t$ for making the decision $\theta_{t,n}$.

The Nash equilibria $\{(\bar\theta^*_1, \dots, \bar\theta^*_N)\}$ of (3) form a set of stationary points at which no player has an incentive to deviate from its decision. Mathematically, this can be described by

$$L_n(\bar\theta^*_n; \bar\theta^*_{-n}) \le L_n(\bar\theta_n; \bar\theta^*_{-n}), \quad \forall n \in [N], \; \forall \bar\theta_n \in \bar\Theta_n,$$

where $\bar\Theta_n$ denotes the set of admissible actions for Player $n$. When the players agree to cooperate, upon an agreement on a set of strategies and a mechanism to distribute the payoff/cost, a cooperative game (CG) of (3) is formed. A CG requires additional optimality principles to be satisfied. These include (i) group rationality (GR), which requires all players to optimize their joint objective,

$$L := \min_{\bar\theta_1, \dots, \bar\theta_N} \sum_{n=1}^{N} L_n(\bar\theta_n; \bar\theta_{-n}), \tag{4}$$

and (ii) individual rationality (IR), which requires that the cost distributed to each player from $L$ be at most the cost he/she would suffer when playing against the others non-cooperatively. Intuitively, IR justifies the participation of each player in the CG.
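To make the MPDG formulation (3)-(4) concrete, the minimal NumPy sketch below rolls out a two-player game with shared dynamics and evaluates each player's cost and the group-rationality objective. The specific dynamics (a sum of two parameterized branches), the quadratic terminal cost, and all sizes are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

# Rollout and per-player costs of a discrete-time two-player dynamic game (eq. 3).
rng = np.random.default_rng(0)
N, T, d = 2, 3, 4
theta = [[rng.normal(scale=0.5, size=(d, d)) for n in range(N)] for t in range(T)]

def F_t(x, theta_t):
    """Shared dynamics: every player's action theta_{t,n} affects the next state."""
    return np.tanh(sum(W @ x for W in theta_t))

def player_cost(n, x0, target, theta, wd=1e-4):
    """L_n = phi_n(x_T) + sum_t l_{t,n}(theta_{t,n})  (eq. 3)."""
    x, run = x0, 0.0
    for t in range(T):
        run += wd * np.sum(theta[t][n] ** 2)      # stage cost l_{t,n} (weight decay)
        x = F_t(x, theta[t])                      # shared state propagation
    return 0.5 * np.sum((x - target) ** 2) + run  # terminal cost phi_n(x_T)

x0, target = rng.normal(size=d), rng.normal(size=d)
L = [player_cost(n, x0, target, theta) for n in range(N)]
L_joint = sum(L)                                  # group-rationality objective (eq. 4)
print(L, L_joint)
```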
## 3. Dynamic Game Theoretic Perspective

### 3.1. Formulating DNNs as Dynamic Games

In this section, we draw a novel correspondence between the three components of the MPDG (3) and the training process of generic (i.e. non-Markovian) DNNs. Given a network composed of layer modules $\{f_i(\cdot, \theta_i)\}$, where $\theta_i$ denotes the trainable parameters of layer $f_i$ similar to (1), we treat each layer as a player in an MPDG. The network can be converted into the form of $F_t$ by indexing $i := (t, n)$, where $t$ represents the sequential order from network input to prediction and $n$ indexes the layers aligned at stage $t$. Fig. 2 demonstrates such an example for a residual block. When the network propagation collapses from multiple paths to a single one, we can consider either duplicated players sharing the same path or dummy players with a null action space. Hence, w.l.o.g. we treat $N$ as fixed over $t$. Notice that the assignment $i := (t, n)$ may not be unique; we will discuss its algorithmic implication later in §5.2.

*Figure 2. Example of representing a residual block as $F_t$. Note that $x_t$ augments all hidden states across parallel paths.*

Once the shared dynamics is constructed, the payoff $L_n$ can readily be linked to the training objective. Since $\ell_{t,n}$ corresponds to the weight decay of layer $f_{t,n}$, it follows that $\ell_{t,n} := \ell_{t,n}(\theta_{t,n})$. Also, we have $\phi_n := \phi$ whenever all (p)layers share the same task,¹ e.g. in classification. In short, the network architecture and the training objective respectively characterize the structure of a dynamic game and its payoff.

¹One example of a multi-task objective is the auxiliary loss used in deep reinforcement learning (Jaderberg et al., 2016).

### 3.2. Information Structure and Nash Optimality

We now turn to the role of the information structure $\eta_{t,n}$. Standard game-theoretic analysis suggests that $\eta_{t,n}$ determines the type of Nash equilibria inherited by the MPDG (Petrosjan, 2005). Below we introduce several variants of interest, starting from the one with the least structure.

**Definition 1** (Open-loop Nash equilibrium (OLNE)). Let $\eta^{\mathrm O}_{t,n} := \{x_0\}$ be the open-loop information structure. Then a set of actions, $\{\theta^*_{t,n} : \forall t, n\}$, provides an OLNE to (3) if

$$\theta^*_{t,n} = \arg\min_{\theta_{t,n}} H_{t,n}(x_t, p_{t+1,n}, \theta_{t,n}, \theta^*_{t,-n}), \tag{5}$$

where $\theta^*_{t,n} \equiv \theta^*_{t,n}(\eta^{\mathrm O}_{t,n})$ and $H_{t,n} := \ell_{t,n} + F_t^{\mathsf T} p_{t+1,n}$ is the Hamiltonian for Player $n$ at stage $t$. The co-state $p_{t,n}$ is a vector of the same size as $x_t$ and can be simulated from the backward adjoint process, $(p_{t,n},\, p_{T,n}) := (H^{t,n}_x,\, \phi^n_x)$.

The Hamiltonian objective $H_{t,n}$ varies for each $(t, n)$ and depends on the succeeding stage via the co-state $p_{t+1,n}$. When $N=1$, (5) degenerates to the celebrated Pontryagin principle (Pontryagin et al., 1962), which provides the necessary condition for the OCP (2). This motivates the following result.

**Proposition 2.** Solving $\theta^*_{t,n} = \arg\min H_{t,n}$ with the iterative update $\theta_{t,n} \leftarrow \theta_{t,n} - M^{\dagger} H^{t,n}_\theta$ recovers the descent direction of standard training methods. Specifically, setting $M$ to

$$I, \qquad \operatorname{diag}\!\big(\sqrt{H^{t,n}_\theta \odot H^{t,n}_\theta}\big), \qquad H^{t,n}_\theta\,{H^{t,n}_\theta}^{\mathsf T}$$

yields the updates of SGD, RMSprop, and Gauss-Newton, respectively (with these matrices estimated over mini-batches in practice).

Proposition 2 provides a similar OCP characterization (cf. §2.1), except for a more generic network class represented by $F_t$. It also gives our first game-theoretic interpretation of DNN training: standard training methods implicitly match an OLNE defined upon the network propagation. The proof (see Appendix C) relies on constructing a set of co-states $p_{t,n}$ such that $H^{t,n}_\theta \equiv \nabla_{\theta_{t,n}} H_{t,n}$ gives the exact gradient w.r.t. the parameters of layer $f_{t,n}$.
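As a small illustration of Definition 1 and Proposition 2, the sketch below runs the backward adjoint pass on a toy three-layer tanh network (a Markovian, $N{=}1$ instance of the dynamics) to obtain the Hamiltonian gradients $H_\theta$, and then applies the three preconditioner choices to one layer's gradient. The toy network, the quadratic terminal loss, and the omission of weight decay and moving averages are assumptions made for brevity.

```python
import numpy as np

# Toy 3-layer net z_{t+1} = tanh(W_t z_t) viewed as the Markovian system (1).
rng = np.random.default_rng(0)
d, T = 4, 3
Ws = [rng.normal(scale=0.5, size=(d, d)) for _ in range(T)]
z0, target = rng.normal(size=d), rng.normal(size=d)

zs = [z0]                                  # forward pass, caching states
for W in Ws:
    zs.append(np.tanh(W @ zs[-1]))

p = zs[-1] - target                        # p_T = phi_x for phi = 0.5 ||z_T - target||^2
H_theta = [None] * T
for t in reversed(range(T)):
    D = np.diag(1.0 - np.tanh(Ws[t] @ zs[t]) ** 2)   # Jacobian of tanh pre-activation
    H_theta[t] = np.outer(D @ p, zs[t])              # H_theta = F_theta^T p_{t+1}
    p = Ws[t].T @ (D @ p)                            # co-state p_t = H_x = F_x^T p_{t+1}

# Proposition 2: three choices of preconditioner M applied to one layer's gradient.
g = H_theta[0].ravel()
sgd_dir     = g                                      # M = I
rmsprop_dir = g / (np.sqrt(g * g) + 1e-8)            # M = diag(sqrt(g ⊙ g)), no moving average here
gn_dir      = np.linalg.pinv(np.outer(g, g)) @ g     # M = g g^T (Gauss-Newton, via pseudo-inverse)
```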
The degenerate information structure $\eta^{\mathrm O}_{t,n}$ implies that optimizers of this class utilize minimal knowledge of the game (i.e. network) structure. This is in contrast to the following Nash equilibrium, which relies on richer information.

*Table 1. Dynamic game theoretic perspective of DNN training.*

| Nash Equilibria | Information Structure | Optimality Principle | Class of Optimizer |
|---|---|---|---|
| OLNE | $\eta^{\mathrm O}_{t,n}$ | $\min H_{t,n}$ in (5) | Baselines |
| FNE | $\eta^{\mathrm C}_{t,n}$ | $\min Q_{t,n}$ in (6) | DGNOpt (ours) |
| GR | $\eta^{\text{C-CG}}_{t,n}$ | $\min P_t$ in (7) | DGNOpt (ours) |

**Definition 3** (Feedback Nash equilibrium (FNE)). Let $\eta^{\mathrm C}_{t,n} := \{x_s : s \le t\}$ be the closed-loop information structure. Then a set of strategies, $\{\pi^*_{t,n} : \forall t, n\}$, is called an FNE to (3) if it is the solution to the Isaacs-Bellman equation (6),

$$V_{t,n}(x_t) = \min_{\pi_{t,n}} Q_{t,n}(x_t, \pi_{t,n}, \pi^*_{t,-n}), \quad V_{T,n} = \phi_n, \quad \text{where} \quad Q_{t,n} := \ell_{t,n} + V_{t+1,n} \circ F_t \tag{6}$$

is the Isaacs-Bellman objective for Player $n$ at stage $t$. Also, $\pi_{t,n} \equiv \theta_{t,n}(x_t; \eta^{\mathrm C}_{t,n})$ denotes an arbitrary mapping from $x_t$ to $\theta_{t,n}$, conditioned on the closed-loop structure $\eta^{\mathrm C}_{t,n}$.

Under the closed-loop information structure $\eta^{\mathrm C}_{t,n}$, each player has complete access to all preceding states up to the current stage $t$. Consequently, it is preferable to solve for a state-dependent, i.e. feedback, strategy $\pi^*_{t,n}$ rather than a state-independent action $\theta^*_{t,n}$ as in the OLNE. Similar to (5), the Isaacs-Bellman objective $Q_{t,n}$ is constructed for each $(t, n)$, except that it now carries a function $V_{t,n}(\cdot)$ backward from the terminal stage, rather than the co-state. This value function $V_{t,n}$ summarizes the optimal cost-to-go for Player $n$ from each state $x_t$, provided all subsequent stages are minimized accordingly. When $N=1$, (6) collapses to standard Dynamic Programming (DP; Bellman, 1954), an alternative optimality principle parallel to Pontryagin's. For nontrivial $N>1$, solving the FNE optimality (6) provides a game-theoretic extension of previous DP-inspired training methods, e.g. Liu et al. (2021), to generic (i.e. non-Markovian) architectures.

### 3.3. Cooperative Game Optimality

Now let us consider the CG formulation. When a cooperative agreement is reached, each player is aware of how the others react to the game. This can be expressed mathematically by the following information structures,

$$\eta^{\text{O-CG}}_{t,n} := \{x_0,\, \theta^*_{t,-n}\} \quad \text{and} \quad \eta^{\text{C-CG}}_{t,n} := \{x_s,\, \pi^*_{t,-n} : s \le t\},$$

which enlarge $\eta^{\mathrm O}_{t,n}$ and $\eta^{\mathrm C}_{t,n}$ with additional knowledge from the other players $-n$. We can characterize the inherited optimality principles similarly to the OLNE and FNE. Taking $\eta^{\text{C-CG}}_{t,n}$ for instance, the joint optimization in GR (4) requires the following.

**Definition 4** (Cooperative feedback solution). A set of strategies, $\{\pi^*_{t,n} : \forall t, n\}$, provides an optimal feedback solution to the joint optimization (4) if it solves

$$W_t(x_t) = \min_{\pi_{t,n} : n \in [N]} P_t(x_t, \pi_{t,1}, \dots, \pi_{t,N}), \quad W_T = \sum_{n=1}^{N} \phi_n, \quad \text{where} \quad P_t := \sum_{n=1}^{N} \ell_{t,n} + W_{t+1} \circ F_t \tag{7}$$

is the group-rational Bellman objective at stage $t$. Here $\pi_{t,n} \equiv \theta_{t,n}(x_t; \eta^{\text{C-CG}}_{t,n})$ denotes an arbitrary mapping from $x_t$ to $\theta_{t,n}$, conditioned on the cooperative closed-loop structure $\eta^{\text{C-CG}}_{t,n}$.

Notice that (7) is the GR extension of (6).

*Figure 3. (a) The cooperative-block network (CB-Net) and (b) its training performance on MNIST (training loss vs. training iteration for EKFAC (baseline), EKFAC+$\eta^{\text{O-CG}}_{t,n}$ (ours), and EKFAC+$\eta^{\text{C-CG}}_{t,n}$ (ours)) when the optimizer (EKFAC) is exposed to different information structures $\eta_{t,n}$. Note that by Proposition 2, the EKFAC baseline utilizes only the open-loop structure $\eta^{\mathrm O}_{t,n}$.*
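To build intuition for the backward recursion in (6), the toy sketch below solves the single-player case ($N{=}1$), where (6) collapses to standard DP, exactly on a discretized scalar state-action grid. The dynamics, costs, grids, and target value are illustrative assumptions rather than anything from the paper.

```python
import numpy as np

# Exact DP solution of the N = 1 case of the Isaacs-Bellman recursion (6):
#   V_t(x) = min_theta [ l_t(theta) + V_{t+1}(f_t(x, theta)) ],  V_T = phi.
xs     = np.linspace(-2.0, 2.0, 201)          # discretized scalar state x_t
thetas = np.linspace(-2.0, 2.0, 101)          # discretized scalar action theta_t
T, lam, target = 4, 0.1, 0.7

f   = lambda x, th: np.tanh(th * x)           # toy dynamics F_t
ell = lambda th: lam * th ** 2                # stage cost l_t (weight decay)
V   = 0.5 * (xs - target) ** 2                # terminal value V_T = phi

policies = []
for t in reversed(range(T)):
    # Q[i, j] = ell(theta_j) + V_{t+1}(f(x_i, theta_j)); V_{t+1} via interpolation.
    Q = ell(thetas)[None, :] + np.interp(f(xs[:, None], thetas[None, :]), xs, V)
    policies.append(thetas[np.argmin(Q, axis=1)])   # feedback strategy pi_t(x)
    V = Q.min(axis=1)                               # value function V_t(x)
policies.reverse()
print("V_0 at x = 0.5:", np.interp(0.5, xs, V))
```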
Both optimality principles solve for a set of feedback strategies, except that (7) considers a joint objective $P_t$ summed over all players. Hence, it is sufficient to carry a joint value function $W_t$ backward. We leave the discussion of $\eta^{\text{O-CG}}_{t,n}$ to Appendix C.

To emphasize the importance of the information structure, consider the architecture in Fig. 3a, where each pair of parallel layers shares the same input and output; hence the network resembles a two-player dynamic game with $F_t := f_{t,1} + f_{t,2}$. As shown in Fig. 3b, providing different information structures to the same optimizer, EKFAC (George et al., 2018), greatly affects the training. Having richer information tends to achieve better performance. Additionally, the fact that

$$\eta^{\mathrm O}_{t,n} \subseteq \eta^{\mathrm C}_{t,n} \subseteq \eta^{\text{C-CG}}_{t,n} \quad \text{and} \quad \eta^{\mathrm O}_{t,n} \subseteq \eta^{\text{O-CG}}_{t,n} \subseteq \eta^{\text{C-CG}}_{t,n} \tag{8}$$

also implies an algorithmic connection between different classes of optimizers, which we explore in §4.3.

Table 1 summarizes our game-theoretic analysis. Each information structure suggests its own Nash equilibrium and optimality principle, which characterizes a distinct class of training methods. We already established the connection between the baselines and $H_{t,n}$ in Proposition 2. In the next section, we derive methods for solving $(Q_{t,n}, P_t)$.

## 4. Training DNNs by Solving a Dynamic Game

In this section, we derive a new second-order method, called Dynamic Game Theoretic Neural Optimizer (DGNOpt), that solves (6) and (7) as an alternative way to train DNNs. While we focus on the residual network for its popularity and algorithmic simplicity when deriving the analytic update, we stress that our methodology applies to other architectures. A full derivation is left to Appendix D.

### 4.1. Iterative Update via Linearization

Computing the game-theoretic objectives $(Q_{t,n}, P_t)$ requires knowing $(F_t, \ell_{t,n}, \phi_n)$. Although these are well-defined from §3.1, carrying $Q_{t,n}$ or $P_t$ as a stage-varying function is computationally impractical even for a relatively low-dimensional system (Tassa et al., 2012), let alone DNNs. Since the goal is to derive an incremental update given partial (e.g. mini-batch) data at each training iteration, we can instead solve them approximately via linearization. Iterative methods based on linearization have been widely used in real-time OCP (Pan et al., 2015; Tassa et al., 2014). We adopt a similar methodology for its computational efficiency and its algorithmic connection to existing training methods (shown later).

First, consider solving the FNE recursion (6) by $\pi^*_{t,n} \approx \arg\min Q_{t,n}$. We begin by performing a second-order Taylor expansion of $Q_{t,n}$ w.r.t. the variables that are observable to Player $n$ at stage $t$ according to $\eta^{\mathrm C}_{t,n}$:

$$Q_{t,n}(x_t + \delta x_t,\; \theta_{t,n} + \delta\theta_{t,n}) \approx \frac{1}{2}
\begin{bmatrix} 1 \\ \delta x_t \\ \delta\theta_{t,n} \end{bmatrix}^{\!\mathsf T}
\begin{bmatrix}
2\,Q_{t,n} & {Q^{t,n}_x}^{\mathsf T} & {Q^{t,n}_\theta}^{\mathsf T} \\
Q^{t,n}_x & Q^{t,n}_{xx} & {Q^{t,n}_{\theta x}}^{\mathsf T} \\
Q^{t,n}_\theta & Q^{t,n}_{\theta x} & Q^{t,n}_{\theta\theta}
\end{bmatrix}
\begin{bmatrix} 1 \\ \delta x_t \\ \delta\theta_{t,n} \end{bmatrix}.$$

Note that $\delta\theta_{t,-n}$ does not appear in the above quadratic expansion, since it is unobservable under $\eta^{\mathrm C}_{t,n}$. The derivatives of $Q_{t,n}$ w.r.t. its different arguments follow the standard chain rule (recall $Q_{t,n} := \ell_{t,n} + V_{t+1,n} \circ F_t$), with the dynamics linearized at some fixed point $(x_t, \theta_{t,n})$, e.g.

$$Q^{t,n}_\theta = \ell^{t,n}_\theta + {F^t_\theta}^{\mathsf T} V^{t+1,n}_x, \qquad Q^{t,n}_{\theta x} = {F^t_\theta}^{\mathsf T} V^{t+1,n}_{xx} F^t_x.$$

The analytic solution to this quadratic expression is given by $\pi^*_{t,n} = \theta_{t,n} - \delta\pi^*_{t,n}$, with the incremental update $\delta\pi^*_{t,n}$ being

$$\delta\pi^*_{t,n} = k_{t,n} + K_{t,n}\,\delta x_t, \quad \text{where} \quad k_{t,n} := (Q^{t,n}_{\theta\theta})^{\dagger} Q^{t,n}_\theta \quad \text{and} \quad K_{t,n} := (Q^{t,n}_{\theta\theta})^{\dagger} Q^{t,n}_{\theta x} \tag{9}$$

are called the open and feedback gains. The superscript $\dagger$ denotes the pseudo-inverse.
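A compact sketch of the linearized update (9): random placeholder matrices stand in for the chain-rule derivative blocks of $Q_{t,n}$ (e.g. $Q^{t,n}_\theta = \ell^{t,n}_\theta + {F^t_\theta}^{\mathsf T} V^{t+1,n}_x$), and the open/feedback gains are formed with a pseudo-inverse. The dimensions and values are illustrative only.

```python
import numpy as np

# Linearized FNE update (9): delta_pi* = k + K dx, with
#   k = (Q_thth)^+ Q_th   and   K = (Q_thth)^+ Q_thx,
# applied as  theta <- theta - delta_pi*.
rng = np.random.default_rng(0)
p, d = 6, 4                                   # dim(theta_{t,n}), dim(x_t)
Q_th = rng.normal(size=p)                     # Q_theta (placeholder)
A = rng.normal(size=(p, p))
Q_thth = A @ A.T + np.eye(p)                  # Q_theta,theta (symmetric PD placeholder)
Q_thx  = 0.1 * rng.normal(size=(p, d))        # Q_theta,x (placeholder)

k = np.linalg.pinv(Q_thth) @ Q_th             # open gain k_{t,n}
K = np.linalg.pinv(Q_thth) @ Q_thx            # feedback gain K_{t,n}

theta = rng.normal(size=p)                    # current parameters of layer f_{t,n}
dx    = 0.01 * rng.normal(size=d)             # state deviation delta x_t
theta_new = theta - (k + K @ dx)              # pi*_{t,n} = theta_{t,n} - delta_pi*_{t,n}
```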
Note that $\delta\pi^*_{t,n}$ is only locally optimal around the region where the quadratic expansion remains valid. Since $x_t$ augments the preceding hidden states (e.g. $x_t := [z_t, z_{t-1}]^{\mathsf T}$ in Fig. 2), (9) implies that the preceding hidden states contribute to the update via linear superposition. Substituting the incremental update $\delta\pi^*_{t,n}$ back into the FNE recursion (6) yields the local expression of the value function $V_{t,n}$, which is used to compute the preceding update $\delta\pi^*_{t-1,n}$. Since the computation depends on $V_{t,n}$ only through its local derivatives $V^{t,n}_x$ and $V^{t,n}_{xx}$, it is sufficient to propagate these quantities rather than the function itself. The propagation formula is summarized in (10). This procedure (lines 4-7 in Alg. 1) repeats recursively backward from the terminal to the initial stage, similar to Back-propagation.

$$V^{t,n}_x = Q^{t,n}_x - Q^{t,n}_{x\theta}\, k_{t,n}, \quad V^{T,n}_x = \phi^n_x, \qquad
V^{t,n}_{xx} = Q^{t,n}_{xx} - Q^{t,n}_{x\theta}\, K_{t,n}, \quad V^{T,n}_{xx} = \phi^n_{xx}. \tag{10}$$

**Algorithm 1** Dynamic Game Theoretic Neural Optimizer

1. **Input:** dataset $\mathcal D$, network $F \equiv \{F_t : t \in [T]\}$
2. **repeat**
3. Compute $x_t$ by propagating $x_0 \sim \mathcal D$ through $F$
4. **for** $t = T-1$ **to** $0$ **do** (solve FNE or GR)
5. Solve the update $\delta\pi^*_{t,n}$ with (9) or (11)
6. Solve $(V^{t,n}_x, V^{t,n}_{xx})$ or $(W^t_x, W^t_{xx})$ with (10) or (25)
7. **end for**
8. Set $x'_0 = x_0$
9. **for** $t = 0$ **to** $T-1$ **do** (update parameters)
10. Apply $\theta_{t,n} \leftarrow \theta_{t,n} - \delta\pi^*_{t,n}(\delta x_t)$ with $\delta x_t = x'_t - x_t$
11. Compute $x'_{t+1} = F_t(x'_t, \theta_{t,1}, \dots, \theta_{t,N})$
12. **end for**
13. **until** convergence

The derivation for the CG follows similar steps, except that we solve the GR recursion (7) by $\pi^*_{t,n} \approx \arg\min P_t$. Since all players' actions are now observable under $\eta^{\text{C-CG}}_{t,n}$, we need to expand $P_t$ w.r.t. all arguments. For notational simplicity, denote $u \equiv \theta_{t,1}$ and $v \equiv \theta_{t,2}$ in Fig. 2. In the case where each player minimizes $P_t$ independently without knowing the other, the non-cooperative update for Player 2 admits the form² $\delta v_t = I_t + L_t\,\delta x_t$. The locally-optimal cooperative update for Player 1 can then be written as

$$\delta\pi^*_{t,1} = \tilde k_t + \tilde K_t\,\delta x_t, \quad \text{where} \quad
\tilde k_t := \tilde P^{t\,\dagger}_{uu}\big(P^t_u - P^t_{uv}\, I_t\big), \qquad
\tilde K_t := \tilde P^{t\,\dagger}_{uu}\big(P^t_{ux} - P^t_{uv}\, L_t\big). \tag{11}$$

Similar equations can be derived for Player 2. We refer to $\tilde P^t_{uu} := P^t_{uu} - P^t_{uv}\, P^{t\,\dagger}_{vv}\, P^t_{vu}$ as the cooperative precondition. The update (11), despite seeming complex, exhibits intriguing properties. For one, notice that computing the cooperative open gain $\tilde k_t$ for Player 1 involves the non-cooperative open gain $I_t$ of Player 2. In other words, each player adjusts its strategy after knowing its companion's action. A similar interpretation can be drawn for the feedback gains $\tilde K_t$ and $L_t$. The propagation of $(W^t_x, W^t_{xx})$ follows similarly to (10) once all players' updates are computed; we leave the complete formula to (25) (see Appendix D.1), since it is rather tedious.

Let us now discuss the role of $\delta x_t$ and how to compute it. Conceptually, $\delta x_t$ can be any deviation away from the fixed point $x_t$ at which we expand the objectives $Q_{t,n}$ or $P_t$. In MPDG applications, it is typically set to the state difference obtained when the parameter updates are applied up to stage $t$, i.e. $\delta x_t := (F_{t-1} \circ \cdots \circ F_0)(x_0, \{\theta - \delta\pi^*\}) - x_t$, which is exactly how $\delta x_t$ is computed in lines 8-11 of Alg. 1.
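The following numerical sketch instantiates the cooperative update (11) for one stage of the two-player block in Fig. 2 ($u \equiv \theta_{t,1}$, $v \equiv \theta_{t,2}$). The $P$-blocks are random symmetric placeholders standing in for the derivatives of the group-rational objective $P_t$, so the dimensions and values are purely illustrative.

```python
import numpy as np

# Cooperative (GR) update (11): Player 1's gains use Player 2's
# non-cooperative gains (I_t, L_t) and the cooperative precondition
#   P~_uu = P_uu - P_uv P_vv^+ P_vu.
rng = np.random.default_rng(0)
pu, pv, d = 5, 5, 4
P_u,  P_v  = rng.normal(size=pu), rng.normal(size=pv)
P_ux, P_vx = 0.1 * rng.normal(size=(pu, d)), 0.1 * rng.normal(size=(pv, d))
P_uv = 0.1 * rng.normal(size=(pu, pv)); P_vu = P_uv.T
A = rng.normal(size=(pu, pu)); P_uu = A @ A.T + np.eye(pu)
B = rng.normal(size=(pv, pv)); P_vv = B @ B.T + np.eye(pv)

# Player 2's non-cooperative update, ignoring Player 1:  delta_v = I_t + L_t dx
I_t = np.linalg.pinv(P_vv) @ P_v
L_t = np.linalg.pinv(P_vv) @ P_vx

# Player 1's cooperative gains (eq. 11) with the cooperative precondition.
P_uu_coop = P_uu - P_uv @ np.linalg.pinv(P_vv) @ P_vu
k_coop = np.linalg.pinv(P_uu_coop) @ (P_u  - P_uv @ I_t)
K_coop = np.linalg.pinv(P_uu_coop) @ (P_ux - P_uv @ L_t)

dx = 0.01 * rng.normal(size=d)
u  = rng.normal(size=pu)
u_new = u - (k_coop + K_coop @ dx)    # apply theta_{t,1} <- theta_{t,1} - delta_pi*_{t,1}
```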