# A Robust Differential Neural ODE Optimizer

Published as a conference paper at ICLR 2024

Panagiotis Theodoropoulos, Guan-Horng Liu, Tianrong Chen, Augustinos D. Saravanos, Evangelos A. Theodorou
Georgia Institute of Technology, USA
{ptheodor3, ghliu, tianrong.chen, asaravanos3, evangelos.theodorou}@gatech.edu

ABSTRACT

Neural networks and neural ODEs tend to be vulnerable to adversarial attacks, rendering robust optimizers critical to curb the success of such attacks. In this regard, the key insight of this work is to interpret Neural ODE optimization as a min-max optimal control problem. More particularly, we present the Game Theoretic Second-Order Neural Optimizer (GTSONO), a robust game theoretic optimizer based on the principles of min-max Differential Dynamic Programming. The proposed method exhibits significant computational benefits due to efficient matrix decompositions and provides convergence guarantees to local saddle points. Empirically, the robustness of the proposed optimizer is demonstrated through greater robust accuracy compared to benchmark optimizers when trained on clean images. We also illustrate its ability to improve performance when adapted to an already existing adversarial defense technique. Finally, the superiority of the proposed update law over its gradient-based counterpart highlights the potential benefits of incorporating robust optimal control paradigms into adversarial training methods.

1 INTRODUCTION

Despite the remarkable performance achieved by deep-learning-based models on diverse challenges, from image classification to speech recognition, recent findings indicate that deep neural networks tend to be fragile to adversarial examples (Liu et al. (2020)). The problem of generating adversarial examples, and conversely of training models that are not easily misguided by such samples, has gathered significant attention (Carrara et al. (2019)). However, the development of robust models has proven to be more challenging (Tsipras et al. (2018)), rendering adversarial robustness a central object of study in machine learning (Carlini et al. (2019)). This study has led to significant advances in understanding the generalization and interpretability behind the learning capacity of these models (Huang et al. (2022)).

Traditionally, the most widely used approach to building adversarially robust models has been adversarial training, in which the classifier is trained on both adversarial and clean samples, aiming to generalize well in both instances (Madry et al. (2017); Wong & Kolter (2018); Raghunathan et al. (2018); Goodfellow et al. (2014)). The goal of adversarial training is to find the saddle point that solves the following optimization problem

$$\min_{\theta} \; \mathbb{E}\Big[ \max_{\delta \in S} \; L(x + \delta, y; \theta) \Big] \tag{1}$$

where $x \in \mathbb{R}^m$ is the state vector and $u \in \mathbb{R}^n$ is the control, the perturbation term $\delta$ is bounded by $\sup \|\delta\|_2 \le \beta_2$ and belongs to the set $S$ of admissible perturbations, and $L(x + \delta, y; \theta)$ is the loss function. Although adversarial training is effective, it has drawbacks, such as a considerable increase in training time.
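To make Eq. (1) concrete, the following is a minimal sketch of adversarial training with an $\ell_\infty$ PGD inner maximization; it assumes a differentiable PyTorch classifier `model`, and the hyperparameters `eps`, `alpha`, and `steps` are illustrative rather than values used in this paper.

```python
import torch
import torch.nn.functional as F

def pgd_perturb(model, x, y, eps=0.03, alpha=0.007, steps=20):
    """Inner maximization of Eq. (1): l_inf PGD on the (standardized) input x."""
    delta = torch.zeros_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        # ascend the loss, then project back onto the l_inf ball of radius eps
        # (no [0, 1] clipping here: perturbations act on standardized inputs)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return delta.detach()

def adversarial_training_step(model, optimizer, x, y):
    """Outer minimization of Eq. (1): update theta on the perturbed batch."""
    delta = pgd_perturb(model, x, y)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x + delta), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

The inner loop approximates the maximization over $\delta \in S$, while the outer step performs the minimization over $\theta$.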
In the work of Poznyak et al. (2023), it was suggested that min-max robust control can be viewed as a neural dynamic programming approach using continuous differential neural networks. In that work, the perturbations $\delta$ were recast so as to yield the same Hamilton-Jacobi-Bellman equation as in game theoretic Differential Dynamic Programming (DDP) (Jacobson & Mayne (1970); Sun et al. (2018)). In the realm of Optimal Control (OC), it has been demonstrated that the game-theoretic formulation enables a model to handle uncertainties and external disturbances (Sun et al. (2015)). This algorithm has been successfully tested in real-life robotic applications, producing robust control policies for robotic systems operating under unknown external disturbances (Sun et al. (2018); Morimoto et al. (2003)). This paves the way to draw connections between adversarial defense methods and robust optimal control, and to design principled algorithms that demonstrate the capacity to handle disturbances.

In this work, drawing inspiration from the connection between training neural networks and continuous-time OCPs (Liu et al. (2021a;b)), we introduce a novel second-order Game Theoretic Neural Optimizer by recasting the training of Neural ODEs as a trajectory optimization problem through this min-max OCP paradigm. Our approach differs considerably from conventional min-max methods for adversarial learning. While most min-max adversarial training methods consider an adversary acting on the input, our proposed method instead considers the disturbance being injected through the antagonizing control. The resulting framework, called GTSONO, is designed based on the continuous-time min-max DDP algorithm, engendering an innately robust neural optimizer. This optimizer is capable of handling uncertainties and finding a control policy, or equivalently a weight configuration, that robustifies the network and enables it to resist adversarial attacks. Our analysis provides theoretical guarantees that our algorithm converges to a saddle point, and our experiments empirically verify that it converges to a saddle point even when trained with clean images, which translates into increased robustness against adversarial attacks even with natural training. Our contributions can be summarized in the following points:

- We propose a novel game theoretic optimizer for training robust neural ODEs based on the principles of min-max DDP, reformulating the input adversary as an antagonizing control that models system disturbances.
- An efficient computational framework is presented, based on decomposing matrices into vectors which can be easily obtained by solving an ODE.
- We show that the proposed optimizer is guaranteed to converge to stable local saddle points under mild assumptions.
- We empirically showcase that GTSONO increases robustness and generates more confident predictions compared to benchmark optimizers under natural training.
- We demonstrate that GTSONO can be successfully adapted to existing adversarial training methods, enhancing their robustness. The second-order convergence of min-max DDP enables our optimizer to converge in fewer epochs and lower wall-clock time, which highlights its computational efficiency.

2 RELATED WORKS

To survey the pertinent literature, we review advances on both fronts.

Adversarial Attacks. A number of adversarial attack methods have been developed, such as optimization-based attacks (Szegedy et al. (2013)), gradient-based attacks (Moosavi-Dezfooli et al., 2016; Xie et al., 2019; Dong et al., 2018), and attacks generated by networks (Baluja & Fischer, 2017; Zhao et al., 2017).
Traditionally, the most widely known gradient-based methods are the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2014) and Projected Gradient Descent (PGD) (Madry et al., 2017). The $\ell_\infty$ PGD attack undermines the predictions of machine learning models by iteratively perturbing the input data in the direction of the gradient, while constraining the perturbations to remain within a predefined projection set determined by the parameter $\epsilon$. Conversely, FGSM is a more efficient gradient-based attack, which operates by taking a single step in the direction of the gradient of the loss function with respect to the input (a sketch of this attack is given at the end of this section). Therefore, the attack efficiency of FGSM is higher at the expense of its success rate, compared to the iterative PGD. Compared with other algorithms, network-generated adversarial examples have been observed to be deceptive; however, their efficiency is deemed low, since they require multiple rounds of iterations to find the optimal solution (Liang et al., 2022). Optimization-based attacks, e.g., Carlini & Wagner (CW; Carlini & Wagner (2017)), have been found to be more effective against certain defensive schemes, such as defensive distillation (Liang et al., 2022) and masking (Huang et al., 2022).

Adversarial Defense. To deal with the threat of adversarial attacks, researchers try to improve a model's resistance by developing defense methods, so that the model can make correct predictions on adversarial samples (Wang et al., 2022). Examples of such defense methods are defensive distillation (Papernot et al., 2016), gradient regularization (Papernot et al., 2017; Ross & Doshi-Velez, 2018), and adversarial training (Tramèr et al., 2017; Zhang et al., 2019a; Zheng et al., 2020), which is traditionally the most successful. Most notably, this technique attempts to improve the robustness of a neural network by training it with adversarial samples, e.g., generated by FGSM or PGD. However, the issue with this defense mechanism is its computational overhead. To address this problem, Adversarial Training for Free (Free AT) (Shafahi et al., 2019) was proposed. Recently, more methods have been developed, such as masking and denoising. Many attack methods rely on the gradient information of the victim model; gradient masking/obfuscation defense methods therefore try to defend by hiding the gradient information (Naseer et al., 2019). Moreover, Huang et al. (2022) claimed that adaptive-stepsize numerical ODE solvers, such as DOPRI5, have a gradient masking effect that fails PGD attacks, which depend on gradient information. This implies that continuous neural models possess innate robustness compared to discrete neural networks. However, this method cannot fool gradient-free attacks such as CW and SPSA. On the other hand, since adversarial examples are generated by adding specific noise, image denoising is another technique to enable models to resist adversarial attacks. Ordinary denoisers suffer from low performance due to the error amplification effect. In this regard, Liao et al. (2018) proposed a high-level representation guided denoiser (HGD) that suppresses the amplification of errors.
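As referenced above, a minimal sketch of the single-step FGSM attack; it assumes a differentiable PyTorch classifier `model` and an illustrative step size `alpha`, not a value prescribed by this paper.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, alpha=0.03):
    """Single-step FGSM: move the input by alpha in the sign of the loss gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    return (x + alpha * grad.sign()).detach()
```

Unlike the iterative PGD attack, no projection step is needed here, since only a single step of size `alpha` is taken.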
3 METHODOLOGY

3.1 TRAINING NEURAL ODES USING GAME THEORETIC OPTIMAL CONTROL THEORY

First, Neural ODEs (Chen et al. (2018)) generally concern the following optimization over an objective function $L$:

$$\min_{\theta} \; L(x(t_f)), \qquad \text{where } \frac{dx}{dt} = F(t, x(t), \theta), \quad x(t_0) = x_0 \tag{2}$$

where $x(t) \in \mathbb{R}^m$, and $F(\cdot, \cdot, \theta)$ is a deep neural network parameterized by $\theta \in \mathbb{R}^n$. In the proposed methodology, Eq. (2) is recast as a trajectory optimization problem through the perspective of min-max OC, by introducing a second set of antagonizing controls (i.e., weights in the DNN) that attempt to maximize the loss function. This second set of parameters represents the existence of disturbances, providing the robustness of our model. Finally, for this problem we obtain:

$$\min_{u} \max_{v} \; \Phi(x_{t_f}) + \int_{t_0}^{t_f} \ell(t, x, u, v)\, dt, \qquad \text{s.t. } \frac{dx}{dt} = F(t, x, u, v),\; x(t_0) = x_0, \quad \frac{du}{dt} = 0,\; u(t_0) = \theta, \quad \frac{dv}{dt} = 0,\; v(t_0) = \eta \tag{3}$$

Figure 1: Node-level and layer-level architecture of the game theoretic layer.

where $x \equiv x(t) \in \mathbb{R}^m$, $u \equiv u(t) \in \mathbb{R}^n$, $v \equiv v(t) \in \mathbb{R}^n$, and $F(\cdot, \cdot, u, v)$ characterizes the vector field and is parameterized by a DNN with the conflicting sets of weights $(u, v)$. Moreover, it is clear that (3) describes (2) without loss of generality by taking $(\Phi, \ell) = (L, 0)$. The functions $\Phi$ and $\ell$ are known as the terminal and running cost in the context of OCP, whereas in the DNN setting they correspond to the loss function and the weight decay, respectively. Note that for this problem we consider symmetric layer-wise dynamics with respect to $u, v$. More explicitly, if a layer has a maximization counterpart, it has the same number of $u$ and $v$ parameters, as illustrated in the node-level view of Figure 1.

Assumption 3.1. The running cost $\ell$ is assumed to be quadratic with respect to the controls; more specifically, $\ell \equiv \ell(u, v)$, with $\ell_{uu} = R_u I_n$, $\ell_{vv} = -R_v I_n$ and $\ell_{uv} = 0$, where $I_n \in \mathbb{R}^{n \times n}$ is the identity matrix and $R_u$, $R_v$ are scalars with $|R_v| > |R_u|$.

This assumption will enable us to write the Hessian in a form that is easily factorizable using Kronecker properties. The intuition for selecting $|R_v| > |R_u|$ is that we wish to penalize the disturbance more severely and prevent instability. The time-invariant ODEs imposed on $u_t$ and $v_t$ make the ODE of $x_t$ equivalent to (2), and also prompt us to interpret Problem (3) as the search for an optimal initial condition of the time-invariant controls $u_t$ and $v_t$. Next, we define the accumulated loss $Q(t, x, u, v)$:

$$Q(t, x, u, v) = \Phi(x(t_f)) + \int_{t}^{t_f} \ell(\tau, x(\tau), u(\tau), v(\tau))\, d\tau \tag{4}$$

which implies that $0 = \ell(t, x, u, v) + \frac{dQ(t, x, u, v)}{dt}$. At this point our goal is to compute higher-order derivatives with respect to $u, v$. Note that under our OCP formulation the first-order derivative of $L$ with respect to $\theta$ (or $\eta$) is equivalent to $Q_u$ (or $Q_v$) evaluated at $t_0$, since $Q(t_0, x_{t_0}, u_{t_0}, v_{t_0})$ is precisely the loss accumulated over $[t_0, t_f]$. This explains our previous remark that, in essence, we search for an optimal initial condition for $u$ and $v$. Now, we expand the equation above along a nominal path to derive the ODEs that yield the second-order derivatives at $t_0$.
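Before deriving the backward ODEs, the following sketch illustrates the forward pass of the augmented system in (3): a vector field parameterized by two antagonizing weight sets, integrated while the controls are held constant in time. The module `GameVectorField` and the fixed-step Euler integrator are illustrative stand-ins, not this paper's implementation.

```python
import torch
import torch.nn as nn

class GameVectorField(nn.Module):
    """F(t, x, u, v): vector field parameterized by two antagonizing weight sets.

    Layer-wise symmetric dynamics: u and v have identical shapes, and the
    pre-activation combines both weight sets, so dh/dx = W_u + W_v.
    """
    def __init__(self, dim):
        super().__init__()
        self.u = nn.Linear(dim, dim)   # minimizing weights u
        self.v = nn.Linear(dim, dim)   # maximizing (antagonizing) weights v

    def forward(self, t, x):
        return torch.tanh(self.u(x) + self.v(x))

def odesolve_euler(field, x0, t0=0.0, tf=1.0, steps=20):
    """Fixed-step Euler integration of dx/dt = F(t, x, u, v).

    du/dt = 0 and dv/dt = 0 hold automatically, since the weights are constant
    over the horizon; only their initial conditions (theta, eta) are trained.
    """
    x, dt = x0, (tf - t0) / steps
    for k in range(steps):
        x = x + dt * field(t0 + k * dt, x)
    return x
```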
Theorem 3.2. Consider Assumption 3.1 and a nominal trajectory $(\bar{x}, \bar{u}, \bar{v})$ that satisfies the ODEs in the constraints of (3). Then the second-order derivatives of $Q$, expanded locally around the nominal trajectory, obey the following backward ODEs:

$$\frac{dQ_{xx}}{dt} = -\big(F_x^\top Q_{xx} + Q_{xx} F_x\big), \quad \frac{dQ_{xu}}{dt} = -\big(F_x^\top Q_{xu} + Q_{xx} F_u\big), \quad \frac{dQ_{xv}}{dt} = -\big(F_x^\top Q_{xv} + Q_{xx} F_v\big),$$
$$\frac{dQ_{uu}}{dt} = -\big(\ell_{uu} + F_u^\top Q_{xu} + Q_{ux} F_u\big), \quad \frac{dQ_{uv}}{dt} = -\big(F_u^\top Q_{xv} + Q_{ux} F_v\big), \quad \frac{dQ_{vv}}{dt} = -\big(\ell_{vv} + F_v^\top Q_{xv} + Q_{vx} F_v\big),$$
$$\frac{dQ_{u}}{dt} = -\big(F_u^\top Q_{x} + \ell_u\big), \qquad \frac{dQ_{v}}{dt} = -\big(F_v^\top Q_{x} + \ell_v\big) \tag{5}$$

The terminal conditions are given by $Q_x(t_f) = \Phi_x$, $Q_{xx}(t_f) = \Phi_{xx}$, $Q_u(t_f) = Q_v(t_f) = 0_{n \times 1}$, $Q_{uu}(t_f) = Q_{vv}(t_f) = Q_{uv}(t_f) = 0_{n \times n}$, and $Q_{xu}(t_f) = Q_{xv}(t_f) = 0_{m \times n}$.

Remark 3.3. The proof of this theorem is deferred to Appendix A.1; it follows the standard derivation of continuous-time DDP, considering linearized ODE dynamics along the nominal trajectory. This theorem implies that the proposed framework is based on min-max DDP, which has the favorable property of second-order convergence and can obtain first- and second-order derivatives with a single pass of an ODE solver, without any recursive computations. This contributes to avoiding accumulated integration errors and increased runtimes.

3.2 SECOND-ORDER MATRIX FACTORIZATION

The second-order nature of our optimizer renders the efficient manipulation of second-order terms a critical component of our study, as the number of parameters in neural networks and neural ODEs easily grows to an unfavorable size.

Theorem 3.4. Assume that $Q_{xx}(t_f)$ is a symmetric matrix of rank $R \le m$, so that it can be represented as $Q_{xx}(t_f) = \sum_{i=1}^{R} y_i y_i^\top$, where $y_i \in \mathbb{R}^m$. Then, for all $t \in [t_0, t_f]$, the second-order matrices in (5) that contain derivatives with respect to the state can be decomposed as

$$Q_{xx}(t) = \sum_{i=1}^{R} q_i(t) q_i(t)^\top, \quad Q_{xu}(t) = \sum_{i=1}^{R} q_i(t) p_i(t)^\top, \quad Q_{xv}(t) = \sum_{i=1}^{R} q_i(t) s_i(t)^\top \tag{6}$$

where the vectors $q_i \in \mathbb{R}^m$ and $p_i, s_i \in \mathbb{R}^n$ obey the following backward ODEs:

$$\frac{dq_i(t)}{dt} = -F_x^\top q_i(t), \quad \frac{dp_i(t)}{dt} = -F_u^\top q_i(t), \quad \frac{ds_i(t)}{dt} = -F_v^\top q_i(t) \tag{7}$$

with terminal conditions $(q_i(t_f), p_i(t_f), s_i(t_f)) = (y_i, 0, 0)$. Additionally, the second-order matrices $Q_{uu}$, $Q_{uv}$, and $Q_{vv}$ can be represented as

$$Q_{uu}(t) = r(t) + \sum_{i=1}^{R} p_i(t) p_i(t)^\top, \quad Q_{vv}(t) = m(t) + \sum_{i=1}^{R} s_i(t) s_i(t)^\top, \quad Q_{uv}(t) = \sum_{i=1}^{R} p_i(t) s_i(t)^\top \tag{8}$$

where $r(t) \triangleq \ell_{uu}\,(t_f - t)$ and $m(t) \triangleq \ell_{vv}\,(t_f - t)$.

This theorem states that the coupled matrices back-propagated through (5) can be decomposed into a set of vectors. For this reason, the low-rank matrices observed in many Neural ODE applications alleviate the computational load and provide great memory efficiency (Chen et al., 2018). Substituting the ODEs from Eq. (7) into the expression in Eq. (8) yields

$$Q_{uu}(t) = R_u I\,(t_f - t) + \sum_{i=1}^{R} \Big(\int_{t}^{t_f} F_u^\top q_i\, d\tau\Big)\Big(\int_{t}^{t_f} F_u^\top q_i\, d\tau\Big)^\top \tag{9}$$

Similar expressions for the other matrices are given in Appendix A.2. To avoid dimensionality issues and tensor representations, the integrations in Eq. (9) are separated per layer $j$ of the network. We denote by $u_j$ and $v_j$ the parameters of layer $j$, and consider the pre-activation vector $h_j(t)$ as an affine combination of the weights with the input $z_j$ to the $j$-th layer. Recall that we consider symmetric layer-wise dynamics, implying $F_{u_j} = F_{v_j}$, and in turn $\partial h_j / \partial u = \partial h_j / \partial v = z_j$ and $\partial h_j / \partial x = (u + v)$. This implies that $F_{u_j}^\top q_i = z_j \otimes \big(\tfrac{\partial F}{\partial h_j}^\top q_i\big)$ for feed-forward layers and $F_{u_j}^\top q_i = z_j \,\hat{*}\, \big(\tfrac{\partial F}{\partial h_j}^\top q_i\big)$ for convolution layers, where $\otimes$ denotes the Kronecker product and $\hat{*}$ denotes the de-convolution operator.
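The backward ODEs in (7) only require vector-Jacobian products with $F$, so they can be integrated without forming any Jacobian explicitly. Below is a minimal sketch of one backward Euler step, assuming the illustrative `GameVectorField` module from the earlier sketch; the paper instead integrates these quantities with a standard ODE solver.

```python
import torch

def backward_vector_step(field, t, x, q, dt):
    """One backward Euler step of Eq. (7): dq/dt = -F_x^T q, dp/dt = -F_u^T q, ds/dt = -F_v^T q.

    Vector-Jacobian products are obtained by differentiating <F(t, x), q>.
    Returns the updated q and the increments to p and s (flattened per weight set).
    """
    x = x.detach().requires_grad_(True)
    u_params = list(field.u.parameters())
    v_params = list(field.v.parameters())
    inner = (field(t, x) * q).sum()                    # <F(t, x, u, v), q>
    gx, *rest = torch.autograd.grad(inner, [x] + u_params + v_params)
    gu, gv = rest[:len(u_params)], rest[len(u_params):]
    q_new = q + dt * gx                                # stepping backward in time
    dp = dt * torch.cat([g.flatten() for g in gu])     # accumulates p_i(t) = int_t^{tf} F_u^T q_i
    ds = dt * torch.cat([g.flatten() for g in gv])     # accumulates s_i(t) = int_t^{tf} F_v^T q_i
    return q_new, dp, ds
```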
Finally, assuming that $z_j(t)$ and $\tfrac{\partial F}{\partial h_j}$ are (i) uncorrelated across time and (ii) pair-wise independent, we can express the layer-wise preconditioning matrix $Q_{u_j u_j}$ at time $t = t_0$ as

$$Q_{u_j u_j}(t_0) \approx R_u I\,(t_f - t_0) + \underbrace{\Big(\int_{t_0}^{t_f} z_j z_j^\top\, dt\Big)}_{A_j(t_0)} \otimes\; B_j(t_0) \tag{10}$$

where $B_j(t_0)$ is the analogous integral of outer products of the back-propagated vectors $\tfrac{\partial F}{\partial h_j}^\top q_i$ (see Appendix A.4). Similarly, for the other matrices, $Q_{v_j v_j}(t_0) \approx A_j(t_0) \otimes B_j(t_0) - R_v I\,(t_f - t_0)$ and $Q_{u_j v_j}(t_0) \approx A_j(t_0) \otimes B_j(t_0)$. Note that for the rest of this paper we drop the subscript denoting the $j$-th layer; the analysis still concerns an individual layer but extends easily to the entire architecture. The detailed derivation, along with a discussion of these assumptions, is given in Appendix A.4.

3.3 EFFICIENT DECOMPOSITION OF UPDATE LAWS

Update law from min-max DDP. Recall that Neural ODE analysis and OCP principles are deeply intertwined. This motivates us to view the optimization of Neural ODEs as a trajectory optimization task. Consider the closed-loop min-max DDP update law

$$\begin{bmatrix} \delta u_t \\ \delta v_t \end{bmatrix} = \begin{bmatrix} l_u \\ l_v \end{bmatrix} + \begin{bmatrix} K_u \\ K_v \end{bmatrix} \delta x_t \tag{11}$$

where the feed-forward gains $l_u$, $l_v$ and the feedback gains $K_u$, $K_v$ are given by

$$l_u = \tilde{Q}_{uu}^{-1}\big(Q_{uv} Q_{vv}^{-1} Q_v - Q_u\big), \qquad K_u = -\tilde{Q}_{uu}^{-1}\big(Q_{ux} - Q_{uv} Q_{vv}^{-1} Q_{vx}\big),$$
$$l_v = \tilde{Q}_{vv}^{-1}\big(Q_{vu} Q_{uu}^{-1} Q_u - Q_v\big), \qquad K_v = -\tilde{Q}_{vv}^{-1}\big(Q_{vx} - Q_{vu} Q_{uu}^{-1} Q_{ux}\big) \tag{12}$$

with $\tilde{Q}_{uu} = Q_{uu} - Q_{uv} Q_{vv}^{-1} Q_{vu}$ and $\tilde{Q}_{vv} = Q_{vv} - Q_{vu} Q_{uu}^{-1} Q_{uv}$. A detailed derivation of this update law is given in Appendix B.2. Setting $\delta x_t = 0$ yields the open-loop update scheme for min-max DDP, which is the focal point of this study. Notice that for $Q_{uv} = 0$ our method reduces to Gradient Descent Ascent (GDA; Jin et al., 2020; Lin et al., 2020) with Hessian preconditioning, implying that open-loop min-max DDP is essentially a generalization of GDA. In Appendix A.6, we also provide tractable expressions for the terms in the closed-loop update rule.

Remark 3.5. To generate tractable expressions for the terms in Eq. (11), we leverage two basic properties of Kronecker products:
1. $(A \otimes B + \lambda I)^{-1} = (U_A \otimes U_B)(\Sigma_A \otimes \Sigma_B + \lambda I)^{-1}(U_A \otimes U_B)^\top$,
2. $(A \otimes B)\,\mathrm{vec}(X) = \mathrm{vec}(B X A^\top)$.

Proposition 3.6. From the eigenvalue decompositions of the Kronecker-product terms in $\tilde{Q}_{uu}$ and $\tilde{Q}_{vv}$, we derive the following equivalent expressions for their inverses:

$$\tilde{Q}_{uu}^{-1} = (U_A \otimes U_B)\big(S_u - S_{uv} S_v^{-1} S_{uv}\big)^{-1}(U_A \otimes U_B)^\top, \qquad \tilde{Q}_{vv}^{-1} = (U_A \otimes U_B)\big(S_v - S_{uv} S_u^{-1} S_{uv}\big)^{-1}(U_A \otimes U_B)^\top \tag{13}$$

where $S_u$, $S_v$, $S_{uv}$ are the diagonal matrices containing the singular values of the matrix terms in $\tilde{Q}_{uu}$, $\tilde{Q}_{vv}$.

Proposition 3.7. Following Proposition 3.6 and leveraging the second property in Remark 3.5, we can rewrite the feed-forward gains in (12) more compactly as

$$l_u = \mathrm{vec}(U_B G_{uv} U_A^\top) - \mathrm{vec}(U_B G_{uu} U_A^\top), \qquad l_v = \mathrm{vec}(U_B G_{vu} U_A^\top) - \mathrm{vec}(U_B G_{vv} U_A^\top) \tag{14}$$

where $G_{uv}$, $G_{uu}$, $G_{vu}$, $G_{vv}$ are matrix terms defined in Appendix A.6.

Detailed proofs of Propositions 3.6 and 3.7 are given in Appendices A.5 and A.6, respectively. Algorithm 1 summarizes our game theoretic optimizer.

Algorithm 1 GTSONO
1: Input: dataset $\mathcal{D}$, parameterized vector field $F(\cdot, \cdot, u, v)$, integration time $[t_0, t_f]$, ODE solver ODESolve, learning rate $\eta$, time step $\Delta t$, Tikhonov regularization constants $R_u$, $R_v$
2: repeat
3: &nbsp;&nbsp; $x(t_f) = \mathrm{ODESolve}(x(t_0), t_0, t_f, F)$, where $x(t_0) \sim \mathcal{D}$
4: &nbsp;&nbsp; Initialize optimizer
5: &nbsp;&nbsp; for $t_i$ in $\{t_f, t_f - \Delta t, \dots, t_0 + \Delta t, t_0\}$ do
6: &nbsp;&nbsp;&nbsp;&nbsp; $[x(t_{i-1}), Q_u(t_{i-1}), Q_v(t_{i-1}), q_i(t_{i-1})] = \mathrm{ODESolve}([x(t_i), Q_u(t_i), Q_v(t_i), q_i(t_i)], t_{i-1}, t_i, F)$
7: &nbsp;&nbsp;&nbsp;&nbsp; For each layer $j$, evaluate $A_j(t_i)$ and $B_j(t_i)$
8: &nbsp;&nbsp; end for
9: &nbsp;&nbsp; $A_j(t_0) = \sum_{t_i} A_j(t_i)\, \Delta t$, $B_j(t_0) = \sum_{t_i} B_j(t_i)\, \Delta t$
10: &nbsp;&nbsp; Compute $l_u$, $l_v$ from Eq. (14)
11: &nbsp;&nbsp; Update controls: $u \leftarrow u + \eta\, l_u$, $v \leftarrow v + \eta\, l_v$
12: until convergence
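The computational core of Eqs. (13)-(14) is applying the inverse of a Kronecker-factored matrix such as $A \otimes B + \lambda I$ to a vectorized gradient without ever forming the full matrix. Below is a minimal, self-contained sketch of that primitive using the two properties listed in Remark 3.5; the shapes and the test are illustrative and not taken from the paper's implementation.

```python
import torch

def kron_factored_solve(A, B, lam, G):
    """Compute vec^{-1}((A kron B + lam*I)^{-1} vec(G)) without forming A kron B.

    Property 1 of Remark 3.5 diagonalizes A kron B via the eigendecompositions
    A = U_A S_A U_A^T and B = U_B S_B U_B^T; property 2 lets us apply
    (U_A kron U_B) to vec(X) as U_B X U_A^T. Here vec() stacks columns and
    G has shape (dim_B, dim_A).
    """
    S_A, U_A = torch.linalg.eigh(A)
    S_B, U_B = torch.linalg.eigh(B)
    X = U_B.T @ G @ U_A                          # (U_A kron U_B)^T vec(G)
    X = X / (torch.outer(S_B, S_A) + lam)        # divide by eigenvalues of A kron B + lam*I
    return U_B @ X @ U_A.T                       # (U_A kron U_B) vec(X)

# toy check against the explicit Kronecker inverse
A = torch.randn(3, 3, dtype=torch.float64); A = A @ A.T   # symmetric PSD stand-in for A_j
B = torch.randn(4, 4, dtype=torch.float64); B = B @ B.T   # symmetric PSD stand-in for B_j
G = torch.randn(4, 3, dtype=torch.float64)                 # matricized gradient
lam = 0.1
explicit = torch.linalg.solve(torch.kron(A, B) + lam * torch.eye(12, dtype=torch.float64),
                              G.T.reshape(-1))
assert torch.allclose(kron_factored_solve(A, B, lam, G).T.reshape(-1), explicit)
```

In GTSONO the analogous operation is applied layer-wise, with $A$ and $B$ playing the role of the factors $A_j(t_0)$ and $B_j(t_0)$ from Eq. (10).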
Figure 2: GTSONO overview, showing the forward dynamics F and the backward dynamics.

4 CONVERGENCE

In sequential two-player zero-sum games, a global minimax point always exists, even if $f$ is nonconvex-nonconcave, due to the extreme-value theorem (Jin et al. (2020)). Conversely, given that the updates of the proposed algorithm take place simultaneously, and due to the non-convexity of problem (3), finding a global saddle point is NP-hard; we therefore settle for local saddle points. However, even that task is not straightforward, since in the simultaneous min-max setting GDA methods may display undesirable behaviours, such as convergence to a non-critical point (Daskalakis & Panageas, 2018) or to an unstable local saddle point (Wang et al., 2019). In this regard, we provide local convergence guarantees for our optimizer to a locally stable saddle point.

For convenience in handling the variables, we further define the vector $z = [u; v] \in \mathbb{R}^{2n}$, along with the function $G: \mathbb{R}^{2n} \to \mathbb{R}^{2n}$, $G(z) = [\nabla_u Q(u, v); \nabla_v Q(u, v)]$. Accordingly, the Jacobian of $G$ is given by

$$J_G(z) = \begin{bmatrix} Q_{uu} & Q_{uv} \\ Q_{vu} & Q_{vv} \end{bmatrix} \in \mathbb{R}^{2n \times 2n}.$$

Notice that after computing the inverse of $J_G(z)$ using the Schur complement (Liu et al., 2021b), the preconditioned update rule can be rewritten as $z_{k+1} = z_k - \eta\, J_G^{-1}(z_k)\, G(z_k)$, which recovers the update rule of open-loop min-max DDP shown in Eq. (11) for $\delta x_t = 0$. Let us start with the definition and sufficient conditions of a local saddle point.

Definition 4.1. Let $K_\gamma = \{z \mid \|z - z^*\| \le \gamma\}$ be a neighborhood around a fixed point $z^* = [u^*; v^*]$. Then $z^*$ is a local saddle point if $f(u^*, v) \le f(u^*, v^*) \le f(u, v^*)$ for all $u, v \in K_\gamma$.

Assumption 4.2. Let $z^* = [u^*; v^*]$ be a stationary point. Recall that we have $Q_{uu} = A \otimes B + R_u I$ and $Q_{vv} = A \otimes B - R_v I$, for $R_u, R_v > 0$. Assuming w.l.o.g. that $\lambda_{\min}(A \otimes B) \ge m$ and $\lambda_{\max}(A \otimes B) \le M$ for some $m, M > 0$, there exist appropriate $R_u > 0$ and $R_v > 0$ large enough such that $\lambda_{\min}(Q_{uu}) > 0$ and $\lambda_{\max}(Q_{vv}) < 0$, ensuring that $z^*$ is a local saddle point.

Our next step is to prove local convergence guarantees to a local saddle point $z^*$.

Theorem 4.3. Suppose $z_0 \in K_\gamma$. Then, for an $L$-Lipschitz function $Q(z)$ whose Jacobian $J_G(z)$ is also $L'$-Lipschitz, the accumulated difference between the iterates and the fixed point $z^*$, as well as the total error from $Q(z^*)$, can be bounded by a constant:

$$\sum_{k=0}^{K} \|z_k - z^*\| \le O(1), \qquad \sum_{k=0}^{K} |Q(z_k) - Q(z^*)| \le O(1). \tag{15}$$

This implies that the iterates converge and the total error goes to 0 as $K \to \infty$. The proof of Theorem 4.3 can be found in Appendix C.1. Finally, we proceed to show that our algorithm converges to a strictly stable fixed point, implying local convergence (Wang et al. (2019)).

Lemma 4.4 (Proposition 1.4, Daskalakis & Panageas (2018)). A fixed point $z^* = (u^*, v^*)$ for which $\nabla f(z^*) = 0$ holds is locally stable if the Jacobian of the update rule satisfies $\rho(J) \le 1$, where $\rho(\cdot)$ is the spectral norm.

To leverage the lemma above, we begin with the Jacobian of our update rule,

$$J = I - \eta\Big(J_G^{-1}(z_k)\, J_G(z_k) + \nabla_{z_k}\big(J_G^{-1}(z_k)\big)\, G(z_k)\Big). \tag{16}$$

Proposition 4.5. Considering linearized ODE dynamics, we can deduce that the derivatives of $Q_{uu}$ and $Q_{vv}$ with respect to $z_k = [u_k; v_k]$ vanish, $\nabla_{z_k} Q_{uu} = \nabla_{z_k} Q_{vv} = 0$, implying $\nabla_{z_k}\big(J_G^{-1}(z_k)\big) = 0$.

From this proposition, we can easily infer from Eq. (16) that $\|J\| < 1$ for all $\eta \in (0, 1)$. Therefore, our optimizer converges to stable saddle points, implying stability. The proof of Proposition 4.5 is deferred to Appendix C.2.
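To illustrate the preconditioned update $z_{k+1} = z_k - \eta\, J_G^{-1}(z_k)\, G(z_k)$ analyzed above, here is a small numerical sketch on a quadratic saddle problem $Q(u, v) = \tfrac{1}{2} u^\top A u - \tfrac{1}{2} v^\top B v + u^\top C v$, whose unique saddle point is the origin; the matrices and step size are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
A = np.eye(n) * 2.0                      # Q_uu > 0  (locally convex in u)
B = np.eye(n) * 3.0                      # -Q_vv > 0 (locally concave in v)
C = rng.standard_normal((n, n)) * 0.5    # coupling term, Q_uv = C

def G(z):
    """Simultaneous gradient field G(z) = [dQ/du; dQ/dv] for the quadratic saddle."""
    u, v = z[:n], z[n:]
    return np.concatenate([A @ u + C @ v, C.T @ u - B @ v])

# Jacobian of G is constant for a quadratic Q: [[Q_uu, Q_uv], [Q_vu, Q_vv]]
J_G = np.block([[A, C], [C.T, -B]])

z = rng.standard_normal(2 * n)
eta = 0.5
for k in range(50):
    z = z - eta * np.linalg.solve(J_G, G(z))   # z_{k+1} = z_k - eta * J_G^{-1} G(z_k)

print("distance to saddle point:", np.linalg.norm(z))   # decays geometrically toward 0
```

Because $Q$ is quadratic here, $G(z) = J_G z$, so each iterate contracts by a factor $(1 - \eta)$ toward the saddle point, matching the stability argument around Eq. (16).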
5 EXPERIMENTS

We validate the efficacy of our algorithm in comparison with other state-of-the-art optimizers widely used in Neural ODE applications, and we highlight the ability of GTSONO to be successfully adapted to adversarial training methods. All experiments are conducted on a TITAN RTX GPU.

Datasets. We carry out our experiments on two image datasets: CIFAR-10 and SVHN. Both datasets are standardized and consist of 3x32x32 colour images with 10 label classes.

Networks. The benchmark optimizers were applied to identical network structures in every experiment round. GTSONO was also evaluated on the same architecture with two identical sets of antagonizing weights, implying twice as many parameters as the benchmark models. Furthermore, to reduce the number of trainable parameters, we also consider a variant containing the antagonizing set of controls only in the convolution layers. We refer to this architecture as C-GTSONO. More information regarding the network architectures is provided in Appendix D.

Attacks. In this study, we focus on the white-box $\ell_\infty$-norm PGD and FGSM attacks and the gray/black-box CW attack to assess the robustness of our models. Every disturbance mentioned in the experiments below is applied to the standardized dataset. Details about the update rules of the attacks are left to Appendix D. We denote each model's accuracy on the clean test set (i.e., natural accuracy) by $A_{nat}$; $\mathrm{PGD}^{s}_{\epsilon}$ denotes the PGD attack with perturbation distance $\epsilon$ that takes $s$ steps in the direction of the gradient; and $\mathrm{FGSM}_{\alpha}$ denotes the FGSM attack whose single step in the direction of the gradient is multiplied by the constant $\alpha$. The CW attack is denoted simply as CW, as the $\ell_\infty$ norm was considered; for every experiment its perturbation distance was set to $\epsilon = 0.03$ and its maximum number of iterations to $K = 100$.

5.1 RESULTS

Optimizer Comparison. In this round of experiments we compare GTSONO against Adam and SGD with momentum as first-order benchmark optimizers, and against SNOpt (Liu et al., 2021a) as a second-order baseline, on the two aforementioned datasets. The training for all optimizers was carried out using only non-perturbed images, with a batch size of 500 images on every dataset. The networks for both datasets were trained for 15 epochs. More details about the network structure used with each optimizer are provided in Appendix D. As shown in Tables 1 and 2, GTSONO outperforms the benchmark optimizers on both datasets under natural training, and the performance gap widens as the degree of perturbation increases. Additionally, GTSONO provides predictions that are not only more robust on average but also more confident, as their standard deviation is smaller compared to the benchmark optimizers. Finally, we observe that the C-GTSONO variant of our algorithm is the most efficient and best performing, providing more accurate predictions in less training time. More details about the training times and memory consumption of each optimizer are left to Appendix D.

GTSONO on Adversarial Training Methods. In this round of experiments, we highlight the applicability of GTSONO for being efficiently adapted to other adversarial training methods and enhancing their performance. More specifically, we consider the Free Adversarial Training (Free AT) scheme (Shafahi et al., 2019) and TRADES (Zhang et al., 2019b), with SGD as their optimizer, and evaluate how these adversarial training methods benefit from employing C-GTSONO as their optimizer instead of SGD.
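For reference, a minimal sketch of the robustness evaluation protocol implied by the notation above ($A_{nat}$, $\mathrm{FGSM}_\alpha$, $\mathrm{PGD}^s_\epsilon$); it reuses the hypothetical `fgsm_perturb` and `pgd_perturb` helpers sketched earlier and a generic PyTorch data loader, so it illustrates the metric definitions rather than the paper's exact evaluation code.

```python
import torch

@torch.no_grad()
def accuracy(model, x, y):
    return (model(x).argmax(dim=1) == y).float().sum().item()

def evaluate_robustness(model, loader, eps=0.03, alpha=0.03, steps=20, device="cuda"):
    """Report clean, FGSM_alpha, and PGD^steps_eps accuracy (%) over a test loader."""
    counts = {"nat": 0.0, "fgsm": 0.0, "pgd": 0.0}
    total = 0
    model.eval()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        total += y.numel()
        counts["nat"] += accuracy(model, x, y)
        counts["fgsm"] += accuracy(model, fgsm_perturb(model, x, y, alpha=alpha), y)
        counts["pgd"] += accuracy(model, x + pgd_perturb(model, x, y, eps=eps, steps=steps), y)
    return {k: 100.0 * v / total for k, v in counts.items()}
```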
We perform two ablation studies: for Free AT we vary the number of internal iterations, m = 4 and m = 8, while for TRADES we fix the number of internal iterations used to generate adversarial examples to 5 and evaluate 1/λ = 6 and 1/λ = 10. More details about each adversarial training method and the selection of their hyperparameters are left to Appendix D.

Table 1: Average ± standard deviation of test set accuracy (%) on CIFAR-10 for each optimizer. $A_{nat}$ denotes the natural accuracy. $\mathrm{PGD}^{s}_{\epsilon}$ denotes the accuracy under a PGD attack taking s steps in the direction of the gradient with perturbation distance ϵ. $\mathrm{FGSM}_{\alpha}$ denotes the accuracy under an FGSM attack whose single gradient step is multiplied by the constant α. CW denotes the accuracy under the CW attack.

| Optimizer | $A_{nat}$ | $\mathrm{FGSM}_{0.03}$ | $\mathrm{FGSM}_{0.05}$ | $\mathrm{PGD}^{20}_{0.03}$ | $\mathrm{PGD}^{20}_{0.05}$ | CW |
|---|---|---|---|---|---|---|
| Adam | 78.7 ± 1.1 | 48.2 ± 0.7 | 30.8 ± 0.5 | 45.1 ± 1.2 | 29.1 ± 0.3 | 15.4 ± 0.6 |
| SGD | 77.5 ± 0.6 | 47.3 ± 1.3 | 33.8 ± 1.3 | 45.8 ± 1.5 | 29.3 ± 1.4 | 18.3 ± 1.6 |
| SNOpt | 79.1 ± 0.4 | 48.7 ± 1.0 | 35.7 ± 1.0 | 46.8 ± 1.4 | 32.1 ± 1.4 | 7.5 ± 1.0 |
| GTSONO | 74.7 ± 0.6 | 51.7 ± 0.3 | 37.9 ± 0.4 | 49.9 ± 0.5 | 34.9 ± 1.1 | 18.0 ± 1.5 |
| C-GTSONO | 74.7 ± 0.7 | 51.8 ± 0.4 | 38.0 ± 0.2 | 50.6 ± 0.3 | 35.0 ± 0.2 | 36.3 ± 2.2 |

Table 2: Average ± standard deviation of test set accuracy (%) on SVHN for each optimizer.

| Optimizer | $A_{nat}$ | $\mathrm{FGSM}_{0.03}$ | $\mathrm{FGSM}_{0.05}$ | $\mathrm{PGD}^{20}_{0.03}$ | $\mathrm{PGD}^{20}_{0.05}$ | CW |
|---|---|---|---|---|---|---|
| Adam | 98.9 ± 0.3 | 73.8 ± 0.4 | 55.9 ± 0.8 | 71.8 ± 0.1 | 48.4 ± 1.0 | 20.3 ± 0.2 |
| SGD | 98.4 ± 0.0 | 74.4 ± 0.4 | 56.1 ± 0.5 | 72.4 ± 0.7 | 50.4 ± 1.1 | 23.4 ± 0.9 |
| SNOpt | 99.1 ± 0.1 | 73.5 ± 2.2 | 54.4 ± 2.3 | 71.9 ± 2.7 | 48.7 ± 3.3 | 22.4 ± 0.9 |
| GTSONO | 99.6 ± 0.0 | 78.0 ± 0.4 | 58.9 ± 0.4 | 76.7 ± 0.8 | 54.3 ± 0.8 | 31.6 ± 2.2 |
| C-GTSONO | 97.3 ± 0.2 | 80.8 ± 0.2 | 65.2 ± 0.3 | 80.3 ± 0.4 | 62.3 ± 0.3 | 50.5 ± 0.8 |

Table 3: Comparison between GTSONO and SGD applied to Free AT, for m = 4, 8.

| Optimizer | $A_{nat}$ | $\mathrm{PGD}^{20}_{0.03}$ | $\mathrm{PGD}^{40}_{0.03}$ | CW |
|---|---|---|---|---|
| SGD (m=4) | 76.8 | 48.8 | 48.1 | 6.5 |
| C-GTSONO (m=4) | 75.7 | 50.5 | 50.2 | 17.4 |
| SGD (m=8) | 76.2 | 54.8 | 54.2 | 9.5 |
| C-GTSONO (m=8) | 74.4 | 56.5 | 56.4 | 18.9 |

To avoid robust overfitting, we evaluate the robustness of each optimizer after convergence is observed. Although GTSONO presents a slower per-iteration training time, Figure 3 shows that in both methods our optimizer converges to a local saddle point in considerably fewer iterations, owing to the higher-order convergence rate inherited from min-max DDP. In both defense schemes, GTSONO was found to converge after approximately 10 or fewer epochs, whereas the benchmark models required roughly twice as many epochs to converge. This results in less overall training time, which is critical for adversarial training methods, as they generally take longer to train due to the internal iterations required to generate adversarial samples. Table 5 presents a comparison of the training time required by GTSONO and the baseline model in Free AT for m = 8, demonstrating the faster wall-clock adversarial training time of GTSONO. The remaining comparisons of required resources are left to Appendix D. Finally, Tables 3 and 4 indicate that adopting our optimizer increases the robust accuracy of both defense methods. More specifically, Free AT benefits from the adaptation of GTSONO with a performance increase of approximately 2% under PGD attacks, whereas in TRADES we obtain a performance increase of more than 4% for both tested values of λ. Interestingly, the benchmark models were found to perform better under CW.

Update Laws. Recall that the focal point of this study is the open-loop min-max DDP; however, as mentioned in Section 3, setting the cross-terms $Q_{vu} = 0$ readily yields GDA with Hessian preconditioning.
We investigate the difference in performance between the open-loop update rule (DDP-GTSONO) and the GDA update (GDA-GTSONO) under the FGSM and PGD attacks for increasing perturbation.

Table 4: Comparison between GTSONO and SGD used with TRADES, for $\lambda^{-1} = 6, 10$.

| Method | $A_{nat}$ | $\mathrm{PGD}^{20}_{0.03}$ | $\mathrm{PGD}^{40}_{0.03}$ | CW |
|---|---|---|---|---|
| TRADES ($\lambda^{-1}$=6) | 73.2 | 54.5 | 54.3 | 22.8 |
| C-GTSONO ($\lambda^{-1}$=6) | 75.8 | 58.9 | 58.8 | 20.4 |
| TRADES ($\lambda^{-1}$=10) | 69.4 | 57.2 | 56.9 | 25.2 |
| C-GTSONO ($\lambda^{-1}$=10) | 72.6 | 61.2 | 61.0 | 20.7 |

Figure 3: Convergence comparison between GTSONO and SGD. Left: Free AT. Right: TRADES.

In Figure 4 we observe that on the SVHN dataset, DDP-GTSONO outperforms its counterpart for every examined disturbance. Additionally, on the CIFAR-10 dataset, GDA-GTSONO is more accurate for small perturbations; however, as the disturbance increases, DDP-GTSONO achieves better robust accuracy against both attacks, demonstrating the superiority of our DDP-based update.

Figure 4: Comparison of the DDP and GDA variants of GTSONO.

Table 5: Training time per iteration and number of epochs trained for Free AT, m = 4.

| Optimizer | Time per iteration (sec) | Epochs |
|---|---|---|
| SGD | 0.63 | 40 |
| C-GTSONO | 1.4 | 5 |

Limitations. The achieved robustness of our optimizer comes at the price of a slower per-iteration training time and higher memory consumption, mainly due to the required computation of $Q_{uu}$, $Q_{uv}$, $Q_{vv}$. However, as shown in Appendix D, GTSONO remains affordable for current GPUs. Additionally, we observed that the continuous models considered in this study were prone to robust overfitting in some instances, not only with our algorithm but also with the benchmark optimizers.

6 CONCLUSIONS

We present an efficient game theoretic optimizer for training robust neural ODEs, with provable convergence guarantees to local saddle points. Based on the principles of min-max DDP, our framework leverages the second-order convergence of this OC paradigm by employing Kronecker factorization to decompose matrices into vectors which can be easily obtained by solving the ODE. This allows us to derive tractable expressions for the terms in the update of open-loop min-max DDP. Empirically, GTSONO increased robustness and generated more accurate and confident predictions on attacked images compared to benchmark optimizers under natural training. We also demonstrated that GTSONO can be successfully adapted to existing adversarial training methods, enhancing their robustness. In future work, we wish to extend the applicability of our optimizer to a larger variety of network architectures (e.g., ResNets, Transformers) and explore appropriate regularization techniques to further improve the performance of our algorithm. Finally, we wish to investigate recasting the proposed optimizer through distributed DDP schemes, as in Saravanos et al. (2023). In summary, our work paves new ways for robust optimal control methodologies to be applied in deep continuous models to enhance their robustness.

ACKNOWLEDGEMENTS

This material is based upon work supported by the NASA Aeronautics Research Mission Directorate (ARMD) University Leadership Initiative (ULI) under cooperative agreement number 80NSSC22M0070. Augustinos Saravanos acknowledges financial support from the A. Onassis Foundation Scholarship.

REFERENCES

Leonard Adolphs, Hadi Daneshmand, Aurelien Lucchi, and Thomas Hofmann. Local saddle point optimization: A curvature exploitation approach.
In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 486-495. PMLR, 2019.

Shumeet Baluja and Ian Fischer. Adversarial transformation networks: Learning to generate adversarial examples. arXiv preprint arXiv:1703.09387, 2017.

Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39-57. IEEE, 2017.

Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, Aleksander Madry, and Alexey Kurakin. On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705, 2019.

Fabio Carrara, Roberto Caldelli, Fabrizio Falchi, and Giuseppe Amato. On the robustness to adversarial examples of neural ODE image classifiers. In 2019 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1-6. IEEE, 2019.

Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 31, 2018.

Constantinos Daskalakis and Ioannis Panageas. The limit points of (optimistic) gradient descent in min-max optimization. Advances in Neural Information Processing Systems, 31, 2018.

Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. Boosting adversarial attacks with momentum. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9185-9193, 2018.

J. R. Dormand and P. J. Prince. A family of embedded Runge-Kutta formulae. Journal of Computational and Applied Mathematics, 6(1):19-26, 1980. ISSN 0377-0427. doi: https://doi.org/10.1016/0771-050X(80)90013-3. URL https://www.sciencedirect.com/science/article/pii/0771050X80900133.

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Yifei Huang, Yaodong Yu, Hongyang Zhang, Yi Ma, and Yuan Yao. Adversarial robustness of stabilized neural ODE might be from obfuscated gradients. In Mathematical and Scientific Machine Learning, pp. 497-515. PMLR, 2022.

David H. Jacobson and David Q. Mayne. Differential Dynamic Programming. Number 24. Elsevier Publishing Company, 1970.

Chi Jin, Praneeth Netrapalli, and Michael Jordan. What is local optimality in nonconvex-nonconcave minimax optimization? In International Conference on Machine Learning, pp. 4880-4889. PMLR, 2020.

César Laurent, Thomas George, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. An evaluation of Fisher approximations beyond Kronecker factorization. 2018.

Antoine Lesage-Landry, Joshua A. Taylor, and Iman Shames. Second-order online nonconvex optimization. IEEE Transactions on Automatic Control, 66(10):4866-4872, 2020.

Hongshuo Liang, Erlu He, Yangyang Zhao, Zhe Jia, and Hao Li. Adversarial attack and defense: A survey. Electronics, 11(8):1283, 2022.

Fangzhou Liao, Ming Liang, Yinpeng Dong, Tianyu Pang, Xiaolin Hu, and Jun Zhu. Defense against adversarial attacks using high-level representation guided denoiser. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1778-1787, 2018.

Tianyi Lin, Chi Jin, and Michael Jordan. On gradient descent ascent for nonconvex-concave minimax problems. In International Conference on Machine Learning, pp. 6083-6093. PMLR, 2020.

Guan-Horng Liu, Tianrong Chen, and Evangelos Theodorou. Second-order neural ODE optimizer.
Advances in Neural Information Processing Systems, 34:25267-25279, 2021a.

Guan-Horng Liu, Tianrong Chen, and Evangelos A. Theodorou. DDPNOpt: Differential dynamic programming neural optimizer, 2021b.

Xuanqing Liu, Tesi Xiao, Si Si, Qin Cao, Sanjiv Kumar, and Cho-Jui Hsieh. How does noise help robustness? Explanation and exploration under the neural SDE framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 282-290, 2020.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408-2417. PMLR, 2015.

Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: A simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2574-2582, 2016.

Jun Morimoto, Garth Zeglin, and Christopher G. Atkeson. Minimax differential dynamic programming: Application to a biped walking robot. In Proceedings 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003) (Cat. No. 03CH37453), volume 2, pp. 1927-1932. IEEE, 2003.

Muzammal Naseer, Salman Khan, and Fatih Porikli. Local gradients smoothing: Defense against localized adversarial attacks. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1300-1307, 2019. doi: 10.1109/WACV.2019.00143.

Yunpeng Pan and Evangelos Theodorou. Probabilistic differential dynamic programming. Advances in Neural Information Processing Systems, 27, 2014.

Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pp. 582-597. IEEE, 2016.

Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pp. 506-519, 2017.

Alexander Poznyak, Sebastian Noriega-Marquez, Alejandra Hernandez-Sanchez, Mariana Ballesteros Escamilla, and Isaac Chairez. Min-max dynamic programming control for systems with uncertain mathematical models via differential neural network Bellman's function approximation. Mathematics, 11(5):1211, 2023.

Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344, 2018.

Andrew Ross and Finale Doshi-Velez. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Augustinos D. Saravanos, Yuichiro Aoyama, Hongchang Zhu, and Evangelos A. Theodorou. Distributed differential dynamic programming architectures for large-scale multiagent control. IEEE Transactions on Robotics, pp. 1-21, 2023. doi: 10.1109/TRO.2023.3319894.

Ali Shafahi, Mahyar Najibi, Mohammad Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S. Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! Advances in Neural Information Processing Systems, 32, 2019.
Wei Sun, Evangelos A Theodorou, and Panagiotis Tsiotras. Game theoretic continuous time differential dynamic programming. In 2015 American control conference (ACC), pp. 5593 5598. IEEE, 2015. Wei Sun, Yunpeng Pan, Jaein Lim, Evangelos A Theodorou, and Panagiotis Tsiotras. Min-max differential dynamic programming: Continuous and discrete time formulations. Journal of Guidance, Control, and Dynamics, 41(12):2568 2580, 2018. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. ar Xiv preprint ar Xiv:1312.6199, 2013. Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick Mc Daniel. Ensemble adversarial training: Attacks and defenses. ar Xiv preprint ar Xiv:1705.07204, 2017. Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. ar Xiv preprint ar Xiv:1805.12152, 2018. Jia Wang, Chengyu Wang, Qiuzhen Lin, Chengwen Luo, Chao Wu, and Jianqiang Li. Adversarial attacks and defenses in deep learning for image recognition: A survey. Neurocomputing, 2022. Yuanhao Wang, Guodong Zhang, and Jimmy Ba. On solving minimax optimization locally: A follow-the-ridge approach. ar Xiv preprint ar Xiv:1910.07512, 2019. Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International conference on machine learning, pp. 5286 5295. PMLR, 2018. Yikai Wu, Xingyu Zhu, Chenwei Wu, Annie Wang, and Rong Ge. Dissecting hessian: Understanding common structure of hessian in neural networks. ar Xiv preprint ar Xiv:2010.04261, 2020. Cihang Xie, Zhishuai Zhang, Yuyin Zhou, Song Bai, Jianyu Wang, Zhou Ren, and Alan L Yuille. Improving transferability of adversarial examples with input diversity. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2730 2739, 2019. Dinghuai Zhang, Tianyuan Zhang, Yiping Lu, Zhanxing Zhu, and Bin Dong. You only propagate once: Accelerating adversarial training via maximal principle. Advances in Neural Information Processing Systems, 32, 2019a. Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In International conference on machine learning, pp. 7472 7482. PMLR, 2019b. Zhengli Zhao, Dheeru Dua, and Sameer Singh. Generating natural adversarial examples. ar Xiv preprint ar Xiv:1710.11342, 2017. Haizhong Zheng, Ziqi Zhang, Juncheng Gu, Honglak Lee, and Atul Prakash. Efficient adversarial training with transferable adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1181 1190, 2020. Published as a conference paper at ICLR 2024 A MISSING DERIVATIONS FROM SECTION 3 We recall the formulation of the objective of the Neural ODE framework expressd in a fashion easily interpretable through the prism of game theoretic OCP. min u max v Φ(xtf ) + Z tf t0 ℓ(t, x, u, v)dt , subjected to dt = F(t, x, u, v), x(t0) = x0 du dt = 0, u(t0) = θ dv dt = 0, v(t0) = η (17) where x x(t) Rm, u u(t) Rn, and v v(t) Rn. It is clear that (17) describes (2) without loss of generality by taking (Φ, ℓ) = (ℓ, 0). The function F(t, x, u, v) characterizes the vector field and is parameterized by a Deep Neural Network (DNN). We consider that the dynamics of the system are symetric with respect to the two sets of weights (u, v). 
The functions Φ, and ℓ are known as the terminal and running cost in the context of OCP. This problem is understood as particular type of OCP that searches for the optimal initial condition of the time-invariant controls ut, vt. In the DNN setting, the terminal cost is equivalent to the loss function, for instance Categorical Cross Entropy for multi-label classification, and the running cost is equivalent to the weight decay, or some other regularization techniques that acts on the weights of the intermediate hidden layers. Next, the accumulated loss Q(t, x, u, v) is defined as follows: Q(t, x, u, v) = Φ(x(tf)) + Z tf t ℓ(τ, x(τ), u(τ), v(τ))dτ (18) From this definition of the accumulated loss implies that Q we can readily obtain: 0 = ℓ(t, x, u, v) + d Q(t, x, u, v) A.1 PROOF OF THEOREM 3.2 Considering the nominal trajectory ( x, u, v), that satisfies the ODEs in the constraints of (3), we take the Taylor expansion of the terms in Eq.19 keeping up to second order terms: ℓ(x, u, v) = ℓ( x, u, v) + ℓxδxt + ℓuδu + ℓvδv + 1 "δxt δut δvt # "ℓxx ℓxu ℓxv ℓux ℓuu ℓuv ℓvx ℓvu ℓvv # "δxt δut δvt Q(x, u, v) = Q( x, u, v) + Qxδxt + Quδu + Qvδv + 1 "δxt δut δvt # "Qxx Qxu Qxv Qux Quu Quv Qvx Qvu Qvv # "δxt δut δvt We differentiate 20b, in order to utilize Eq. 19, and taking into account that dδu dt = 0, we obtain: d Q dt d Q( x, u, v) dt δxt + Qx dδx dt ) + (d Qu dt δu + Qu dδu dt ) + (d Qv dt δv + Qv dδv "δxt δut δvt "δxt δut δvt "Qxx Qxu Qxv Qux Quu Quv Qvx Qvu Qvv # "δxt δut δvt "δxt δut δvt # "Qxx Qxu Qxv Qux Quu Quv Qvx Qvu Qvv Next, we need to compute the term δx dt = Fxδx + Fuδu + Fvδv (22) The second equation comes from the fact that the time derivative evaluated for fixed x yields the dynamics F( x, u, v), whereas for the first time derivative we take the Taylor expansion around the Published as a conference paper at ICLR 2024 nominal trajectory. Finally, substituting Eq. 22 into Eq. 21, and back into 19, we obtain the following ODE for Q and its derivatives dt = Fx Qx + ℓx, d Qxx dt = ℓxx + Fx Qxx + Qxx F x , d Qxu dt = ℓxu + Fx Qxu + Qxx F u (23a) dt = Fu Qx + ℓu, d Quu dt = ℓuu + Fu Qxu + Qux F u, d Quv dt = ℓuv + Fu Qxv + Qux F v (23b) dt = Fv Qx + ℓv, d Qvv dt = ℓvv + Fv Qxv + Qvx F v d Qxv dt = ℓxv + Fx Qxv + Qxx F v (23c) At this point, recall Assumption 3.1, then we easily obtain Eq. 5 for the ODEs of the second order derivatives. A.2 PROOF THEOREM 3.4 Proof. We recall 3.4, where it was assumed the matrix Qxx(t1) to be a symmetric matrix of rank R m, it may be represented as: Qxx = PR i=1 yiy i , where yi Rm. Additionally, we had that Quu(tf) = 0, Qvv(tf) = 0. Then t [t0, tf], the second order matrices in (23) that contain derivative with respect to the state can be decomposed as follows: i=1 qi(t)qi(t) , Qxu(t) = i=1 qi(t)pi(t) , Qxv(t) = i=1 qi(t)si(t) (24) From Eq. 23, we obtain: dt = ℓxx + Fx Qxx + Qxx F x i=1 qi(t)qi(t) + qi(t)qi(t) F x Fxqi(t) qi(t) + i=1 qi(t) Fxqi(t) In the first equality, ℓxx is equal to 0, from Assumption 3.1. Additionally, taking the time derivative of Qxx in Eq. 24 yields i=1 qi(t)qi(t) = dt qi(t) + qi(t)dqi(t) Equating Equations 26 and 25 yields: dqi dt = Fxqi. We proceed in similar fashion for Qxu, and Qxv. 
dt = ℓxu + Fx Qxu + Qxx F u = Fx Qxu + Qxx F u i=1 qi(t)pi(t) )+ R X i=1 qi(t)qi(t) F u Fxqi(t) pi(t) + i=1 qi(t) Fuqi(t) dt = ℓxv + Fx Qxv + Qxx F v = Fx Qxv + Qxx F v i=1 qi(t)si(t) )+ R X i=1 qi(t)qi(t) F v Fxqi(t) si(t) + i=1 qi(t) Fvqi(t) Published as a conference paper at ICLR 2024 In similar fashion, we also compute the time derivative for Qxu and Qxv. i=1 qi(t)pi(t) = dt pi(t) + qi(t)dpi(t) i=1 qi(t)si(t) = dt si(t) + qi(t)dsi(t) It follows that dpi dt = Fuqi, and dsi dt = Fvqi (31) Based on the results above, we can deduce that the Hessian Quu can be rewritten as follows dt = ℓuu + Fu Qxu + Qux F u Fuqi(t) pi(t) + i=1 pi(t) Fuqi(t) (32) Integrating the equation above for t [t, tf] yields dt dt = Z tf t Ru Idt + Z tf Fuqi(t) pi(t) + i=1 pi(t) Fuqi(t) ! Recall the terminal condition Quu(tf) = 0, and using Eq. 31, we obtain Quu(t) = Ru I(tf t) dt p i + pi = Ru I(tf t) + d dt(pip i )dt = Ru I(tf t) + i=1 pi(t)p i (t) Therefore second order matrices can be rewritten as follows: Quu(t) = Ru I(tf t) + tf Fuqidt Z t tf Fuqidt (35a) Qvv(t) = Rv I(tf t) + tf Fvqidt Z t tf Fvqidt (35b) tf Fuqidt Z t tf Fvqidt (35c) tf Fxqidt Z t tf Fuqidt , Qvx(t) = tf Fxqidt Z t tf Fvqidt (35d) A.3 LAYER WISE PARTIONING To avoid dimensionality issues and tensor representations, the integrations in Eq. (35) are separated into each layer j of the network. We denote aj, hj, uj, and vj as the activation, pre-activation (linear combination), and the parameters of layer j, respectively. Furthermore, we consider the the preactivation vector hj(t), as an affine combination of the weights with the input to the jth layer. We recall that we have symmetric layer-wise dynamics with respect to u, and v, resulting in the Published as a conference paper at ICLR 2024 partial derivatives of the preactivation vector at the jth layer with respect to the control variables: hj u = hj v = aj, and hj x = (u + v). This implies that: Fxjqi = (u + v) ( F hj qi) for Feed Forward layers (36a) hj (aj I)qi = aj ( F hj qi) for Feed Forward layers (36b) Fxjqi = (u + v)ˆ ( F hj qi) for the Convolution layers (36c) Fujqi = ajˆ ( Fuj hj qi) for the Convolution layers (36d) where denotes the Kronecker product, and ˆ denotes the de-convolution operator. The layer-wise notation used above was adopted as a manner to circumvent dimensionality and tensor representations issues. In this vein, the integrations in equations (35) are broken down into each layer j of the network structure, where using the expression from 36, we obtain Z t0 dt = h . . . , Z t0 dt, . . . i = h . . . , Z t0 (u + v) ( F hj qi) dt, . . . i (37a) dt = h . . . , Z t0 dt, . . . i = h . . . , Z t0 hj qi) dt, . . . i (37b) dt = h . . . , Z t0 dt, . . . i = h . . . , Z t0 hj qi) dt, . . . i (37c) A.4 DERIVATION OF EQUATION 10 Following the layer-wise representation, We recall that to derive 10 these expressions, we first use the Kronecker product property: (A B)(C D) = AC BD. Additionally, during the process of this derivation some approximations are necessary. The following assumptions are necessary in order to derive Assumption A.1. Suppose: a(t)j, and g(t)j are uncorrelated across time aj(t), gj(t) are pair-wise independent (Liu et al. 2021 NIPS) For the sake of brevity we demonstrate the analytic proof only for the Hessian Quu, the others follow similarly. Qujuj(t0) = Ru I(t0 tf) + tf Fuqidt Z t = Ru I(t0 tf) + hj qi)dt Z t Ru I(t0 tf) + hj qi) aj ( F = Ru I(t0 tf) + Ru I(t0 tf) + Z t0 aja j dt | {z } Aj(t) | {z } Bj(t) Published as a conference paper at ICLR 2024 Remark A.2. 
We begin by underlining that both assumptions are widely adopted (Martens & Grosse (2015)). The first assumption admits that aj(t) gj(t) are temporally uncorrelated. This is necessary to yield tractable Kronecker matrices for second-order optimization. Although, it may be a strong assumption, in some cases, it has been empirically observed that the uncorrelated temporal assumption may yield better performance Laurent et al. (2018)). The second assumption admits that aj(t) and gj(t) are pair-wise independents, which has been verified empirically in Wu et al. (2020). A.5 PROOF OF PROPOSITION 3.6 Proof. We begin by trying to simplifying the expressions for Quu, and Qvv. Setting: (ΣA ΣB) = Suv, (ΣA ΣB + λu) = Su, and (ΣA ΣB + λv) = Sv, we obtain: Quu = (Quu Quv Q 1 vv Qvu) = (UA UB)Su(UA UB) (UA UB)Suv(UA UB) (UA UB)S 1 v (UA UB) (UA UB)Suv(UA UB) = (UA UB)Su(UA UB) (UA UB)(Suv S 1 v Suv)(UA UB) = (UA UB)(Su Suv S 1 v Suv)(UA UB) Similarly, for Qvv = (UA UB)(Sv Suv S 1 u Suv)(UA UB) . At this point, using the following properties of the Kronecker product, based on the eigenvalue decomposition: (A B + λI) = (UA UB)(ΣA ΣB + λ)(UA UB) (40) (A B + λI) 1 = (UA UB)(ΣA ΣB + λ) 1(UA UB) (41) we can easily obtain the inverse of the matrices in 39 as follows: Q 1 uu = (UA UB)(Su Suv S 1 v Suv) 1(UA UB) , Q 1 vv = (UA UB)(Sv Suv S 1 u Suv) 1(UA UB) . A.6 PROOF OF PROPOSITION 3.7 Proof. We begin by considerin the closed loop Min-Max DDP update law where the feed-forward gains lu, along with lv, and the feedback gains Ku, Kv are given by lu = Q 1 uu(Quv Q 1 vv Qv Qu), and Ku = Q 1 uu(Qux Quv Q 1 vv Qvx) lv = Q 1 vv(Quv Q 1 uu Qu Qv), and Kv = Q 1 vv(Qvx Quv Q 1 vv Qvx) (43) with Quu = (Quu Quv Q 1 vv Qvu), and Qvv = (Qvv Qvu Q 1 uu Quv). At this point, we will leverage the following property of the Kronecker products: (A B)r = vec(BRA ), where R = vec 1(r) denotes the inverse vectorization of r , to express ℓu, and ℓv in a compact manner. Setting for brevity: Quxδxt = δqu, Qvxδxt = δqv, and substituting into 42 the expression for 13, the feed-forward and feedback gains in (43) can be compactly rewritten as follows Published as a conference paper at ICLR 2024 lu =(UA UB) (Su Suv S 1 v Suv) 1(Suv S 1 v )vec(U B Qv UA) | {z } guv (UA UB) (Su Suv S 1 v Suv) 1vec(U B Qu UA) | {z } guu = vec(UBGuv U A) vec(UBGuu U A) lv =(UA UB) (Sv Suv S 1 u Suv) 1(Suv S 1 u )vec(U B Qu UA) | {z } gvu (UA UB) (Su Suv S 1 v Suv) 1vec(U B Qv UA) | {z } gvv = vec(UBGvu U A) vec(UBGvv U A) Ku =(UA UB) (Su Suv S 1 v Suv) 1(Suv S 1 v )δqv | {z } yvv (UA UB) (Su Suv S 1 v Suv) 1δqu | {z } yvv = vec(UBYuv U A) vec(UBYuu U A) (44c) Kv =(UA UB) (Su Suv S 1 v Suv) 1(Suv S 1 v )δqu | {z } yvv (UA UB) (Su Suv S 1 v Suv) 1δqv | {z } yvv = vec(UBYvu U A) vec(UBYvv U A) (44d) where vec 1(Qi) = Qi, vec 1(gij) = Gij, and vec 1(yij) = Yij for i = {u, v} denote the inverse of the vectorization operation of the corresponding vectors. The only required inversions are of the diagonal S matrices which are computationally inexpensive, rendering the computation of the terms in 44 tractable to compute. B DIFFERENTIAL DYNAMIC PROGRAMMING IN CONTINUOUS TIME B.1 PROBLEM FORMULATION Consider a min-max game problem, with dynamics described the following ODE: dt = F(x(t), u(t), v(t), t), x(t0) = x0 (45) where x(t) X is the state of the dynamic system at t [t0, tf], and u(t, u(t)) U1 U and v(t, x(t)) U2 V denote conflicting controls, with U1 and U2 are convex sets containing all admissible controls of u and v respectively. 
For the sake of brevity, we will denote u(t, x(t)) u, and similarly v(t, x(t)) = v. The goal of this OCP scheme is to find non-anticipating strategies for both players. Additionally, we define the cost function as follows: J(u, v) = ϕ(tf, xtf ) + Z tf t0 ℓ(x, u, v, t)dτ (46) where ϕ : [t0, tf] X R+ is the terminal cost, and ℓ: X U V [t0, tf] R+ is the running cost incorporating the state and the control cost for both players. The conflict between the two players enters in our problem formulation as one tries through control u to minimize the above cost function, whereas the other one tries to maximize it through control v. We proceed to define the value function for our problem as the function expressing the minmax value of the cost function at time t = t0 and x = x0. V (t0, x0) = min u max v ϕ(tf, xtf ) + Z tf t0 ℓ(x, u, v, t)dτ Published as a conference paper at ICLR 2024 Using the Bellman principle, we write (47), as follows: V (t, x(t)) = min u(t tf ) max v(t tf ) ϕ(tf, xtf ) + Z t+dt t ℓ(x, u, v, t)dτ + Z tf t+dt ℓ(x, u, v, t)dτ = min u(t t+dt) max v(t t+dt) min u(t+dt tf ) max v(t+dt tf ) n ϕ(tf, xtf ) + Z tf t+dt ℓ(x, u, v, t)dτ o t ℓ(x, u, v, t)dτ = min u(t t+dt) max v(t t+dt) t ℓ(x, u, v, t)dτ + V (t + dt, xt+dt) Then we can readily obtain: 0 = min u max v h ℓ(x, u, v, t) + d V We can express d V as follows: This form enables us to obtain the min-max Hamilton Jacobi Bellman through substitution to (49): 0 = min u max v h ℓ(x, u, v, t) + V t = min u max v h ℓ(x(t), u, v, t) + V with its terminal condition being: V (tf, xtf ) = ϕ(tf, xtf ). B.2 BACKWARDS PROPAGATION First, we consider equation function Q as: Q(x, u, v, t) = ℓ(x, u, v, t) + d V dt . We expand the terms inside the minimization in (49) along with function Q up to second order terms, with respect to the nomimal trajectory ( x, u, v) and we obtain: ℓ( x + δxt, u + δut, v + δvt) = ℓ( x, u, v, t) + ℓ xδx + ℓ uδu + ℓ vδvt + 1 "δxt δut δvt # "ℓxx ℓxu ℓxv ℓux ℓuu ℓuv ℓvx ℓvu ℓvv # "δxt δut δvt V (t + dt, x + δxt) = V (t, x) + Vx(t, x)δxt + 1 2δx t Vxxδxt (53) At this point, we take the derivative with respect to time, of the expanded expression of the value function from (23). We note that V, Vx, Vxx are only functions of time, since they are expanded with respect to the nominal x(t). Additionally, we also notice that from the system dynamics the infinitesimal perturbation around the nominal trajectory δxt are also a function of time, so the differentiation of the second and third term below follow the product differentiation rule. x= x(t) u= u(t) v= v(t) δxt + V x dδxt Vxxδxt + δx t d Vxx dt δxt + δx t Vxx dδxt Published as a conference paper at ICLR 2024 Returning to the system dynamics, we can write dδx dt , as follows: x(t) x(t) = F(x, u, v, t) F( x, u, v, t) = F( x, u, v, t) + Fxδxt + Fuδut + Fvδvt F( x, u, v, t) = Fxδxt + Fuδut + Fvδvt where Fx, Fu, Fv stands for Fx( x, u, v), Fu( x, u, v), Fv( x, u, v) respectively. 
If we substitute (55) into (54), we obtain:

d/dt V(t + dt, x̄ + δxt) = dV/dt |_{x̄(t), ū(t), v̄(t)} + (dVx/dt)^T δxt + Vx^T (Fx δxt + Fu δut + Fv δvt)
  + (1/2) (Fx δxt + Fu δut + Fv δvt)^T Vxx δxt + (1/2) δxt^T (dVxx/dt) δxt + (1/2) δxt^T Vxx (Fx δxt + Fu δut + Fv δvt)   (56)

Now, if we substitute (56) and (52) into (49) and separate the terms that involve the time derivatives of the value function and of its state derivatives, we obtain:

−dV/dt − (dVx/dt)^T δxt − (1/2) δxt^T (dVxx/dt) δxt
 = min_{δu} max_{δv} [ ℓ(x̄, ū, v̄, t) + ℓx^T δxt + ℓu^T δut + ℓv^T δvt
   + (1/2) [δxt; δut; δvt]^T [ℓxx ℓxu ℓxv; ℓux ℓuu ℓuv; ℓvx ℓvu ℓvv] [δxt; δut; δvt]
   + Vx^T Fx δxt + Vx^T Fu δut + Vx^T Fv δvt
   + (1/2) [δxt; δut; δvt]^T [Fx^T Vxx + Vxx Fx, Vxx Fu, Vxx Fv; Fu^T Vxx, 0, 0; Fv^T Vxx, 0, 0] [δxt; δut; δvt] ]
 = min_{δu} max_{δv} [ ℓ(x̄, ū, v̄) + (ℓx + Fx^T Vx)^T δxt + (ℓu + Fu^T Vx)^T δut + (ℓv + Fv^T Vx)^T δvt
   + (1/2) [δxt; δut; δvt]^T [ℓxx + Fx^T Vxx + Vxx Fx, ℓxu + Vxx Fu, ℓxv + Vxx Fv; ℓux + Fu^T Vxx, ℓuu, ℓuv; ℓvx + Fv^T Vxx, ℓvu, ℓvv] [δxt; δut; δvt] ]   (57)

Following the definition of Q and (49), we can expand Q up to second order as follows:

0 = min_{δu} max_{δv} [ Q(x̄, ū, v̄, t) + Qx^T δxt + Qu^T δut + Qv^T δvt
  + (1/2) [δxt; δut; δvt]^T [Qxx Qxu Qxv; Qux Quu Quv; Qvx Qvu Qvv] [δxt; δut; δvt] ]   (58)

Therefore, equating the terms of (57) with the terms of the quadratic expansion (58) yields the following:

Q0(t) = ℓ(x̄(t), ū(t), v̄(t), t)
Qx(t) = ℓx + Fx^T Vx
Qu(t) = ℓu + Fu^T Vx
Qv(t) = ℓv + Fv^T Vx
Qxx(t) = ℓxx + Vxx Fx + Fx^T Vxx
Qxu(t) = ℓxu + Vxx Fu = Qux^T
Qxv(t) = ℓxv + Vxx Fv = Qvx^T
Quu(t) = ℓuu,   Qvv(t) = ℓvv,   Quv(t) = ℓuv = Qvu^T

Additionally, taking the derivatives of (58) with respect to δut and δvt and setting them equal to 0 yields:

δu* = −Quu^{-1}(Qux δxt + Quv δvt* + Qu)
δv* = −Qvv^{-1}(Qvx δxt + Qvu δut* + Qv)   (60)

However, we notice that the optimal control of the opponent is present in the expression of the optimal control of each player, so we eliminate this dependency:

δu* = −Quu^{-1}(Qux δxt + Quv(−Qvv^{-1}(Qvx δxt + Qvu δut* + Qv)) + Qu)
⇒ (Quu − Quv Qvv^{-1} Qvu) δu* = −(Qux − Quv Qvv^{-1} Qvx) δxt + Quv Qvv^{-1} Qv − Qu
⇒ δu* = (Quu − Quv Qvv^{-1} Qvu)^{-1} ( Quv Qvv^{-1} Qv − Qu − (Qux − Quv Qvv^{-1} Qvx) δxt )
⇒ δu* = lu + Ku δxt

where lu = (Quu − Quv Qvv^{-1} Qvu)^{-1}(Quv Qvv^{-1} Qv − Qu) and Ku = −(Quu − Quv Qvv^{-1} Qvu)^{-1}(Qux − Quv Qvv^{-1} Qvx). Similarly, we can express δv* = lv + Kv δxt, where the coefficients lv and Kv are defined as lv = (Qvv − Qvu Quu^{-1} Quv)^{-1}(Qvu Quu^{-1} Qu − Qv) and Kv = −(Qvv − Qvu Quu^{-1} Quv)^{-1}(Qvx − Qvu Quu^{-1} Qux).

From (57) and (58), the value function and its first- and second-order derivatives with respect to x are propagated through backward ordinary differential equations. Substituting the optimal perturbations δu* = lu + Ku δxt and δv* = lv + Kv δxt into (58),

−dV/dt − (dVx/dt)^T δxt − (1/2) δxt^T (dVxx/dt) δxt
 = Q(x̄, ū, v̄, t) + Qx^T δxt + Qu^T(lu + Ku δxt) + Qv^T(lv + Kv δxt)
 + (1/2) [δxt; lu + Ku δxt; lv + Kv δxt]^T [Qxx Qxu Qxv; Qux Quu Quv; Qvx Qvu Qvv] [δxt; lu + Ku δxt; lv + Kv δxt]

and matching terms of the same order in δxt yields:

−dV/dt = ℓ + lu^T Qu + lv^T Qv + (1/2) lu^T Quu lu + (1/2) lv^T Qvv lv + lu^T Quv lv

−dVx/dt = Qx + Ku^T Qu + Kv^T Qv + Qux^T lu + Qvx^T lv + Ku^T Quu lu + Ku^T Quv lv + Kv^T Qvu lu + Kv^T Qvv lv

−dVxx/dt = Qxx + Ku^T Qux + Qux^T Ku + Kv^T Qvx + Qvx^T Kv + Kv^T Qvu Ku + Ku^T Quv Kv + Ku^T Quu Ku + Kv^T Qvv Kv

under the terminal conditions:

V(tf) = ϕ(x̄(tf), tf),   Vx(tf) = ϕx(x̄(tf), tf),   Vxx(tf) = ϕxx(x̄(tf), tf)
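The elimination of the opponent's control above reduces, at each time step of the backward pass, to a pair of Schur-complement solves. The sketch below computes the gains lu, Ku, lv, Kv from already-formed Q blocks; the function name and the assumption that the Q blocks are given as dense tensors are illustrative only (in GTSONO these blocks are never formed densely but handled through the Kronecker factorization of Appendix A).

```python
import torch

def minmax_gains(Qu, Qv, Qux, Qvx, Quu, Qvv, Quv):
    """Feed-forward/feedback gains of the min-max DDP backward pass, assuming
    the Q blocks along the nominal trajectory are given.  Quu is assumed
    positive definite and Qvv negative definite (cf. Lemma C.5)."""
    Qvu = Quv.T
    Quu_t = Quu - Quv @ torch.linalg.solve(Qvv, Qvu)   # Schur complement Q̃uu
    Qvv_t = Qvv - Qvu @ torch.linalg.solve(Quu, Quv)   # Schur complement Q̃vv
    lu = torch.linalg.solve(Quu_t, Quv @ torch.linalg.solve(Qvv, Qv) - Qu)
    Ku = -torch.linalg.solve(Quu_t, Qux - Quv @ torch.linalg.solve(Qvv, Qvx))
    lv = torch.linalg.solve(Qvv_t, Qvu @ torch.linalg.solve(Quu, Qu) - Qv)
    Kv = -torch.linalg.solve(Qvv_t, Qvx - Qvu @ torch.linalg.solve(Quu, Qux))
    return lu, Ku, lv, Kv

# usage: the optimal perturbations are δu* = lu + Ku @ δx and δv* = lv + Kv @ δx
```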
C CONVERGENCE

We begin our convergence analysis for GTSONO with the definition of a global saddle point, which will prompt us to define local saddle points.

Definition C.1. A point (u*, v*) ∈ R^{2n} is a global saddle point of a function Q(·, ·) if Q(u*, v) ≤ Q(u*, v*) ≤ Q(u, v*) for all u ∈ R^n and v ∈ R^n.

For convenience in handling the variables, we further define the vector z = [u; v] ∈ R^{2n}, along with the function G : R^{2n} → R^{2n}, G(z) = [∇u Q(u, v); ∇v Q(u, v)]. Accordingly, the Jacobian of G is given by

JG(z) = [Quu Quv; Qvu Qvv] ∈ R^{2n×2n}

For the inverse of this Jacobian, using the Schur complement (Liu et al. (2021b)), we obtain

JG^{-1} = [ Q̃uu^{-1},  −Quu^{-1} Quv Q̃vv^{-1};  −Qvv^{-1} Qvu Q̃uu^{-1},  Q̃vv^{-1} ]   (65)

where Q̃uu = Quu − Quv Qvv^{-1} Qvu and, similarly, Q̃vv = Qvv − Qvu Quu^{-1} Quv. At this point, we move to the update rule using the inverse of JG:

[u_{k+1}; v_{k+1}] = [u_k; v_k] − η [ Q̃uu^{-1},  −Quu^{-1} Quv Q̃vv^{-1};  −Qvv^{-1} Qvu Q̃uu^{-1},  Q̃vv^{-1} ] [∇u Q(u, v); ∇v Q(u, v)]   (66)

which yields the update rule for the open-loop Min-Max DDP shown in Section 3. Using the definition z = [u; v], we can write Eq. (66) more compactly as z_{k+1} = z_k − η JG^{-1}(z_k) G(z_k).

Definition C.2. Let Kγ = {z : ||z − z*|| ≤ γ} be a neighborhood around a fixed point z* = [u*; v*]. Then z* is a local saddle point if Q(u*, v) ≤ Q(u*, v*) ≤ Q(u, v*) for all u, v with [u; v] ∈ Kγ.

Two fundamental assumptions, regarding the Lipschitz continuity of Q(u, v) in the region Kγ and the Lipschitz continuity of the Jacobian of G(z), i.e., JG(z), will enable us to derive bounds for the iterates and for the function Q.

Assumption C.3. We assume that in this region Q is a C2 function and that Q and JG are Lipschitz continuous for all z, z′ ∈ Kγ, i.e., we assume the following:

||Q(z) − Q(z′)|| ≤ L′ ||z − z′||   (67)
||JG(z) − JG(z′)|| ≤ L ||z − z′||   (68)

Additionally, consider a constant h for which it holds that 1/h ≥ σmax(JG^{-1}(z*)), the greatest singular value of the inverse of the Jacobian of G. Now, using the constants h and L, let us define the radius of the region Kγ as γ < 2h/(3L).

We proceed by stating a Lemma which will prove helpful in the derivation of bounds for consecutive iterations.

Lemma C.4 (Lemma 1, Lesage-Landry et al. (2020)). If M ∈ R^{n×n} is invertible, then there exists h > 0 such that ||My|| ≥ h ||y|| for all y ∈ R^n, and consequently ||M^{-1}|| ≤ 1/h.   (69)

In our analysis, we assume a fixed point z* and a local region Kγ such that ||z − z*|| ≤ γ, and follow these steps:
1. Provide the sufficient conditions under which this fixed point is guaranteed to be a local saddle point in the region Kγ.
2. Show that our optimizer converges to this fixed point, provided it is initialized within Kγ; under the stipulation that the conditions in Assumption 4.2 are met, this fixed point must be a local saddle point.
3. Show that our optimizer avoids unstable fixed points.

We start by stating the following Lemma.

Lemma C.5 (Lemma 4, Adolphs et al. (2019)). For a function Q that is at least twice differentiable, and not necessarily convex or concave, a point (u*, v*) ∈ Kγ is a local saddle point if the following conditions are met:
Necessary condition: ∇Q(u*, v*) = 0, and
Sufficient condition: Quu ≻ 0 and Qvv ≺ 0.

Intuitively, this Lemma expresses that Q must be locally convex with respect to u and locally concave with respect to v, at least in the region Kγ, for z* to be a local saddle point. Then, following Assumption 4.2, it holds that for an appropriate selection of Ru > 0 and Rv > 0 we can ensure the positive definiteness of Quu and the negative definiteness of Qvv. If these conditions are satisfied at z*, then it must also hold that Qu = Qv = 0, or equivalently G(z*) = 0.
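Before bounding the iterates, the following toy sketch illustrates the compact update z_{k+1} = z_k − η JG^{-1}(z_k) G(z_k) of Eq. (66) on a quadratic saddle Q(u, v) = ½uᵀAu − ½vᵀBv + uᵀCv with A, B ≻ 0, for which (0, 0) is the saddle point and the conditions of Lemma C.5 hold. The matrices and step size here are arbitrary choices for illustration, not values used in the paper.

```python
import torch

torch.manual_seed(0)
n = 3
A = torch.randn(n, n); A = A @ A.T + torch.eye(n)   # Quu = A  (positive definite)
B = torch.randn(n, n); B = B @ B.T + torch.eye(n)   # Qvv = -B (negative definite)
C = 0.1 * torch.randn(n, n)                          # Quv = C

def G(z):
    """Stacked gradient [∇u Q; ∇v Q] for Q(u, v) = ½uᵀAu − ½vᵀBv + uᵀCv."""
    u, v = z[:n], z[n:]
    return torch.cat([A @ u + C @ v, C.T @ u - B @ v])

def JG(z):
    """Jacobian of G, i.e. [[Quu, Quv], [Qvu, Qvv]] (constant for a quadratic Q)."""
    return torch.cat([torch.cat([A, C], dim=1),
                      torch.cat([C.T, -B], dim=1)], dim=0)

z = torch.randn(2 * n)        # initialization inside K_gamma around z* = 0
eta = 0.5
for _ in range(30):
    z = z - eta * torch.linalg.solve(JG(z), G(z))   # z_{k+1} = z_k − η J_G⁻¹ G(z_k)
print(torch.linalg.norm(z))   # -> approaches 0: the iterates converge to the saddle
```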
Now, we proceed with the following Lemma, from which we derive useful inequalities that will enable us to bound the iterates and the accumulated error of Q.

Lemma C.6. Suppose the following conditions are met:
1. The initialization falls within a ball of radius γ around z*, namely z0 ∈ Kγ.
2. Assumption 4.2 is satisfied, implying that z* is a local saddle point, and thus G(z*) = 0.
3. [Lemma 1; Lesage-Landry et al. (2020)]: For the square and invertible matrix JG, there exists h > 0 such that ||JG^{-1}(z*)|| ≤ 1/h.
4. The Jacobian of G(z) is L-Lipschitz: ||JG(z) − JG(z*)|| ≤ L ||z − z*||.

Then the following inequalities hold between two consecutive iterations:

||z_{k+1} − z*|| < ||z_k − z*||,   ||z_{k+1} − z*|| < (3L/2h) ||z_k − z*||^2   (70)

Proof. We begin the proof with the update rule and subtract the fixed point z* from both sides:

z_{k+1} = z_k − η JG^{-1}(z_k) G(z_k)
⇒ z_{k+1} − z* = z_k − z* − η JG^{-1}(z_k) G(z_k)
              = z_k − z* − η JG^{-1}(z_k) G(z_k) + η JG^{-1}(z_k) G(z*)
              = z_k − z* + η JG^{-1}(z_k) (G(z*) − G(z_k))

where we used the fact that G(z*) = 0. By the fundamental theorem of calculus,

z_{k+1} − z* = z_k − z* + η JG^{-1}(z_k) ∫_0^1 JG(z_k + τ(z* − z_k)) (z* − z_k) dτ
            = JG^{-1}(z_k) JG(z_k) (z_k − z*) + JG^{-1}(z_k) ∫_0^1 η JG(z_k + τ(z* − z_k)) (z* − z_k) dτ
            = JG^{-1}(z_k) ∫_0^1 JG(z_k) (z_k − z*) dτ + JG^{-1}(z_k) ∫_0^1 η JG(z_k + τ(z* − z_k)) (z* − z_k) dτ
            = JG^{-1}(z_k) ∫_0^1 [ η JG(z_k + τ(z* − z_k)) − JG(z_k) ] (z* − z_k) dτ

Consider that ||η JG(z_k + τ(z* − z_k))|| ≤ ||JG(z_k + τ(z* − z_k))|| for η ∈ (0, 1]. Therefore, one obtains the following inequality:

||z_{k+1} − z*|| ≤ ||JG^{-1}(z_k)|| · || ∫_0^1 ( JG(z_k + τ(z* − z_k)) − JG(z_k) ) (z* − z_k) dτ ||
               ≤ ||JG^{-1}(z_k)|| ∫_0^1 τ L ||z_k − z*||^2 dτ = ||JG^{-1}(z_k)|| (L/2) ||z_k − z*||^2   (73)

At this point, we use Lemma C.4 to derive an upper bound on the norm of the inverse of the Jacobian of the operator G. For an arbitrary vector y ∈ R^{2n},

||JG(z_k) y|| = ||JG(z_k) y + JG(z*) y − JG(z*) y||
            ≥ ||JG(z*) y|| − ||JG(z_k) y − JG(z*) y||
            ≥ ||JG(z*) y|| − ||JG(z_k) − JG(z*)|| ||y||
            ≥ h ||y|| − L ||z_k − z*|| ||y|| = (h − L ||z_k − z*||) ||y||

Thus, since ||JG(z_k) y|| ≥ (h − L ||z_k − z*||) ||y|| for every y ∈ R^{2n}, from Lemma C.4 we have that

||JG^{-1}(z_k)|| ≤ 1 / (h − L ||z_k − z*||)   (75)

Substituting Eq. (75) into (73), we obtain

||z_{k+1} − z*|| ≤ L / ( 2(h − L ||z_k − z*||) ) · ||z_k − z*||^2   (76)

By construction, we have ||z_k − z*|| < 2h/(3L), which enables us to obtain the following two inequalities:

||z_{k+1} − z*|| ≤ [ L ||z_k − z*|| / ( 2(h − L ||z_k − z*||) ) ] ||z_k − z*|| < [ (2h/3) / ( 2(h − 2h/3) ) ] ||z_k − z*|| = ||z_k − z*||   (77)

Trivially from Eq. (76), by bounding only the denominator, we obtain

||z_{k+1} − z*|| ≤ L / ( 2(h − L ||z_k − z*||) ) ||z_k − z*||^2 < L / ( 2(h − 2h/3) ) ||z_k − z*||^2 = (3L/2h) ||z_k − z*||^2   (78)

yielding Eq. (70). Now, we are in position to prove convergence to the fixed point z*.

C.1 PROOF OF THEOREM 4.3

For completeness, we restate Theorem 4.3.

Theorem C.7. Suppose z0 ∈ Kγ. Then, for the L′-Lipschitz function Q(z) with L-Lipschitz Jacobian JG(z), the accumulated distance between the iterates and the fixed point z*, as well as the total error from Q(z*), are bounded by constants:

Σ_{k=0}^{∞} ||z_k − z*|| ≤ O(1),   Σ_{k=0}^{∞} |Q(z_k) − Q(z*)| ≤ O(1)   (79)

Proof. Let us first show that Σ_{k=1}^{K} ||z_k − z*|| is bounded. Using the second inequality obtained in the previous lemma, we have

Σ_{k=1}^{K} ||z_k − z*|| < (3L/2h) Σ_{k=1}^{K} ||z_{k−1} − z*||^2

We want the RHS of the inequality to have the same index as the LHS:

(3L/2h) Σ_{k=1}^{K} ||z_{k−1} − z*||^2 = (3L/2h) ( Σ_{k=1}^{K} ||z_k − z*||^2 + e_0 − e_K )   (81)

where e_0 = ||z_0 − z*||^2 and e_K = ||z_K − z*||^2 are the individual error terms at the initialization phase and at the K-th iteration, respectively. From this definition, it follows that e_0, e_K > 0, hence (3L/2h)( Σ_{k=1}^{K} ||z_k − z*||^2 + e_0 − e_K ) ≤ (3L/2h)( Σ_{k=1}^{K} ||z_k − z*||^2 + e_0 ). Moreover, we observe that on the RHS we have ||z_k − z*||^2; using the assumption that the iterates fall in the region Kγ, where ||z − z*|| ≤ γ, yields

Σ_{k=1}^{K} ||z_k − z*||^2 ≤ γ Σ_{k=1}^{K} ||z_k − z*||

Therefore, we have

Σ_{k=1}^{K} ||z_k − z*|| < (3Lγ/2h) Σ_{k=1}^{K} ||z_k − z*|| + (3L/2h) e_0
⇒ (1 − 3Lγ/2h) Σ_{k=1}^{K} ||z_k − z*|| < (3L/2h) e_0
⇒ Σ_{k=1}^{K} ||z_k − z*|| < δ / (1 − 3Lγ/2h)

where δ = (3L/2h) e_0.
In the last inequality, we can divide by 1 − 3Lγ/2h, since by our assumption γ < 2h/(3L) this quantity is positive. Therefore, since lim_{K→∞} Σ_{k=0}^{K} ||z_k − z*|| < δ/(1 − 3Lγ/2h) is bounded by a constant and ||z_k − z*|| < ||z_{k−1} − z*||, it follows that ||z_k − z*|| → 0 as k → ∞. Now, recall that the function Q(z) is also L′-Lipschitz at the point z*, thus

Σ_{k=1}^{∞} |Q(z_k) − Q(z*)| ≤ L′ Σ_{k=1}^{∞} ||z_k − z*||   (84)

Hence, by combining the last two inequalities, we readily obtain

Σ_{k=1}^{∞} |Q(z_k) − Q(z*)| < L′ δ / (1 − 3Lγ/2h) = O(1)   (85)

This implies that the iterates converge and that the total error goes to 0 as the number of iterations increases to infinity.

We proceed to show that our algorithm converges only to strictly stable fixed points, implying local convergence (Wang et al. (2019)). Recall that for a fixed point z* = (u*, v*), for which it holds that G(z*) = 0, if the Jacobian of the update rule satisfies ρ(J) ≤ 1, then this point is asymptotically locally stable [Proposition 1.4, Daskalakis & Panageas (2018)]. The Jacobian of our update rule is given by

J = I − η ( JG^{-1} JG + ∇z(JG^{-1}) G )   (86)

C.2 PROOF OF PROPOSITION 4.5

We restate the Proposition for completeness.

Proposition C.8. Considering linearized ODE dynamics, we can deduce that the derivatives of Q̃uu and Q̃vv with respect to z = [u, v] vanish: ∇z Q̃uu = ∇z Q̃vv = 0. This implies that ∇z(JG^{-1}) = 0.

Proof. We begin by considering the linear approximation of the nonlinear dynamics along the nominal trajectory, an approximation upon which the entire DDP scheme is based (Pan & Theodorou (2014)). This enables us to set the second-order terms of the dynamics equal to zero, namely Fuu = Fvv = Fuv = Fxu = Fxv = Fxx = 0. Note that this approximation was also used to derive the ODEs in Eq. (23).

Proposition C.9. Recall the ODE we derived for decomposing Qxx, with components qi for i ∈ {1, . . . , Rm}:

dqi(t)/dt = −Fx^T qi(t)

with terminal condition qi(tf) = yi. We can express qi as

qi(t) = exp( ∫_t^{tf} Fx^T dτ ) yi   (87)

Using this expression for qi(t), we can calculate its partial derivatives with respect to x, u, and v. Since under the linearization Fxx = 0,

∂qi/∂x = ( ∫_t^{tf} Fxx^T dτ ) exp( ∫_t^{tf} Fx^T dτ ) yi = 0   (88)

Similarly, for the other derivatives, since Fxu = Fxv = 0,

∂qi/∂u = ( ∫_t^{tf} Fxu^T dτ ) exp( ∫_t^{tf} Fx^T dτ ) yi = 0,   ∂qi/∂v = ( ∫_t^{tf} Fxv^T dτ ) exp( ∫_t^{tf} Fx^T dτ ) yi = 0   (89)

At this point, we can move to the expressions for the Hessians in Eq. (35) and see that differentiating these expressions with respect to any set of weights results in 0:

∂/∂u [ ( ∫ Fu^T qi dτ )( ∫ Fu^T qi dτ )^T ] = ( ∫ (Fuu^T qi + Fu^T qi,u) dτ )( ∫ Fu^T qi dτ )^T + ( ∫ Fu^T qi dτ )( ∫ (Fuu^T qi + Fu^T qi,u) dτ )^T = 0

since Fuu = 0 and qi,u = ∂qi/∂u = 0. Similarly, it follows that the derivative of every second-order matrix with respect to u or v vanishes, resulting in ∇z(JG^{-1}) = 0. From this proposition, we can easily infer from Eq. (86) that ||J|| < 1 for every η ∈ (0, 1). Therefore, the spectral radius of the Jacobian of the update rule is less than 1, ensuring stability and showing that our optimizer converges to stable saddle points.

D EXPERIMENT DETAILS

Networks and ODE Solver. The ODE solver used for every experiment is the standard Runge-Kutta 4(5) adaptive solver (dopri5; Dormand & Prince (1980)) implemented in the torchdiffeq package, with the numerical tolerance set to 1e-3 and a fixed integration time [0, 1] for all experiments. All experiments were conducted on the same GPU machine (TITAN RTX) and implemented with PyTorch. A minimal example of this solver configuration is sketched below.
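The following sketch shows the solver configuration described above (dopri5 from torchdiffeq, tolerance 1e-3, integration time [0, 1]). The small fully-connected vector field and batch dimensions are placeholders; the paper's continuous block is the convolutional layer listed in Table 6.

```python
import torch
from torchdiffeq import odeint

class ODEFunc(torch.nn.Module):
    """Placeholder vector field standing in for the continuous convolution block."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.Tanh())

    def forward(self, t, h):
        return self.net(h)

func = ODEFunc()
h0 = torch.randn(16, 64)                       # batch of hidden states entering the ODE block
t = torch.tensor([0.0, 1.0])                   # fixed integration time [0, 1]
hT = odeint(func, h0, t, method="dopri5",      # Runge-Kutta 4(5) adaptive solver
            rtol=1e-3, atol=1e-3)[-1]          # numerical tolerance 1e-3
```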
Model configuration. Here, we specify the model architectures. We adopt the following syntax to describe the layer configuration: Conv(input, output, kernel, stride) and FC(input, output). Using this notation, Table 6 presents the architecture of every model.

Table 6: Network structure
  Discrete convolution layers:    ReLU(Conv(64, 64, 3, 1)) - ReLU(Conv(64, 64, 3, 1)) - Conv(64, 64, 3, 1)
  Continuous convolution layers:  ReLU(Conv(64, 65, 3, 1))
  Feed-forward layers:            ReLU(FC(2304, 500)) - FC(500, 10)

Note that for our optimizers, a game-theoretic layer is implemented as ReLU(Layer_u + Layer_v). Recall that in GTSONO every layer has a game-theoretic counterpart, whereas in C-GTSONO only the convolution layers are game-theoretic.

Dataset. Recall that the datasets examined in our experiments are CIFAR10 and SVHN. Both image datasets are preprocessed with standardization. CIFAR10 contains 50,000 training images and 10,000 test images, whereas SVHN contains 73,257 digits for training and 26,032 for testing. PGD and FGSM require the computation of the loss gradient; for this reason, to facilitate and accelerate testing against adversarial attacks, we utilized 40% of the samples in the test set of CIFAR10 and 20% of the test set of SVHN.

Tuning process. For the comparison among the optimizers, the tuning varied per optimizer. More specifically, for Adam and SGD only the learning rate was tuned. For SNOpt, we tuned its learning rate and its regularization term, while the two variations of GTSONO were tuned with respect to the learning rate and the regularization constants for the two sets of weights. We performed a grid search over the hyperparameters of every optimizer. The tuning process took place by optimizing the performance of each optimizer on the clean test set and evaluating its robustness once convergence was observed. Robust overfitting was observed when evaluating the robustness of the models many iterations after convergence, regardless of the optimizer. Table 7 summarizes the hyperparameters evaluated for each optimizer, along with the examined values.

Table 7: Hyperparameters for each optimizer and their examined values
  Adam, SGD:  learning rate in {1e-4, 2.5e-4, 5e-4, 7.5e-4, 1e-3, 2.5e-3, 5e-3, 7.5e-3, 1e-2}
  SNOpt:      learning rate in {1e-4, 2.5e-4, 5e-4, 7.5e-4, 1e-3, 2.5e-3, 5e-3, 7.5e-3, 1e-2};
              regularization constant in {1e-2, 2.5e-2, 5e-2}
  GTSONO:     learning rate in {1e-4, 2.5e-4, 5e-4, 7.5e-4, 1e-3, 2.5e-3, 5e-3, 7.5e-3, 1e-2};
              regularization constants Ru, Rv in {1e-4, 2.5e-4, 5e-4, 1e-3, 2.5e-3, 5e-3, 1e-2, 2.5e-2, 5e-2}

Attacks. Recall that in this study we focused on the white-box ℓ∞-norm Projected Gradient Descent (PGD) and Fast Gradient Sign Method (FGSM) attacks, as well as the gray/black-box CW attack, to assess the robustness of our models. For the ℓ∞ PGD attack, the update rule to generate adversarially attacked examples is

x' ← Π_B∞ ( x' + η1 sign( ∇x L(F(x'), y) ) )

where Π_B∞ is the projection operator onto the ℓ∞ ball. For the FGSM attack, the update rule to generate adversarially attacked examples is

x' = x + η2 sign( ∇x L(F(x), y) )

For the update rule to generate adversarially attacked examples through CW, we refer the interested reader to Carlini & Wagner (2017). A minimal sketch of the PGD update is given below.
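The following is a minimal sketch of the ℓ∞ PGD update above; the classifier, the cross-entropy loss, and the default step sizes are placeholders for the F, L, η1 used in our evaluation, and the projection Π_B∞ is implemented by clamping the perturbation to the ℓ∞ ball of radius ϵ around the clean input (no pixel-range clipping is applied here, since the inputs are standardized).

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=0.03, step=0.007, iters=10):
    """ℓ∞ PGD: repeat x' <- Π_{B∞(x, eps)}(x' + step * sign(∇_x L(model(x'), y)))."""
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + step * grad.sign()
            # projection onto the ℓ∞ ball of radius eps around the clean input
            x_adv = x + torch.clamp(x_adv - x, -eps, eps)
    return x_adv.detach()

# FGSM is the single-step special case: x + eps * sign(∇_x L(model(x), y))
```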
Adversarial Training Methods. For the second round of our experiments, we explore the versatility of GTSONO through its ability to be efficiently adapted to other adversarial training methods, such as the Free Adversarial Training (FreeAT) scheme (Shafahi et al. (2019)) and the TRADES objective function (Zhang et al. (2019b)). In both of these frameworks, SGD was employed as the standard optimizer.

More explicitly, FreeAT proposes training on each minibatch m times to provide strong adversarial examples, while also using the perturbation from the previous minibatch to warm-start the perturbation for the new minibatch. This replay parameter m is the hyperparameter over which we performed our ablation analysis, setting m equal to 4 and 8. Algorithm 2 summarizes the FreeAT method. Our results showed that GTSONO outperformed the benchmark optimizer under this adversarial training method, and did so in fewer epochs, resulting in an overall shorter training time by leveraging the faster convergence of our second-order method, thus rendering it a more efficient and effective option for this adversarial training scheme.

Furthermore, TRADES proposes an objective function that adds the KL divergence between the predictions on natural and adversarial images, weighted by the constant 1/λ; λ is the hyperparameter over which we performed the ablation analysis. Algorithm 3 summarizes the TRADES formulation. It was again demonstrated that GTSONO not only outperformed the benchmark optimizer in terms of raw performance, but was also the more efficient option for this adversarial training method, as it provided superior performance in fewer epochs, reducing the overall training time significantly.

Algorithm 2: Free Adversarial Training
1: Input: samples X ~ D, perturbation bound ϵ, replay steps m, learning rate η
2: Initialize θ, δ ← 0
3: for epoch = 1 .. N
4:   for minibatch B ⊂ X
5:     for i = 1 .. m
6:       g_θ ← E_{(x,y)∈B} [ ∇_θ ℓ(x + δ, y; θ) ]
7:       g_adv ← ∇_x ℓ(x + δ, y; θ)
8:       Update θ: θ ← θ − η g_θ
9:       Update δ: δ ← δ + ϵ · sign(g_adv)
10:      δ ← clip(δ, −ϵ, ϵ)
11:    end for
12:  end for
13: end for

Algorithm 3: TRADES
1: Input: step sizes η1 and η2, batch size b, number of inner iterations m, perturbation bound ϵ, network f_θ with parameters θ
2: repeat
3:   Read minibatch B = {x1, . . . , xb}
4:   for i = 1 .. b
5:     x'_i ← x_i + 0.001 · N(0, I), where N(0, I) is the Gaussian distribution with zero mean and identity covariance
6:     for k = 1 .. m
7:       x'_i ← Π_{B(x_i, ϵ)} ( η1 sign( ∇_{x'_i} L(f_θ(x'_i), f_θ(x_i)) ) + x'_i )
8:     end for
9:   end for
10:  θ ← θ − η2 ∇_θ (1/b) Σ_{i=1}^{b} [ L(f_θ(x_i), y_i) + L(f_θ(x_i), f_θ(x'_i)) / λ ]
11: until convergence

For our experiments, we selected the perturbation distance ϵ to be equal to 0.03 in both ablation studies. As mentioned above, the ablation for FreeAT is with respect to the number of replay iterations m ∈ {4, 8}, whereas in our adaptation of GTSONO to TRADES we set the number of inner iterations to 5 and performed the ablation study with respect to the constant λ. A minimal sketch of the TRADES objective (line 10 of Algorithm 3) is given below.
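The sketch below computes the TRADES objective of line 10 of Algorithm 3, assuming the adversarial examples x_adv have already been produced by the inner maximization (lines 5-8). The function name and default λ are illustrative; any optimizer (SGD in the benchmark, GTSONO in our adaptation) can then take a step on this loss.

```python
import torch
import torch.nn.functional as F

def trades_loss(model, x, x_adv, y, lam=6.0):
    """Natural cross-entropy plus the KL term between predictions on natural
    and adversarial inputs, weighted by 1/λ (line 10 of Algorithm 3)."""
    logits_nat = model(x)
    logits_adv = model(x_adv)
    natural = F.cross_entropy(logits_nat, y)
    robust = F.kl_div(F.log_softmax(logits_adv, dim=1),
                      F.softmax(logits_nat, dim=1),
                      reduction="batchmean")
    return natural + robust / lam

# one optimization step:
#   optimizer.zero_grad(); trades_loss(model, x, x_adv, y).backward(); optimizer.step()
```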
Limitations. As shown in Tables 8-12 below, the robustness achieved by our optimizer comes at the price of a slower per-iteration training time and higher memory consumption. This is attributed to the decomposition and computation of the three matrices involved in the open-loop min-max DDP update, namely Quu, Quv, and Qvv. Although the memory consumption indicated in Tables 8-12 suggests that GTSONO is certainly affordable for current state-of-the-art GPUs, its memory requirements and greater per-iteration training times motivate us to experiment with recasting the proposed optimizer through distributed DDP optimization schemes. Additionally, another shortcoming, observed not only in our method but in neural ODE models in general, is robust overfitting: after convergence was observed, even for stagnant natural accuracy, the robust accuracy decreased as the number of epochs increased. This naturally creates future lines of work, such as experimenting with more network structures and types, as well as with regularization techniques, to attempt to reduce this robust overfitting.

Training Time and Memory Consumption. At this point, we compare the resources required by each optimizer.

1. Optimizer comparison: For this round of experiments, we compared the optimizers on the task of image classification, recording their natural accuracy on the clean test set as well as their robust accuracy under various attacks, such as FGSM and PGD, for various degrees of perturbation. Every optimizer was trained with a batch size of 500 and for 15 epochs. Tables 8 and 9 present the training time and the memory consumed by each optimizer on CIFAR10 and SVHN, respectively.

Table 8: Training time and computational resources for the optimizers on CIFAR10
  Optimizer   Parameters (×10^6)   Total training time (min:sec)   Memory consumption (GB)
  Adam        1.25                 2:53                            2.13
  SGD         1.25                 2:37                            2.13
  SNOpt       1.25                 3:01                            4.58
  GTSONO      2.51                 4:01                            6.71
  C-GTSONO    1.35                 3:53                            6.71

Table 9: Training time and computational resources for the optimizers on SVHN
  Optimizer   Parameters (×10^6)   Total training time (min:sec)   Memory consumption (GB)
  Adam        1.25                 4:13                            2.13
  SGD         1.25                 4:09                            2.13
  SNOpt       1.25                 4:23                            5.50
  GTSONO      2.51                 6:05                            6.71
  C-GTSONO    1.35                 5:22                            6.71

2. Adaptation to adversarial training methods: As mentioned previously, we evaluate the robustness of each optimizer after convergence has been observed. Figure 3 illustrates the faster convergence of GTSONO compared to SGD in both defense schemes. More specifically, taking into consideration the graphs in Figure 3, we list the number of epochs for which each optimizer was trained:
  FreeAT, m = 4:     SGD 15 epochs;  GTSONO 10 epochs
  FreeAT, m = 8:     SGD 40 epochs;  GTSONO 5 epochs
  TRADES, λ = 1/6:   SGD 20 epochs;  GTSONO 10 epochs
  TRADES, λ = 1/10:  SGD 20 epochs;  GTSONO 10 epochs

Table 10: Training time and computational resources for FreeAT, m = 4
  Optimizer   Parameters (×10^6)   Time per iteration (sec)   Total training time (min:sec)   Memory consumption (GB)
  SGD         1.25                 0.53                       13:15                           2.13
  C-GTSONO    1.35                 0.70                       11:39                           6.09

Table 11: Training time and computational resources for FreeAT, m = 8
  Optimizer   Parameters (×10^6)   Time per iteration (sec)   Total training time (min:sec)   Memory consumption (GB)
  SGD         1.25                 0.63                       64:45                           2.13
  C-GTSONO    1.35                 1.10                       8:21                            6.09

Table 12: Training time and memory resources for TRADES
  Optimizer   Parameters (×10^6)   Time per iteration (sec)   Total training time (min:sec)   Memory consumption (GB)
  SGD         1.25                 0.92                       30:30                           3.48
  C-GTSONO    1.35                 1.39                       22:18                           7.61