Neural Piecewise-Constant Delay Differential Equations

Qunxi Zhu,1* Yifei Shen,2* Dongsheng Li,3 Wei Lin1,4
1 Research Institute of Intelligent Complex Systems, State Key Laboratory of Medical Neurobiology and MOE Frontiers Center for Brain Science, Institutes of Brain Science, Fudan University
2 Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong
3 Microsoft Research Asia
4 School of Mathematical Sciences, Shanghai Center for Mathematical Sciences, Center for Computational Systems Biology, and LCNBI, Fudan University
{qxzhu16,wlin}@fudan.edu.cn, yshenaw@connect.ust.hk, dongsheng.li@microsoft.com
*This work was done when the first two authors were interns at Microsoft Research Asia.

Continuous-depth neural networks, such as the Neural Ordinary Differential Equations (ODEs), have attracted a great deal of interest from the machine learning and data science communities in recent years, bridging deep neural networks and dynamical systems. In this article, we introduce a new sort of continuous-depth neural network, called the Neural Piecewise-Constant Delay Differential Equations (PCDDEs). Here, unlike the recently proposed framework of the Neural Delay Differential Equations (DDEs), we transform the single delay into piecewise-constant delay(s). The Neural PCDDEs with such a transformation, on one hand, inherit the strength of universal approximating capability from the Neural DDEs. On the other hand, the Neural PCDDEs, leveraging the information from multiple previous time steps, further promote the modeling capability without augmenting the network dimension. With such a promotion, we show that the Neural PCDDEs outperform several existing continuous-depth neural frameworks on the one-dimensional piecewise-constant delay population dynamics and on real-world datasets, including MNIST, CIFAR10, and SVHN.

Introduction

Recently, many frameworks have been established, connecting dynamical systems tightly with neural networks and promoting the network performance significantly (E 2017; Li et al. 2017; Haber and Ruthotto 2017; Chang et al. 2017; Pathak et al. 2018; Fang, Lin, and Luo 2018; Li and Hao 2018; Lu et al. 2018; E, Han, and Li 2019; Chang et al. 2019; Ruthotto and Haber 2019; Zhang et al. 2019; Zhu, Ma, and Lin 2019; Tang et al. 2020). One milestone framework is the Neural Ordinary Differential Equations (NODEs), also regarded as continuous-depth neural networks (Chen et al. 2018). The framework non-trivially extends the traditional residual neural networks (ResNets) (He et al. 2016) to parametric ordinary differential equations (ODEs), where the time in the NODEs is treated as the depth of the ResNets. Actually, different from the traditional neural networks, the NODEs model the vector fields of the ODEs by optimizing their parameters with the back-propagation algorithm and an ODE solver, based on the training data and a predefined loss function. In addition, the NODEs accommodate a broad range of architectures, including the feed-forward neural networks and the convolutional neural networks. Owing to the constant memory cost, the continuous dynamical behavior, and the naturally-rooted invertibility of the NODEs, applications of such a framework to modeling physical systems are growing.
Examples abound: data analytics on time series with irregular sampling (Rubanova, Chen, and Duvenaud 2019; De Brouwer et al. 2019; Kidger et al. 2020), generation of continuous normalizing flows (Chen et al. 2018; Grathwohl et al. 2018; Finlay et al. 2020; Deng et al. 2020; Kelly et al. 2020), and representation of point clouds (Yang et al. 2019; Rempe et al. 2020). It is worthwhile to mention that the framework of NODEs, in spite of its wide applicability, is not a universal approximator. It thus cannot successfully learn representative maps such as the reflections or the concentric annuli, due to the homeomorphism property of ODE flows (Dupont, Doucet, and Teh 2019; Zhang et al. 2020). To address this problem, several practical schemes (Dupont, Doucet, and Teh 2019; Massaroli et al. 2020; Zhu, Guo, and Lin 2021) have been suggested and implemented, among which the Neural Delay Differential Equations (NDDEs) show an outstanding efficacy in approximating functionals based on given data (Zhu, Guo, and Lin 2021). Additionally, variants and extensions of the NODEs have been proposed in recent years, including the partial differential equations (Han, Jentzen, and Weinan 2018; Ruthotto and Haber 2019; Sun, Zhang, and Schaeffer 2020) and the stochastic differential equations (Liu et al. 2019; Jia and Benson 2019; Li et al. 2020; Song et al. 2020).

In this article, inspired by the recent framework of NDDEs mentioned above, we develop a new framework of continuous-depth neural networks with a different configuration of delay(s). Although the NDDEs not only allow trajectories to intersect with each other even in a lower-dimensional phase space but also accurately model representative delayed physical/biological systems, such as the Mackey-Glass system (Mackey and Glass 1977), they likely suffer from a tremendously high computational cost. Clearly, such a shortcoming is due to the persistent effect induced by the continuously moving delayed argument t − τ. In addition, continuously improving the feature representation of a neural architecture is also a challenging direction of machine learning. To conquer these difficulties, we therefore propose the model mentioned above. Indeed, the proposed model mainly consists of the following two configurations: (1) transforming the dynamical delay t − τ of the NDDEs into a piecewise-constant delay, viz. ⌊t/τ⌋τ, and (2) introducing multiple piecewise-constant delays into the vector field to significantly promote the feature representation. The advantages of the first configuration include preserving computational efficiency thanks to the simple piecewise-constant delay, and maintaining the capability of modeling complex dynamics (e.g., chaos) thanks to the discontinuous nature of this particular form of delay. The advantage of the second configuration is leveraging the information not only from the current time point but also from many previous time points, thus strengthening the feature propagation. All these advantages result in a better feature representation, compared to the NDDEs. Mathematically, our model originates from a well-developed class of delay differential equations, called the piecewise-constant delay differential equations (PCDDEs) (Carvalho and Cooke 1988; Cooke and Wiener 1991; Jayasree and Deo 1992). We therefore refer to our model of continuous-depth neural networks as the neural PCDDEs (NPCDDEs).
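To make the difference between the two delay configurations concrete, the following minimal Python sketch (ours, not part of the paper) evaluates the dynamical delayed argument t − τ used by the NDDEs next to the piecewise-constant delayed argument ⌊t/τ⌋τ used here; the time points and τ = 1 are illustrative choices.

```python
import numpy as np

def ndde_delay_arg(t, tau):
    """Dynamical delayed argument of the NDDEs: it moves continuously with t."""
    return t - tau

def npcdde_delay_arg(t, tau):
    """Piecewise-constant delayed argument of the NPCDDEs: it is frozen at the
    grid point k*tau on each interval [k*tau, (k+1)*tau)."""
    return np.floor(t / tau) * tau

tau = 1.0
for t in [0.0, 0.3, 0.9, 1.0, 1.7, 2.4]:
    print(f"t = {t:.1f}:  t - tau = {ndde_delay_arg(t, tau):+.1f},  "
          f"floor(t/tau)*tau = {npcdde_delay_arg(t, tau):+.1f}")
```

Because the piecewise-constant argument changes only at the grid points kτ, the delayed state has to be read out only at n grid points over [0, nτ], which is the source of the computational savings discussed later in the paper.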
To further improve the performance, we propose an extension of the NPCDDEs that does not share the parameters across different time durations, called the unshared NPCDDEs (UNPCDDEs). To summarize, the major contributions of this article are multi-fold, including:

- establishment of a generic continuous-depth model, the NPCDDEs, such that typical neural networks, such as ResNets and NODEs, are special cases of the UNPCDDEs,
- validation of the capability of universal approximation of the NPCDDEs (see Proposition 2),
- formulation of the adjoint dynamical system and the backward gradients for the NPCDDEs (see Theorem 2), and
- demonstration of the powerful nonlinear representation of the NPCDDEs on synthetic data produced by the one-dimensional piecewise-constant delay population dynamics, as well as on representative image datasets, i.e., MNIST, CIFAR10, and SVHN.

Related Work

Neural Ordinary Differential Equations

As pointed out by (Chen et al. 2018), the NODEs can be regarded as the continuous version of the ResNets having an infinite number of layers (He et al. 2016). The residual block of the ResNets is mathematically written as z_{t+1} = z_t + f(z_t, θ_t), where z_t is the feature at the t-th layer, and f(·,·) is a dimension-preserving, nonlinear function parameterized by a neural network with θ_t, the parameter vector pending for learning. Notably, such a transformation could be viewed as a special case of the following discrete-time equation:

(z_{t+1} − z_t)/Δt = f(z_t, θ_t)   (1)

with Δt = 1. In other words, as Δt in (1) is set as an infinitesimal increment, the ResNets could be regarded as the Euler discretization of the NODEs, which read:

dz(t)/dt = f(z(t), θ).   (2)

Here, the shared parameter vector θ, which unifies the vector θ_t of every layer in Eq. (1), is injected into the vector field across the finite time horizon, to achieve parameter efficiency of the NODEs. As such, the NODEs can be used to approximate some unknown function F : x ↦ F(x). Specifically, the approximation is achieved in the following manner: construct a flow of the NODEs starting from the initial state z(0) = x and ending at the final state z(T) with z(T) ≈ F(x). Thus, a standard framework of the NODEs, which takes the input as its initial state and the feature representation as the final state, is formulated as:

z(T) = z(0) + ∫_0^T f(z(t), θ) dt = ODESolve(z(0), f, 0, T, θ),  z(0) = input,

where T is the final time and the solution of the above ODE can be numerically obtained by a standard ODE solver using adaptive schemes. Indeed, a supervised learning task can be formulated as:

min_θ L(z(T)),  s.t. Eq. (2) holds for any t ∈ [0, T],   (4)

where L(·) is a predefined loss function. To optimize the loss function in (4), we need to calculate the gradient with respect to the parameter vector. This calculation can be implemented with a memory cost of order O(1) by employing the adjoint sensitivity method (Chen et al. 2018; Pontryagin 1987) as:

dL/dθ = ∫_0^T a(t)^⊤ (∂f(z(t), θ)/∂θ) dt,  da(t)/dt = −a(t)^⊤ (∂f(z(t), θ)/∂z(t)),   (5)

where a(t) := ∂L/∂z(t) is called the adjoint, representing the gradient of the loss with respect to the hidden state z(t) at each time point t.

Variants of NODEs

As shown in (Dupont, Doucet, and Teh 2019), there are still some typical classes of functions that the NODEs cannot represent. For instance, the reflection, defined by g1d : R → R with g1d(1) = −1 and g1d(−1) = 1, and the concentric annuli, defined by g2d : R^2 → R with

g2d(x) = −1, if ‖x‖ ≤ r1;  g2d(x) = 1, if r2 ≤ ‖x‖ ≤ r3,   (6)

where ‖·‖ is the L2 norm and 0 < r1 < r2 < r3.
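To make these two counterexamples concrete, here is a small sketch (ours; the radii r1 = 1, r2 = 2, r3 = 3 and the sample size are illustrative, not values fixed by the authors) that builds a labeled dataset for the concentric-annuli map g2d of (6), together with the reflection g1d:

```python
import numpy as np

rng = np.random.default_rng(0)

def g1d(x):
    """Reflection map: g1d(1) = -1 and g1d(-1) = 1."""
    return -x

def g2d(x, r1=1.0, r2=2.0, r3=3.0):
    """Concentric-annuli labels: -1 inside the inner disk, +1 inside the outer ring.
    Points falling in the gap (r1, r2) are unlabeled and later discarded."""
    r = np.linalg.norm(x, axis=-1)
    labels = np.full(r.shape, np.nan)
    labels[r <= r1] = -1.0
    labels[(r >= r2) & (r <= r3)] = 1.0
    return labels

# Sample 2-d points uniformly in the disk of radius r3 and keep only labeled ones.
n = 4000
angle = rng.uniform(0.0, 2.0 * np.pi, n)
radius = 3.0 * np.sqrt(rng.uniform(0.0, 1.0, n))
points = np.stack([radius * np.cos(angle), radius * np.sin(angle)], axis=-1)
labels = g2d(points)
mask = ~np.isnan(labels)
points, labels = points[mask], labels[mask]
print(points.shape, np.unique(labels, return_counts=True))
```

A continuous ODE flow cannot send the inner disk to −1 and the surrounding annulus to +1 without trajectories of the two regions crossing, which is exactly the homeomorphism obstruction discussed next.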
Such successful constructions of the two counterexamples are attributed to the fact that the feature mapping from the input (i.e., the initial state) to the features (i.e., the final state) induced by the NODEs is a homeomorphism. Thus, the features always preserve the topology of the input domain, which mathematically makes it impossible to separate the two connected regions in (6). A few practical strategies have been proposed to address this problem. For example, creatively proposed in (Dupont, Doucet, and Teh 2019) was an augmentation of the input domain into a higher-dimensional space, which allows more complicated dynamics to emerge in the Augmented NODEs. Very recently, articulated in (Zhu, Guo, and Lin 2021) was a novel framework of the NDDEs that addresses this issue without augmentation. Actually, such a framework was inspired by a broader class of functional differential equations, named delay differential equations (DDEs), where a time delay is introduced (Erneux 2009). For example, a simple form of the NDDEs reads:

dz(t)/dt = f(z(t − τ), θ),  t ∈ [0, T],   z(t) = ϕ(t) = x,  t ∈ [−τ, 0],   (7)

where τ > 0 is the time delay and ϕ(t) is the initial function. Hereafter, we assume ϕ(t) to be a constant function, i.e., ϕ(t) ≡ x with input x. Due to the infinite-dimensional nature of the NDDEs, crossing orbits can exist in a lower-dimensional phase space. More significantly, as demonstrated in (Zhu, Guo, and Lin 2021), the NDDEs have the capability of universal approximation with T = τ in (7).

Control Theory

Training a continuous-depth neural network can be regarded as the task of solving an optimal control problem with a predefined loss function, where the parameters in the network act as the controller (Pontryagin 1987; Chen et al. 2018; E, Han, and Li 2019). Thus, developing a new sort of continuous-depth neural network is intrinsically equivalent to designing an effective controller. Such a controller could be in an open-loop or a closed-loop form. Therefore, from the viewpoint of control, all the existing continuous-depth neural networks can be addressed as control problems; however, these problems require different forms of controllers. Specifically, when we consider the continuous-depth neural network dx(t)/dt = f(x(t), u(t), t), u(t) is regarded as a controller. For example, u(t) treated as constant parameters yields the network framework proposed in (Chen et al. 2018), u(t) as a data-driven controller yields the framework in (Massaroli et al. 2020), and u(t) in other forms of controllers brings more fruitful network structures (Chalvidal et al. 2020; Li et al. 2020; Kidger et al. 2020; Zhu, Guo, and Lin 2021). Here, the mission of this work is to design a delayed feedback controller that renders a continuous-depth neural network more effective in coping with synthetic and/or real-world datasets.

Neural Piecewise-Constant Delay Differential Equations

In this section, we propose a new framework of continuous-depth neural networks with delay (i.e., the NPCDDEs) by an articulated integration of tools from machine learning and dynamical systems: the NDDEs and the piecewise-constant DDEs (Carvalho and Cooke 1988; Cooke and Wiener 1991; Jayasree and Deo 1992). We first transform the delay of the NDDEs in (7) into a piecewise-constant form (Carvalho and Cooke 1988; Cooke and Wiener 1991; Jayasree and Deo 1992), so that we have

dz(t)/dt = f(z(⌊t/τ⌋τ), θ),  t ∈ [0, T],  z(0) = x,   (8)

where the final time T = nτ and n is supposed to be a positive integer hereafter.
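Because the right-hand side of (8) is frozen at the grid state z(kτ) on each interval [kτ, kτ + τ], the flow can be advanced exactly from grid point to grid point. A minimal sketch of this forward pass (ours; the affine vector field below is the toy choice of (10) further down, and the numbers are illustrative) is:

```python
import numpy as np

def npcdde_forward(f, x, theta, tau=1.0, n=2):
    """Forward pass of the simple NPCDDE (8): dz/dt = f(z(floor(t/tau)*tau), theta).

    Because the vector field is constant on each interval [k*tau, k*tau + tau],
    the exact solution is the discrete iteration
        z((k+1)*tau) = z(k*tau) + tau * f(z(k*tau), theta).
    """
    z = np.asarray(x, dtype=float)
    for _ in range(n):            # final time T = n * tau
        z = z + tau * f(z, theta)
    return z

def f_affine(z, theta):
    """Toy affine vector field: f(z, theta) = a*z + b."""
    a, b = theta
    return a * z + b

x = np.array([0.5])
print(npcdde_forward(f_affine, x, theta=(3.0, 0.0), tau=1.0, n=2))  # (1+a)^2 * x = 16 * 0.5
```

With a = c − 1 and b = 0, two such steps realize x ↦ c²x exactly, which is the structural advantage exploited in the example discussed next.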
We note that the NPCDDEs in (8) with T = τ are exactly the NDDEs in (7), owning the capability of universal approximation mentioned before. As the vector field of the NPCDDEs in (8) is constant on each interval [kτ, kτ + τ] for k = 0, 1, ..., T/τ − 1, the simple NPCDDEs in (8) can be treated as a discrete-time dynamical system:

z((k + 1)τ) = z(kτ) + τ f(z(kτ), θ) =: F̂(z(kτ), θ).   (9)

Actually, this iterative property of dynamical systems enables the NPCDDEs in (8) to learn functions with certain structures more effectively. For example, suppose the map F(x) = c²x with a large real number c > 0 is pending for learning and the vector field is set as

f(z(⌊t/τ⌋τ), θ) := a z(⌊t/τ⌋τ) + b   (10)

with τ = 1 and the initial parameters a = b = 0 before training. Then we only need to use T = 2τ as the final time for the NPCDDEs in (8) and require z(τ) to learn the linear function x ↦ cx with the smaller coefficient c (or, equivalently, require f to learn x ↦ (c − 1)x). As such, the feature z(T) ≈ c z(τ) ≈ c²x naturally approximates the preset function F(x), because F(x) can be simply represented as two iterations of the function F̂(x) = cx, i.e., F̂ ∘ F̂(x) = F(x). We experimentally show this structural representation power in Fig. 1, where the training loss of the NPCDDEs in (8) with T = 2τ decreases faster than that with only T = τ. Given the above example, the following question arises naturally: for any given function x ↦ F(x), does there exist a function x ↦ F̂(x) such that the functional equation

F̂(F̂(x)) = F(x)   (11)

holds? Unfortunately, the answer is no in general, which is rigorously stated in the following proposition.

Figure 1: The training processes for fitting the function F(x) = 16x using the NPCDDEs in (8), respectively, with the final times T = τ and T = 2τ: the training losses (left) and the evolution of the two parameters a (middle) and b (right), as defined in (10), during the training processes.

Proposition 1 (Radovanovic 2007) There does not exist any function f : R → R such that f(f(x)) = x² − 2 for all x ∈ R.

As shown in Proposition 1, although the iterative property of the NPCDDEs in (8) allows effective learning of functions with certain structures, a solution of the functional equation (11) does not always exist. This implies that (8) cannot represent a wide class of functions (Rice, Schweizer, and Sklar 1980; Chiescu and Gdea 2011). To further elaborate this point, we use T = τ and T = 2τ, respectively, for the NPCDDEs in (8) to model the function g2d(x) defined in (6). Clearly, Fig. 2 shows that the training processes for fitting the concentric annuli using (8) with the two final times are different. Contrary to the preceding example, the training loss with T = τ decreases much faster than that with T = 2τ.

In order to carry the capability of universal approximation over from the NDDEs to the current framework, we modify the NPCDDEs in (8) by adding a skip connection from time 0 to the final time 2τ, in the following manner:

dz(t)/dt = f(z(⌊t/τ⌋τ − τ), θ),  t ∈ [0, 2τ],  z(−τ) = z(0) = x.   (12)

As can be seen from Fig. 2, the training loss of the modified NPCDDEs in (12) decreases outstandingly faster than that of the NPCDDEs in (8) with T = 2τ and that of the NODEs. Also, it is slightly faster than the one with T = τ. Moreover, the dynamical behaviors of the feature spaces during the training processes using the different neural frameworks are shown in Fig. 3. In particular, the NPCDDEs in (12) are the first among these models to separate the two clusters, at the 3rd training epoch, which is beyond the ability of the baselines.
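The role of the skip connection in (12) can be checked in a few lines (ours; f_example is an arbitrary placeholder network): since z(−τ) = z(0) = x, the delayed argument ⌊t/τ⌋τ − τ points to a state equal to x on the whole of [0, 2τ], so the vector field stays at f(x) and one forward pass collapses to z(2τ) = x + 2τ f(x).

```python
import numpy as np

def modified_npcdde_forward(f, x, theta, tau=0.5):
    """Forward pass of the modified NPCDDE (12) on [0, 2*tau].

    History: z(-tau) = z(0) = x.  On [0, tau) the delayed argument is z(-tau) = x
    and on [tau, 2*tau) it is z(0) = x, so the field equals f(x) throughout and
    the exact flow is z(2*tau) = x + 2*tau * f(x).
    """
    z = np.asarray(x, dtype=float)
    delayed_state = z.copy()          # equals x on both intervals
    for _ in range(2):                # two intervals of length tau
        z = z + tau * f(delayed_state, theta)
    return z

def f_example(z, theta):
    # arbitrary placeholder vector field
    return np.tanh(theta @ z)

rng = np.random.default_rng(1)
theta = rng.standard_normal((3, 3))
x = rng.standard_normal(3)
tau = 0.5
out = modified_npcdde_forward(f_example, x, theta, tau)
print(np.allclose(out, x + 2 * tau * f_example(x, theta)))   # True
```

Hence, whenever f can approximate G(x) = [F(x) − x]/(2τ), the output z(2τ) approximates F(x); this is exactly the mechanism behind Theorem 1 below.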
More importantly, the following theorem demonstrates that the NPCDDEs in (12) are universal approximators; its proof is provided in the supplementary material.

Theorem 1 (Universal approximation of the NPCDDEs in (12)) Consider the n-dimensional NPCDDEs in (12). If, for any given function F : R^n → R^n, there exists a neural network g(x, θ) that can approximate the map G(x) = [F(x) − x]/(2τ), then the NPCDDEs in (12) can learn the map x ↦ F(x). In other words, we have z(T) ≈ F(x), provided that both initial states z(−τ) and z(0) are set as x, the input.

Notice that, for the NPCDDEs in (8) and the modified NPCDDEs in (12), the vector fields keep constant over each period of length τ. More generally, we can extend these models by adding the dependency on the current state, enlarging the value of the final time, and introducing more skip connections from the previous times to the current time. As such, a more generic framework of the NPCDDEs reads:

dz(t)/dt = f(z(t), z(⌊t/τ⌋τ), z(⌊t/τ⌋τ − τ), ..., z(⌊t/τ⌋τ − nτ), θ),  t ∈ [0, T],
z(−nτ) = ⋯ = z(−τ) = z(0) = x,   (13)

where T = nτ with n being a positive integer. Analogous to the proof of Theorem 1, the universal approximation of the NPCDDEs in (13) can be validated (see Proposition 2).

Proposition 2 The NPCDDEs in (13) have the capability of universal approximation.

Figure 2: The training processes for fitting the function g2d(x). (a) The training losses, respectively, using the NODEs, the NPCDDEs in (8) with n = 1 and τ = 1, the NPCDDEs in (8) with n = 2 and τ = 0.5, and the special NPCDDEs in (12) with τ = 0.5. (b) A part of the training dataset for visualization. (c)-(f) The flows mapping from the initial states to the target states, respectively, by the NODEs (c), the NPCDDEs in (8) with n = 2 and τ = 0.5 (d), the NPCDDEs in (8) with n = 1 and τ = 1 (e), and the special NPCDDEs in (12) with τ = 0.5 (f). The red (resp. blue) points and the yellow (resp. cyan) points are the initial states and the final states of all the flows, respectively.

To further improve the modeling capability of the NPCDDEs, we propose an extension of the NPCDDEs that does not share the parameters, which reads:

dz(t)/dt = f(z(t), z(⌊t/τ⌋τ), z(⌊t/τ⌋τ − τ), ..., z(0), θ_k),  t ∈ [kτ, kτ + τ],  z(0) = x,   (14)

where θ_k is the parameter vector used in the time interval [kτ, kτ + τ] for k = 0, 1, ..., n − 1. For simplicity, we name such a model the unshared NPCDDEs (UNPCDDEs). As in the ResNets (1), a typical neural network, the parameters in each layer are independent of those in the other layers. Moreover, the gradients of the loss with respect to the parameters of the UNPCDDEs in (14) are given in Theorem 2, whose proof is provided in the supplementary material. In addition, setting θ_k ≡ θ in Theorem 2 straightforwardly enables us to compute the gradients of the NPCDDEs in (13).

Theorem 2 (Backward gradients of the UNPCDDEs in (14)) Consider the loss function L(z(T)) with the final time T = nτ.
Thus, we have

dL/dθ_k = ∫_{kτ}^{kτ+τ} a(t)^⊤ (∂f/∂θ_k) dt,   (15)

where the dynamics of the adjoint are specified as:

da(t)/dt = −a(t)^⊤ ∂f/∂z(t),  t ∈ [kτ, kτ + τ],
a(lτ) ← a(lτ) + ∫_{kτ}^{kτ+τ} a(t)^⊤ (∂f/∂z(lτ)) dt,  l = 0, 1, ..., k,   (16)

with the backward initial condition a(T) = ∂L(z(T))/∂z(T), and the intervals are traversed backward for k = n − 1, n − 2, ..., 0.

We note that in (16), due to the skip connections, analogous to the DenseNets (Huang et al. 2017), the gradients are accumulated from multiple paths through the reversed skip connections in the backward direction, which likely renders the parameters sufficiently optimized. Additionally, if the loss function depends on the states at different time points, viz. a new loss function L(z(t_0), z(t_1), ..., z(t_N)), we need to update the adjoint state instantly in the backward direction by adding the partial derivative of the loss at each observational time point, viz. a(t_i) ← a(t_i) + ∂L/∂z(t_i). For the specific tasks of classification and regression, refer to the section of Experiments.

Figure 3: The dynamical behaviors of the feature spaces during the training processes (totally 6 epochs, from the left column to the right column) for fitting g2d(x) using different models: the NODEs (top row), the NPCDDEs in (8) with n = 1 and τ = 1 (second row), the NPCDDEs in (8) with n = 2 and τ = 0.5 (third row), and the special NPCDDEs in (12) with τ = 0.5 (bottom row).

Major Properties of NPCDDEs

The NPCDDEs in (13) and the UNPCDDEs in (14) generalize both the ResNets and the NODEs. Also, they have strong connections with the Augmented NODEs. Moreover, the discontinuous nature of the NPCDDEs enables us to model complex dynamics beyond the NODEs, the Augmented NODEs, and the NDDEs. Lastly, the NPCDDEs are shown to enjoy advantages in computation over the NDDEs. In the sequel, we discuss these properties.

Both the ResNets and the NODEs are Special Cases of the UNPCDDEs in (14)

We emphasize that any dimension-preserving neural network (multi-layer residual blocks) is a special case of the UNPCDDEs. Actually, one can enforce z(t), z(⌊t/τ⌋τ − τ), z(⌊t/τ⌋τ − 2τ), ..., z(0) to be dummy variables in the vector field of (14) by assigning the weights connected to these variables to be zero, except for the variable z(⌊t/τ⌋τ). Moreover, letting τ = 1 results in the very simple unshared NPCDDEs:

dz(t)/dt = f(z(k), θ_k),  t ∈ [k, k + 1],  z(0) = x.   (17)

Since the vector field of (17) keeps constant in each interval [k, k + 1], we have

z(k + 1) = z(k) + f(z(k), θ_k),  z(0) = x,   (18)

which is exactly the form of the ResNets in (1). In addition, if we let z(⌊t/τ⌋τ), z(⌊t/τ⌋τ − τ), ..., z(0) be dummy variables in the vector field of (14) and set θ_k ≡ θ, the UNPCDDEs in (14) indeed become the typical NODEs. Interestingly, though the NODEs are inspired by the ResNets, they are not equivalent to each other because of the limited modeling capability of the NODEs; the UNPCDDEs in (14) provide a more general form covering both.

Connection to Augmented NODEs

The NPCDDEs in (13) can be viewed as a particular form of the Augmented NODEs:

dz(t)/dt = f(z(t), z_0(t), z_1(t), ..., z_n(t), θ),  t ∈ [0, T],
dz_0(t)/dt = 0,  z_0(t) = z(⌊t/τ⌋τ),
  ⋮
dz_n(t)/dt = 0,  z_n(t) = z(⌊t/τ⌋τ − nτ),
z(−nτ) = ⋯ = z(−τ) = z(0) = x.   (19)

Hence, we can apply the framework of the NODEs to cope with the NPCDDEs by solving the Augmented NODEs in (19). It is worthwhile to emphasize that the Augmented NODEs in (19) are not trivially equivalent to the traditional Augmented NODEs developed in (Dupont, Doucet, and Teh 2019).
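To illustrate how (19) lets standard ODE machinery handle the NPCDDEs, the following sketch (ours; a plain fixed-step RK4 routine stands in for whatever solver one prefers, and f_linear is an arbitrary placeholder field) integrates (13) interval by interval, holding the delayed grid states fixed within each interval and refreshing them only at the grid points kτ:

```python
import numpy as np

def rk4_step(g, z, dt):
    """One classical Runge-Kutta step for dz/dt = g(z)."""
    k1 = g(z)
    k2 = g(z + 0.5 * dt * k1)
    k3 = g(z + 0.5 * dt * k2)
    k4 = g(z + dt * k3)
    return z + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def npcdde_generic_forward(f, x, theta, tau=1.0, n=2, steps_per_interval=10):
    """Forward pass of the generic NPCDDE (13) via its augmented form (19).

    history[j] stores z(k*tau - j*tau) for the current interval; entries with
    negative time indices equal the constant initial history x.
    """
    x = np.asarray(x, dtype=float)
    grid_states = [x]                       # z(0), z(tau), z(2*tau), ...
    z = x.copy()
    dt = tau / steps_per_interval
    for k in range(n):                      # interval [k*tau, k*tau + tau]
        history = [grid_states[max(k - j, 0)] for j in range(n + 1)]
        g = lambda state: f(state, history, theta)
        for _ in range(steps_per_interval):
            z = rk4_step(g, z, dt)
        grid_states.append(z.copy())        # refresh the delayed states at (k+1)*tau
    return z

def f_linear(z, history, theta):
    """Placeholder field: a linear read-out of the current and delayed states."""
    W, V = theta
    return W @ z + V @ np.mean(history, axis=0)

rng = np.random.default_rng(2)
d, n = 3, 2
theta = (0.1 * rng.standard_normal((d, d)), 0.1 * rng.standard_normal((d, d)))
print(npcdde_generic_forward(f_linear, rng.standard_normal(d), theta, tau=1.0, n=n))
```

Training such a model then proceeds either by backpropagating through the solver steps or by the adjoint recursion of Theorem 2.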
In fact, the dynamics of z_i(t) in (19) are piecewise-constant (but z(t) itself is continuous) and thus discontinuous at each time instant kτ, while the traditional Augmented NODEs still belong to the framework of NODEs, whose dynamics evolve continuously. The benefits of discontinuity are specified in the following.

Discontinuity of the Piecewise-Constant Delay(s)

Notice that the floor function ⌊·⌋ used in the piecewise-constant delay(s) is a discontinuous function, which makes the first-order derivative of the solution discontinuous at each key time point (i.e., at every integer multiple of the time delay). This characteristic overcomes a severe limitation, namely the homeomorphism/continuity of the trajectories produced by the NODEs, and thus enhances the flexibility of the NPCDDEs in handling plenty of complex dynamics (e.g., jumping derivatives and chaos evolving in a lower-dimensional space). We will validate this advantage in the section of Experiments. Additionally, the simple Euler scheme for the ODEs in (1) is actually a special PCDDE, dz(t)/dt = f(z(⌊t/Δt⌋Δt)), with Δt the step size (Cooke and Wiener 1991). Based on this discontinuous nature, the approximation of the DDEs by the PCDDEs has been validated in (Cooke and Wiener 1991). Finally, such discontinuous settings can be seen as typical forms of the discontinuous control strategies that are frequently used in control problems (Evans 1983; Lewis, Vrabie, and Syrmos 2012). Actually, discontinuous control strategies can bring benefits in time and energy consumption (Sun et al. 2017).

Figure 4: Sketches of different kinds of continuous-depth neural networks, including the NODEs, the NDDEs, and our newly proposed framework, the NPCDDEs. Specifically, ϕ(t) ≡ z(0), a constant function, is the initial function for the NDDEs. For the NPCDDEs in (13), at each time point in the interval [kτ, kτ + τ], the delayed time arguments are unaltered, different from the dynamical delay in the NDDEs.

Computation Advantages of NPCDDEs over NDDEs

For solving the conventional NDDEs in (7), we need to recompute the delayed states in time using an appropriate ODE solver (Zhu, Guo, and Lin 2021), which requires a memory of O(n) and a computation of O(nK), where K is the adaptive depth of the ODE solver. On the contrary, for the NPCDDEs in (13) and the unshared NPCDDEs in (14), the delayed states are constant within each interval, so such recomputation is not needed. As a result, for the NPCDDEs (or the UNPCDDEs), the memory and computational costs are approximately of orders O(n) and O(K), respectively. Thus, the computational cost of the NPCDDEs is cheaper than that of the NDDEs.

Experiments

Population Dynamics: One-Dimensional PCDDE

We consider a one-dimensional PCDDE, which reads:

dx(t)/dt = a x(t)(1 − x(⌊t⌋)),  x(0) = x_0 ≥ 0,   (20)

where the growth parameter a > 0 (Carvalho and Cooke 1988; Cooke and Wiener 1991). The PCDDE (20) is analogous to the well-known, first-order nonlinear logistic differential equation in one dimension, which describes the growth dynamics of a single population and can be written as:

dx(t)/dt = a x(t)(1 − x(t)),  x(0) = x_0 ≥ 0.   (21)

Clearly, replacing the term 1 − x(t) in the vector field of (21) by the term 1 − x(⌊t⌋) results in the vector field of (20). For each given a > 0 and x_0 ≥ 0, if we consider the state x(t) at the integer time instants t = 0, 1, 2, ..., the corresponding discrete sequence x(0), x(1), x(2), ... satisfies the following discrete dynamical system:

x(t + 1) = x(t) e^{a(1 − x(t))},  t = 0, 1, 2, ....   (22)

Thus, we study the function

f_a(x) = x e^{a(1 − x)},  x ∈ [0, ∞).   (23)

Direct computation indicates that the function f_a(·) in (23) is a C¹ unimodal map on [0, ∞), attaining its maximal value f_a(1/a) at x* = 1/a. Thus, [0, 1/a] is a strictly increasing regime of this function, while [1/a, ∞) is a strictly decreasing regime. As pointed out in (Carvalho and Cooke 1988; Cooke and Wiener 1991), the discrete dynamical system (22) can exhibit complex dynamics, including chaos. More precisely, at a* = 3.11670..., the solution of (22) with the initial value x(0) = x_0 = 1/a* is periodic and asymptotically stable with period three, so that f_{a*} ∘ f_{a*} ∘ f_{a*}(x_0) = x_0. This further implies that the map with the adjustable parameter a admits period-doubling bifurcations and thus has chaotic dynamics, according to the well-known Sharkovskii theorem (Li and Yorke 1975; Carvalho and Cooke 1988; Cooke and Wiener 1991). Moreover, since the discrete dynamical system (22) can be regarded as the system obtained by sampling the original PCDDE (20) at the integer time instants, this PCDDE exhibits chaos as well for a in the vicinity of a*.
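As a side note on data generation (ours; the initial value x0 = 0.5 is an illustrative choice, while the parameters a = 2.0 and a = 3.2, the interval [0, 3], and the sampling period 0.1 follow the experimental setup described below): on each interval [k, k + 1], (20) is linear in x(t) with the frozen factor 1 − x(k), so trajectories admit the closed form x(t) = x(k) exp(a(1 − x(k))(t − k)), which reproduces the map (22) at the integer instants.

```python
import numpy as np

def pcdde_trajectory(a, x0, t_grid):
    """Exact trajectory of the piecewise-constant delay logistic model (20).

    On [k, k+1] the equation reads dx/dt = a*x(t)*(1 - x(k)), hence
    x(t) = x(k) * exp(a * (1 - x(k)) * (t - k)).
    """
    x = np.empty_like(t_grid, dtype=float)
    xk, k = float(x0), 0
    for i, t in enumerate(t_grid):
        while t >= k + 1:                     # advance the integer grid state x(k)
            xk = xk * np.exp(a * (1.0 - xk))  # exactly the map (22)
            k += 1
        x[i] = xk * np.exp(a * (1.0 - xk) * (t - k))
    return x

t_grid = np.arange(0.0, 3.0 + 1e-9, 0.1)      # sampling period 0.1, as in the training data
oscillatory = pcdde_trajectory(a=2.0, x0=0.5, t_grid=t_grid)
chaotic = pcdde_trajectory(a=3.2, x0=0.5, t_grid=t_grid)
print(oscillatory[:5], chaotic[-5:])
```

The chaotic regime at a = 3.2 is what makes long-term prediction hard in the experiments that follow.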
We thereby test the NODEs, the NDDEs, the NPCDDEs, and the Augmented NODEs on the piecewise-constant delay population dynamics (20), respectively, with a = 2.0 and a = 3.2, which correspond to the two regimes of oscillation and chaos. Moreover, as can be seen from Fig. 5, the training losses and the test losses of the NPCDDEs decrease significantly, compared to those of the other models. Additionally, in the oscillation regime, the losses of the NPCDDEs approach a very low level in both the training and test stages, while in the chaos regime, the NPCDDEs achieve accurate short-term prediction. Naturally, it is hard to achieve long-term prediction because of the sensitive dependence on initial conditions in a chaotic system. Here, for training, we produce 100 time series from different initial states over the time interval [0, 3] with 0.1 as the sampling period. Then, still with 0.1 as the sampling period, we use the final states of the training data as the initial states of the 100 test time series over the next time interval [3, 13]. More specific configurations of our numerical experiments are provided in the supplementary material.

Figure 5: The training losses and the test losses on the piecewise-constant delay population dynamics (20) with the growth parameter a = 2.0 (the oscillation regime, top) and a = 3.2 (the chaos regime, bottom), respectively, using the NPCDDEs, the NDDEs, the NODEs, and the Augmented NODEs (where the augmented dimension equals 1). The panels in the first column depict the training losses. The panels from the second column to the fifth column depict the test losses over time intervals of lengths 1, 2, 5, and 10, respectively.

Image Datasets

We conduct experiments on several image datasets, including MNIST, CIFAR10, and SVHN, using the (unshared) NPCDDEs and the other baselines. In the experiments, we follow the setup of (Zhu, Guo, and Lin 2021). For a fair comparison, we construct all models without augmenting the input space, and for the NDDEs, we assume that the initial function keeps constant (i.e., the initial function ϕ(t) = input for t ≤ 0), which is different from the initial function used for the NDDEs in (Zhu, Guo, and Lin 2021). We note that our models are orthogonal to these baselines, since one can also augment the input space and model the initial state as the feature of an NODE within the framework of the NPCDDEs. Additionally, the numbers of parameters of all models are almost the same (84k parameters for MNIST, and 107k parameters for CIFAR10 and SVHN). Notably, the vector fields of all the models are parameterized with convolutional architectures (Dupont, Doucet, and Teh 2019; Zhu, Guo, and Lin 2021), where the arguments appearing in the vector fields are concatenated and then fed into the convolutional neural networks (CNNs). For example, for the NDDEs, the vector field is f(concat(z(t), z(t − τ)), θ), where concat(·,·) is a concatenation operator for two tensors along the channel dimension. Moreover, the initial states of these models are just the images from the datasets.
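As a concrete illustration of this concatenation-based parameterization, here is a minimal PyTorch sketch (ours; the channel counts, kernel sizes, and hidden width are illustrative and not the paper's exact configuration) of a dimension-preserving convolutional vector field that takes the current state together with the piecewise-constant delayed states:

```python
import torch
import torch.nn as nn

class ConcatConvVectorField(nn.Module):
    """Vector field f(concat(z(t), delayed states), theta) for an NPCDDE/NDDE.

    All arguments share the spatial shape of the image feature map and are
    concatenated along the channel dimension before the convolutions.
    """
    def __init__(self, channels=3, n_delays=2, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels * (n_delays + 1), hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, channels, kernel_size=3, padding=1),
        )

    def forward(self, z_t, delayed_states):
        # delayed_states: list of tensors z(floor(t/tau)*tau - k*tau)
        return self.net(torch.cat([z_t, *delayed_states], dim=1))

# Example: a CIFAR10-sized feature map with two piecewise-constant delayed states.
f = ConcatConvVectorField(channels=3, n_delays=2)
z = torch.randn(8, 3, 32, 32)
print(f(z, [z, z]).shape)   # torch.Size([8, 3, 32, 32])
```

The same module serves the NDDE case with a single delayed argument and the NPCDDE/UNPCDDE cases with several, by changing only the number of concatenated states.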
It is observed that our models outperform the baselines on these datasets. The detailed test accuracies are reported in Table 1. For the specific training configurations of all the models and for more experiments equipped with augmentation (Dupont, Doucet, and Teh 2019), please refer to the supplementary material.

Table 1: The test accuracies, with their standard deviations over 5 realizations, of the different models on the image datasets. In the first column, the integer i in NPCDDEi or UNPCDDEi means that n = i for the NPCDDEs in (13) or for the UNPCDDEs in (14). The results for the NODEs and the NDDEs are those reported in (Zhu, Guo, and Lin 2021). The final time T for all models is set to 1.

| Model    | CIFAR10       | MNIST         | SVHN          |
|----------|---------------|---------------|---------------|
| NODE     | 53.92% ± 0.67 | 96.21% ± 0.66 | 80.66% ± 0.56 |
| NDDE     | 55.69% ± 0.39 | 96.22% ± 0.55 | 81.49% ± 0.09 |
| NPCDDE2  | 56.03% ± 0.25 | 97.32% ± 0.30 | 82.63% ± 0.36 |
| UNPCDDE2 | 56.22% ± 0.42 | 97.43% ± 0.13 | 82.99% ± 0.23 |
| NPCDDE3  | 56.34% ± 0.51 | 97.34% ± 0.10 | 82.38% ± 0.35 |
| UNPCDDE3 | 56.09% ± 0.37 | 97.52% ± 0.14 | 83.19% ± 0.32 |
| NPCDDE5  | 56.59% ± 0.44 | 97.40% ± 0.19 | 82.62% ± 0.69 |
| UNPCDDE5 | 56.73% ± 0.54 | 97.69% ± 0.13 | 83.45% ± 0.38 |

Discussion

As shown above, the NPCDDEs achieve good performance not only on the one-dimensional PCDDE example but also on the image datasets. However, the NPCDDEs are not a perfect framework and still have some limitations. Here, we suggest several directions for future study, including: 1) for an NPCDDE, seeking a good strategy to determine the number of skip connections and the specific value of each delay for different tasks, 2) applying the NPCDDEs to other suitable real-world datasets, such as time series with piecewise-constant delay effects, 3) providing more analytical results for the NPCDDEs to guarantee stability and robustness, and 4) leveraging the optimal control theory (Pontryagin 1987) for dynamical systems to further promote the performance of neural networks.

In this article, we have articulated a framework of the NPCDDEs, which is mainly inspired by several previous frameworks, including the NODEs, the NDDEs, and the PCDDEs. The NPCDDEs own not only the provable capability of universal approximation but also an outstanding power of nonlinear representation. Also, we have derived the backward gradients along with the adjoint dynamics for the NPCDDEs. We have emphasized that both the ResNets and the NODEs are special cases of the NPCDDEs and that the NPCDDEs therefore constitute a more general framework compared to the existing models. Finally, we have demonstrated that the NPCDDEs outperform several existing frameworks on representative image datasets (MNIST, CIFAR10, and SVHN).
All these suggest that integrating the elements of dynamical systems with different kinds of neural networks is indeed beneficial to creating and promoting frameworks of deep learning with continuous-depth structures.

Acknowledgments

We thank the anonymous reviewers for their valuable and constructive comments that helped us to improve the work. To whom correspondence should be addressed: Q.Z. and W.L., https://faculty.fudan.edu.cn/wlin/zh_CN/index.htm. Q.Z. is supported by the STCSM (No. 21511100200) and by the Shanghai Postdoctoral Excellence Program (No. 2021091). W.L. is supported by the National Key R&D Program of China (No. 2018YFC0116600), by the National Natural Science Foundation of China (Nos. 11925103 and 61773125), and by the STCSM (Nos. 19511132000, 19511101404, and 2021SHZDZX0103).

References

Carvalho, L. A. V.; and Cooke, K. L. 1988. A nonlinear equation with piecewise continuous argument. Differential Integral Equations, 1(3): 359–367.
Chalvidal, M.; Ricci, M.; VanRullen, R.; and Serre, T. 2020. Go with the Flow: Adaptive Control for Neural ODEs. arXiv preprint arXiv:2006.09545.
Chang, B.; Chen, M.; Haber, E.; and Chi, E. H. 2019. AntisymmetricRNN: A dynamical system view on recurrent neural networks. arXiv preprint arXiv:1902.09689.
Chang, B.; Meng, L.; Haber, E.; Tung, F.; and Begert, D. 2017. Multi-level residual networks from dynamical systems view. arXiv preprint arXiv:1710.10348.
Chen, R. T.; Rubanova, Y.; Bettencourt, J.; and Duvenaud, D. K. 2018. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, 6571–6583.
Chiescu, I.; and Gdea, T. 2011. Solution of the functional equation f∘f = g for non-injective g. UPB Scientific Bulletin, Series A: Applied Mathematics and Physics, 73(2): 756–758.
Cooke, K. L.; and Wiener, J. 1991. A survey of differential equations with piecewise continuous arguments. In Delay Differential Equations and Dynamical Systems, 1–15. Springer.
De Brouwer, E.; Simm, J.; Arany, A.; and Moreau, Y. 2019. GRU-ODE-Bayes: Continuous Modeling of Sporadically Observed Time Series. In Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 32, 7379–7390. Curran Associates, Inc.
Deng, R.; Chang, B.; Brubaker, M. A.; Mori, G.; and Lehrmann, A. 2020. Modeling continuous stochastic processes with dynamic normalizing flows. arXiv preprint arXiv:2002.10516.
Dupont, E.; Doucet, A.; and Teh, Y. W. 2019. Augmented neural ODEs. In Advances in Neural Information Processing Systems, 3140–3150.
E, W. 2017. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1): 1–11.
E, W.; Han, J.; and Li, Q. 2019. A mean-field optimal control formulation of deep learning. Research in the Mathematical Sciences, 6(1): 10.
Erneux, T. 2009. Applied delay differential equations, volume 3. Springer Science & Business Media.
Evans, L. C. 1983. An introduction to mathematical optimal control theory, version 0.2. Lecture notes, available at http://math.berkeley.edu/~evans/control.course.pdf.
Fang, L.; Lin, W.; and Luo, Q. 2018. Brain-Inspired Constructive Learning Algorithms with Evolutionally Additive Nonlinear Neurons. International Journal of Bifurcation and Chaos, 28(05): 1850068.
Finlay, C.; Jacobsen, J.-H.; Nurbekyan, L.; and Oberman, A. 2020. How to train your neural ODE: the world of Jacobian and kinetic regularization. In International Conference on Machine Learning, 3154–3164. PMLR.
Grathwohl, W.; Chen, R. T.; Bettencourt, J.; Sutskever, I.; and Duvenaud, D. 2018. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367.
Haber, E.; and Ruthotto, L. 2017. Stable architectures for deep neural networks. Inverse Problems, 34(1): 014004.
Han, J.; Jentzen, A.; and Weinan, E. 2018. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34): 8505–8510.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
Huang, G.; Liu, Z.; Van Der Maaten, L.; and Weinberger, K. Q. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4700–4708.
Jayasree, K. N.; and Deo, S. G. 1992. On piecewise constant delay differential equations. Journal of Mathematical Analysis and Applications, 169(1): 55–69.
Jia, J.; and Benson, A. R. 2019. Neural jump stochastic differential equations. arXiv preprint arXiv:1905.10403.
Kelly, J.; Bettencourt, J.; Johnson, M. J.; and Duvenaud, D. 2020. Learning differential equations that are easy to solve. arXiv preprint arXiv:2007.04504.
Kidger, P.; Morrill, J.; Foster, J.; and Lyons, T. 2020. Neural controlled differential equations for irregular time series. arXiv preprint arXiv:2005.08926.
Lewis, F. L.; Vrabie, D.; and Syrmos, V. L. 2012. Optimal Control. John Wiley & Sons.
Li, Q.; Chen, L.; Tai, C.; and Weinan, E. 2017. Maximum principle based algorithms for deep learning. The Journal of Machine Learning Research, 18(1): 5998–6026.
Li, Q.; and Hao, S. 2018. An optimal control approach to deep learning and applications to discrete-weight neural networks. arXiv preprint arXiv:1803.01299.
Li, T.-Y.; and Yorke, J. A. 1975. Period Three Implies Chaos. The American Mathematical Monthly, 82(10): 985–992.
Li, X.; Wong, T.-K. L.; Chen, R. T.; and Duvenaud, D. 2020. Scalable gradients for stochastic differential equations. In International Conference on Artificial Intelligence and Statistics, 3870–3882. PMLR.
Liu, X.; Xiao, T.; Si, S.; Cao, Q.; Kumar, S.; and Hsieh, C.-J. 2019. Neural SDE: Stabilizing neural ODE networks with stochastic noise. arXiv preprint arXiv:1906.02355.
Lu, Y.; Zhong, A.; Li, Q.; and Dong, B. 2018. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. In International Conference on Machine Learning, 3276–3285. PMLR.
Mackey, M.; and Glass, L. 1977. Oscillation and chaos in physiological control systems. Science, 197(4300): 287–289.
Massaroli, S.; Poli, M.; Park, J.; Yamashita, A.; and Asama, H. 2020. Dissecting neural ODEs. arXiv preprint arXiv:2002.08071.
Pathak, J.; Hunt, B.; Girvan, M.; Lu, Z.; and Ott, E. 2018. Model-Free Prediction of Large Spatiotemporally Chaotic Systems from Data: A Reservoir Computing Approach. Physical Review Letters, 120(2).
Pontryagin, L. S. 1987. Mathematical Theory of Optimal Processes. CRC Press.
Radovanovic, M. 2007. Functional equations. The IMO Compendium, imomath.com/index.php.
Rempe, D.; Birdal, T.; Zhao, Y.; Gojcic, Z.; Sridhar, S.; and Guibas, L. J. 2020. CaSPR: Learning Canonical Spatiotemporal Point Cloud Representations. arXiv preprint arXiv:2008.02792.
Rice, R.; Schweizer, B.; and Sklar, A. 1980. When is f(f(z)) = az² + bz + c? The American Mathematical Monthly, 87(4): 252–263.
Rubanova, Y.; Chen, R. T. Q.; and Duvenaud, D. K. 2019. Latent Ordinary Differential Equations for Irregularly Sampled Time Series. In Wallach, H.; Larochelle, H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; and Garnett, R., eds., Advances in Neural Information Processing Systems, volume 32, 5320–5330. Curran Associates, Inc.
Ruthotto, L.; and Haber, E. 2019. Deep neural networks motivated by partial differential equations. Journal of Mathematical Imaging and Vision, 1–13.
Song, Y.; Sohl-Dickstein, J.; Kingma, D. P.; Kumar, A.; Ermon, S.; and Poole, B. 2020. Score-Based Generative Modeling through Stochastic Differential Equations. arXiv preprint arXiv:2011.13456.
Sun, Y.; Zhang, L.; and Schaeffer, H. 2020. NeuPDE: Neural network based ordinary and partial differential equations for modeling time-dependent data. In Mathematical and Scientific Machine Learning, 352–372. PMLR.
Sun, Y.-Z.; Leng, S.-Y.; Lai, Y.-C.; Grebogi, C.; and Lin, W. 2017. Closed-loop control of complex networks: A trade-off between time and energy. Physical Review Letters, 119(19): 198301.
Tang, Y.; Kurths, J.; Lin, W.; Ott, E.; and Kocarev, L. 2020. Introduction to Focus Issue: When machine learning meets complex systems: Networks, chaos, and nonlinear dynamics. Chaos: An Interdisciplinary Journal of Nonlinear Science, 30(6): 063151.
Yang, G.; Huang, X.; Hao, Z.; Liu, M.-Y.; Belongie, S.; and Hariharan, B. 2019. PointFlow: 3D point cloud generation with continuous normalizing flows. In Proceedings of the IEEE International Conference on Computer Vision, 4541–4550.
Zhang, D.; Zhang, T.; Lu, Y.; Zhu, Z.; and Dong, B. 2019. You only propagate once: Painless adversarial training using maximal principle. arXiv preprint arXiv:1905.00877, 2(3).
Zhang, H.; Gao, X.; Unterman, J.; and Arodz, T. 2020. Approximation capabilities of neural ODEs and invertible residual networks. In International Conference on Machine Learning, 11086–11095. PMLR.
Zhu, Q.; Guo, Y.; and Lin, W. 2021. Neural Delay Differential Equations. In International Conference on Learning Representations.
Zhu, Q.; Ma, H.; and Lin, W. 2019. Detecting unstable periodic orbits based only on time series: When adaptive delayed feedback control meets reservoir computing. Chaos, 29(9): 093125.