
Dynamics-inspired Neuromorphic Visual Representation Learning

Zhengqi Pei 1 2 Shuhui Wang 1 3

This paper investigates a dynamics-inspired neuromorphic architecture for visual representation learning following Hamilton's principle. Our method converts a weight-based neural structure to its dynamics-based form, which consists of finite sub-models whose mutual relations, measured by computing path integrals amongst their dynamical states, are equivalent to the typical neural weights. Based on the entropy reduction process derived from the Euler-Lagrange equations, the feedback signals, interpreted as stress forces amongst sub-models, push them to move. We first train a dynamics-based neural model from scratch and observe that this model outperforms traditional neural models on MNIST. We then convert several pre-trained neural structures into dynamics-based forms, followed by fine-tuning via entropy reduction to obtain the stabilized dynamical states. We observe consistent improvements in these transformed models over their weight-based counterparts on ImageNet and WebVision in terms of computational complexity, parameter size, testing accuracy, and robustness. Besides, we show the correlation between model performance and structural entropy, providing deeper insight into weight-free neuromorphic learning.

1. Introduction

A biological brain learns by both structural evolution, via rewiring neural pathways (Chklovskii et al., 2004), and numerical evolution, via strengthening or weakening neural connections (Cho et al., 2015). Following the rule of biological neurons, artificial neural networks (ANNs) mimic the biological brain with neurons organized in a fixed layered structure, such as the fully connected neural network (Hinton et al., 2006), CNN (Krizhevsky et al., 2017) and Transformers (Liu et al., 2021). In real applications such as image categorization, different neural structures lead to varying performance on widely-used datasets (Golubeva et al., 2020) such as ImageNet (Deng et al., 2009). Despite their great success, ANNs have several intrinsic drawbacks, e.g., the requirement of a massive number of parameters, gradient vanishing or explosion (Pascanu et al., 2013), and redundant computations (Roy Chowdhury et al., 2018). The fixed structure of ANNs is considered suboptimal to approximate the brain precisely and efficiently (Han et al., 2021).

1 Institute of Computing Technology, Chinese Academy of Sciences. 2 School of Artificial Intelligence, University of Chinese Academy of Sciences. 3 Peng Cheng Laboratory. Zhengqi Pei <peizhengqi22@mails.ucas.ac.cn>. Correspondence to: Shuhui Wang <wangshuhui@ict.ac.cn>.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

It has been shown that evolving neural mechanisms, such as NeuroEvolution (Stanley & Miikkulainen, 2002) and neural architecture search (NAS) (Elsken et al., 2019), which alter the layers and parameters, can significantly outperform their counterparts with fixed structures (Assunção et al., 2018). Nonetheless, these mechanisms, established as dynamic neural networks (Han et al., 2021), require a large number of expensive fitness evaluations that are inefficient for real-time learning (Stork et al., 2019) and appear unstable. As another direction of research, some attempts, e.g., the weight agnostic neural network (Gaier & Ha, 2019), spiking neural networks (Basegmez, 2014) and weight mirror (Akrout et al., 2019), reconsider the importance of neural weight and structure. They either rely on explicit neural weight computation or have yet to develop an efficient learning mechanism, for example to overcome the difficulties of training a spiking neural network via supervised learning (Huynh et al., 2022), and they remain inefficient due to the enormous search space of model structures (Ren et al., 2021). In general, weight-based ANNs can hardly achieve unified structural and numerical evolution.

In this study, we consider structural and numerical learning from the neuromorphic dynamics aspect, inspired by Hebb's learning rule (Cooper, 2005), which states that neural connections between neurons with similar dynamic behaviors tend to be stronger. Rather than explicitly implementing Hebb's rule on the neural connections as weights, we represent the dynamic behaviors of neurons (neuronal dynamics) as their spatial coordinates. Neurons with similar dynamic behaviors have closer spatial coordinates, leading to naturally stronger connections. To formalize the whole mechanism, we first reinterpret the universal approximation theorem (UAT) (Scarselli & Tsoi, 1998) into a dynamical alternative, which claims that neural weights can be calculated as the covariants of neuronal states while retaining the approximation capacity of the whole neural system. In the dynamical UAT (Eq. 2), neurons receive the input signals that affect neuronal dynamics and emit the processed signals to other neurons. The signals between neurons amplify or decay during transmission via path integral. During training, the neurons move until their neuronal states reach equilibrium. In this way, the trainable units are the neuronal states. The neural weights between neurons need not be maintained as concrete trainable units, since they can be expressed as path integrals between neuronal states.

(a) Weight-based neural networks: weights are trainable variables that are explicitly isolated from each other; they are treated as the essential parameters directly affected by the feedback signals, such as the predictive error during the back-propagation process.

(b) Dynamics-inspired neuromorphic system: neural weights are path integrals between neuronal dynamics.

Figure 1: Comparison between neural networks and the dynamics-inspired neuromorphic system. We interpret the neurons as sub-models $q_i^{(l,t)}$ embedded in the high-dimensional neuronal state space. The indices $l$ and $t$ refer to a subsystem's index and the time step. In Fig. 1b, sub-models of different subsystems, corresponding to input, hidden, and output layers, are mixed in the neuronal state space. The feedforward and feedback signals push the neurons to continue moving until equilibrium. The neuronal dynamics determine the nonlinear spatial density nearby, affecting the signals traveling among neurons in a curved/nonlinear manner. The area in pink indicates higher density, while the area in cyan indicates lower density.

Accordingly, we propose a Dynamics-inspired Neuromorphic (DyN) learning framework where the trainable parameters are the neuronal dynamics (Figure 1b). DyN applies dynamics-based updating rules on finite sub-models, i.e., the functional neurons receiving and emitting signals with neuronal states changing dynamically. Model learning and inference are undertaken by computing the path integral between the neuronal dynamics. Computationally, this allows one to change the coordinates of distinct neurons efficiently with functional integrals (Weinberg, 1995), facilitating a more global interaction among neurons compared to the layer-by-layer update rules used in deep learning architectures (Pascanu et al., 2013).

Experimentally, we first validate DyN on MNIST (Deng, 2012) to verify its capacity as a universal approximator. In the case of LeNet-5 (LeCun et al., 2015), we observe that when we transform both the convolutional and fully-connected layers to their DyN form, the parameters are reduced by a factor of 10 to 30, yet the transformed DyN models outperform the original LeNet-5 by 0.2%. Then, we transform the weight-based layers of several pre-trained deep neural models, e.g., DenseNet (Huang et al., 2017) and SwinT (Liu et al., 2021), to their DyN alternatives, followed by fine-tuning on ImageNet (Deng et al., 2009). As a result, the parameters are reduced by a factor of 5 to 10. Still, the transformed models outperform the original ones on both ImageNet and WebVision (Li et al., 2017). These observations reveal the potential of building a more efficient neural computing architecture focusing on neuronal dynamics rather than weight transport.

We also consider practical implementation issues, assuming that existing computing devices naturally contain noise (Braverman et al., 2015), e.g., the quantization error of digitization, which affects parameter precision during storage and calculation. Specifically, we assume a model's parameter space is perturbed by noise distributed uniformly on an ϵ-ball. We round all parameters to a certain precision, e.g., round a 5-digit value to a 2-digit value, to see how the performance of the quantized model is affected. The results indicate that our model is robust to different noise levels, suggesting that we can accelerate the model via hashing without a significant loss of accuracy. Code is available (footnote 1).

2. Preliminaries

Interpreting ANN as a dynamical system. A typical ANN comprises neurons organized in a specific structure and trainable neural weights connecting them. The weights among different neurons keep updating during training with layer-by-layer updating rules. According to dynamics theory, an ANN is a dynamic system where the neurons dynamically interact towards a minimal objective function, e.g., the cross-entropy loss. However, an ANN's fixed structure seems suboptimal because it constrains the learning toward a universally optimal network configuration and imposes an unnecessary computational burden on training and inference. In comparison, our method is straightforward. By treating each neuron state as a basic trainable unit, we replace the neural weights between neurons with the dynamic interaction between neuronal dynamics. Neural weight values are measurements of the transient interaction between neurons, which are directly accessible from the dynamical states of neurons. Neurons are allowed to interact fully with each other during training, facilitating a comprehensive release of model capacity.

1 https://github.com/pzqpzq/flat-learning

Sub-models and subsystems. A neuron's dynamical state, e.g., spatial location, velocity, acceleration, activation/inhibition, etc., determines its spatial coordinates in a d-dimensional phase space. Neurons with similar behaviors are located close together in the phase space. We call these dynamical neurons sub-models to distinguish them from node-like neurons in the computational sense. A sub-model is a functional neuron receiving and/or emitting signals and changing its dynamical states. A group of sub-models sharing identical global settings, e.g., a hidden layer in an MLP or a convolutional layer in a CNN, is referred to as a subsystem.

We interpret the dynamical states of a sub-model with index $i$ as a time-variant embedding $q_i^{(l,t)} \in \mathbb{R}^d$, where $t$ refers to the time-step and $l$ refers to the index of the subsystem that contains the sub-model. In Figure 2, we define two vector fields $E_i^{(l,t)}, R_i^{(l,t)}: \mathbb{R}^d \mapsto \mathbb{R}^s$. For instance, $E_i^{(l,t)}$ converts a direction $v \in \mathbb{R}^d$ into a signal $E_i^{(l,t)}(v) \in \mathbb{R}^s$. Intuitively, we have $R_j^{(t')}(0) = \sum_i S_{ij}^{(t,t')}(q_j^{(t')})$ and $E_i^{(t)}(0) = \sum_j S_{ij}^{(t,t')}(q_i^{(t)})$. We define $R_j^{(t')}(0) \equiv R_j^{(t')}$ and $E_i^{(t)}(0) \equiv E_i^{(t)}$ for simplicity, and we set $E_i^{(t)}(u) = E_i^{(t)}(v)$, $\forall u \neq v$, assuming that a sub-model emits signals isotropically in any direction. We also define a mapping $S_{ij}^{(t,t')}: \mathbb{R}^d \mapsto \mathbb{R}^s$ that describes how the signal emitted from $q_i^{(t)}$ varies along the path towards $q_j^{(t')}$, e.g.,

$$S_{ij}^{(t,t')}(v) = E_i^{(t)} \, \frac{\| v - q_i^{(t)} \|_p}{\| q_j^{(t')} - q_i^{(t)} \|_p}.$$

Accordingly, we can transform any tensor-formed neural layer into subsystems with a specified topology. As presented in Table 1, a sub-model refers to a neuron for an MLP's fully-connected layer $M_{FC}$. A convolution layer $M_C$ (kernel window $k \times k$, with $N_{in}$ and $N_{out}$ channels) refers to $2k$ subsystems, each containing $N_{in} + N_{out}$ sub-models. An attention layer with $M_Q, M_K \in \mathbb{R}^{T \times d_k}$ and $M_V \in \mathbb{R}^{T \times d_v}$, where $d_k$ and $d_v$ are the hidden dimensions and $T$ is the sequence length, refers to two subsystems containing $2d_k + d_v$ and $T$ sub-models, respectively. We map the terminology of a typical DNN to the related concepts of DyN in Appendix B.

Figure 2: The signal varies along the nonlinear path between sub-models.

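Table 1: From neural layer to DyN. We denote $P(x)$ as a subsystem containing $x$ sub-models.

| Models | Layer Types | DyN Types |
|---|---|---|
| MLP | $M_{FC} \in \mathbb{R}^{m \times n}$ | $P(m) + P(n)$ |
| CNN | $M_C \in \mathbb{R}^{k \times k \times N_{in} \times N_{out}}$ | $2k \times P(N_{in} + N_{out})$ |
| Transformer | $M_Q \in \mathbb{R}^{T \times d_k}$, $M_K \in \mathbb{R}^{T \times d_k}$, $M_V \in \mathbb{R}^{T \times d_v}$ | $P(2d_k + d_v) + P(T)$ |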
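To make the bookkeeping of Table 1 concrete, the following minimal sketch counts, for each layer type, the number of scalar weights in the tensor form and the number of sub-models in the DyN form. The function names are hypothetical and the counts follow the table literally; the number of copies $H$ and the embedding dimension $d$ of each sub-model are not included.

```python
def fc_counts(m, n):
    """(weights, sub-models) for M_FC in R^{m x n}, DyN form P(m) + P(n)."""
    return m * n, m + n


def conv_counts(k, n_in, n_out):
    """(weights, sub-models) for M_C in R^{k x k x N_in x N_out}.

    Table 1 lists 2k subsystems, each containing N_in + N_out sub-models.
    """
    return k * k * n_in * n_out, 2 * k * (n_in + n_out)


def attention_counts(t, d_k, d_v):
    """(weights, sub-models) for M_Q, M_K in R^{T x d_k} and M_V in R^{T x d_v}.

    DyN form: P(2 d_k + d_v) + P(T).
    """
    return 2 * t * d_k + t * d_v, (2 * d_k + d_v) + t


# Illustrative sizes only (not taken from the experiments).
print(fc_counts(784, 100))
print(conv_counts(5, 6, 16))
print(attention_counts(196, 64, 64))
```

The weight count grows with the product of the layer dimensions, while the sub-model count grows with their sum, which is the source of the parameter savings discussed in the experiments.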

Universal Approximation Theorem (UAT). This part focuses on Cybenko's arbitrary-width UAT (Cybenko, 1989), which demonstrates the approximation capabilities of a feedforward neural network in the space of continuous functions between two Euclidean spaces. It states that a continuous $\sigma: \mathbb{R} \mapsto \mathbb{R}$ is not polynomial if and only if for every $n \in \mathbb{N}$, $m \in \mathbb{N}$, compact $K \subseteq \mathbb{R}^n$, continuous $f: K \mapsto \mathbb{R}^m$, and $\varepsilon > 0$, there exist $k \in \mathbb{N}$, $A \in \mathbb{R}^{k \times n}$, $b \in \mathbb{R}^k$ and $C \in \mathbb{R}^{m \times k}$ such that

$$\sup_{x \in K} \| f(x) - g(x) \| < \varepsilon \tag{1}$$

where $g(x) = C \cdot (\sigma \circ (A \cdot x + b))$. There are various UATs for the arbitrary-depth case and for some widely used architectures such as convolutional neural networks. However, they all rely on a setting in which the trainable units are the neural weights between neurons. Next, we present an alternative to this typical UAT by considering the neuronal states, rather than the weights, as the trainable units.
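As a point of reference for the weight-based setting, the approximator $g(x) = C \cdot (\sigma \circ (A \cdot x + b))$ of Eq. 1 can be written in a few lines. This is only an illustrative sketch: the function name `shallow_net` and the choice of `tanh` as the non-polynomial $\sigma$ are assumptions.

```python
import math


def shallow_net(x, A, b, C, sigma=math.tanh):
    """Evaluate g(x) = C . sigma(A x + b) for list-based matrices.

    x: length-n input, A: k x n, b: length-k, C: m x k.
    """
    hidden = [
        sigma(sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(A, b)
    ]
    return [sum(c_ij * h_j for c_ij, h_j in zip(row, hidden)) for row in C]


# Two hidden units (k = 2), scalar input and output (n = m = 1).
A = [[1.0], [-1.0]]
b = [0.0, 0.0]
C = [[0.5, -0.5]]
y = shallow_net([0.3], A, b, C)  # equals tanh(0.3) because tanh is odd
```

Here the trainable quantities are the entries of $A$, $b$ and $C$, which is exactly the setting the dynamical alternative replaces with neuronal states.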

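Principle of dynamic subsystems. Based on Cybenko's UAT, we propose a dynamical UAT with trainable neuronal states. The dynamical UAT can approximate a time-variant sequential function. It states that, for $d, s, M, N \in \mathbb{N}$, given a system of sub-models with a set of time-variant coordinates $\{q_i^{(t)} \in \mathbb{R}^d, i \in [1, N]\}$ that receive and emit time-variant signals $R_i^{(t)} \in \mathbb{R}^s$ and $E_i^{(t)} \in \mathbb{R}^s$, then for an arbitrary nonlinear sequential mapping $\mathbb{R}^{s \times M} \mapsto \mathbb{R}^{s \times M}$ between $R_i^{(t)}$ and $E_i^{(t)}$ for any $i \in [1, N]$, there exists a set of matrices $A \in \mathbb{R}^{s \times s}$, $B \in \mathbb{R}^{s \times d}$, $C \in \mathbb{R}^{s \times d}$, $D \in \mathbb{R}^{d \times s}$, $E \in \mathbb{R}^{d \times s}$ and $F \in \mathbb{R}^{d \times d}$, such that for $t \in [1, M]$:

$$
\begin{aligned}
R_i^{(t)} &= \sum_{j \neq i} E_j^{(t-\epsilon)} \, \varphi(q_j^{(t-\epsilon)}, q_i^{(t)}) \\
E_i^{(t)} &= A R_i^{(t)} + B q_i^{(t)} + C \frac{d}{dt} q_i^{(t)} \\
\frac{d}{dt} q_i^{(t)} &= D R_i^{(t)} + E E_i^{(t)} + F q_i^{(t)}
\end{aligned}
\tag{2}
$$

where a non-polynomial bi-linear mapping $\varphi: \mathbb{R}^d \times \mathbb{R}^d \mapsto \mathbb{R}$ is used to compute the path integral between sub-models.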

The proof is presented in Appendix C. The dynamical UAT considers neural weights as the covariants of the trainable neuronal states, implying the equivalence between a weight-based neural structure and a neuron-state-based one. The time-invariant parameter matrices $\{A, ..., F\}$ describe a sub-model's emitting and receiving mechanism. There is no activation function in Eq. 2 because $\varphi$ already introduces nonlinearity. As a result, the dynamics amongst trainable neurons are sufficient to approximate arbitrary time-variant sequential functions on a specific metric function $\varphi$. Nevertheless, one can still apply nonlinear activation functions to the summation of received signals in practical implementations to introduce more nonlinearity and further enhance the stability of the whole learning system.

Formulation of a DyN system. Based on Eq. 2, we formalize the DyN system, which concentrates on learning neuronal dynamics rather than neural weights. The proposed DyN system contains subsystems $\{P^{(l)}, l \in [1, L]\}$ interpreted as nodes in a directed graph $G$, and a directed edge from $P^{(l')}$ to $P^{(l)}$ indicates that the emitted signals of $P^{(l')}$ are received by $P^{(l)}$. Each subsystem contains finite time-variant sub-models with d-dimensional embedding-like dynamical states $\{q_i^{(l,t)} \in \mathbb{R}^d, i \in [1, N_l], N_l \in \mathbb{N}\}$. Each sub-model can emit signals $E_i^{(l,t)} \in \mathbb{R}^s$ and receive signals $R_i^{(l,t)} \in \mathbb{R}^s$. We assume $t' = t$ and $S_{ij}^{(t,t')} = S_{ij}^{(t)}$ for simplicity (see Appendix D). Then we generalize the dynamics among subsystems as follows:

$$
\begin{aligned}
R_i^{(l,t)} &= \sum_{j=1}^{N_{l'}} \Psi_R^{(l)}\big(E_j^{(l',t)}, q_j^{(l',t)}, q_i^{(l,t)}\big) \\
E_i^{(l,t)} &= \Psi_E^{(l)}\big(R_i^{(l,t)}, q_i^{(l,t)}, \tfrac{\partial}{\partial t} q_i^{(l,t)}\big) \\
\tfrac{\partial}{\partial t} q_i^{(l,t)} &= \Psi_Q^{(l)}\big(R_i^{(l,t)}, E_i^{(l,t)}, q_i^{(l,t)}\big)
\end{aligned}
\tag{3}
$$

where the nonlinear $\Psi_R^{(l)}, \Psi_E^{(l)}: \mathbb{R}^s \times \mathbb{R}^d \times \mathbb{R}^d \mapsto \mathbb{R}^s$ and $\Psi_Q^{(l)}: \mathbb{R}^s \times \mathbb{R}^s \times \mathbb{R}^d \mapsto \mathbb{R}^d$ describe the I/O properties of a subsystem $l$, and $l'$ indicates the index of a subsystem that links to $P^{(l)}$ in graph $G$. Intuitively, Eq. 3 describes a dynamic system where a sub-model interacts with its adjacent sub-models by transmitting signals.

3. The DyN mechanism

We first show how to convert a fully-connected layer in $\mathbb{R}^{a \times b}$ into $a + b$ sub-models. Then we generalize this mechanism to arbitrary neural layers and show the equivalence between the DyN mechanism and entropy reduction.

Interpreting an FC layer as neuronal path integrals. For a feed-forward neural network, the $l$-th fully-connected layer with $M$ input neurons and $N$ output neurons is a pre-trained weight matrix $T \in \mathbb{R}^{M \times N}$. Here, we present a method to represent $T$ as two subsystems of d-dimensional sub-models: $P^{(l)} = \{q_i^{(l)}, i \in [1, M]\}$ and $P^{(l+1)} = \{q_j^{(l+1)}, j \in [1, N]\}$. The received signals of $P^{(l)}$ are $[R_1^{(l,t)}, ..., R_M^{(l,t)}] = R^{(l,t)} \in \mathbb{R}^{1 \times M}$, and the emitted signals of $P^{(l+1)}$ are $[E_1^{(l+1,t)}, ..., E_N^{(l+1,t)}] = E^{(l+1,t)} \in \mathbb{R}^{1 \times N}$. Our goal is to establish the subsystem-related mappings $\Psi$ (see Eq. 3) such that $R^{(l,t)} T \equiv E^{(l+1,t)}$ with $P^{(l)}$ and $P^{(l+1)}$. We set $\Psi_Q = 0$ for the inference stage with fixed sub-models. Generally, we interpret the mappings $\Psi$ as follows (we abbreviate the input arguments as $\cdot$ for convenience):

$$
\begin{aligned}
\Psi_R^{(l+1)}(\cdot) &= E_i^{(l,t)} \, \varphi(q_i^{(l,t)}, q_j^{(l+1,t)}) \\
\Psi_E^{(l)}(\cdot) &= R_i^{(l,t)}; \quad \Psi_E^{(l+1)}(\cdot) = R_j^{(l+1,t)} \\
\Psi_Q^{(l)}(\cdot) &= 0; \quad \Psi_Q^{(l+1)}(\cdot) = 0
\end{aligned}
\tag{4}
$$

where the metric function $\varphi$ refers to a nonlinear path integral, which is not a computation-friendly formulation. Instead, we can convert the nonlinear $\varphi$ into a weighted sum of multiple linear relations and efficiently deal with the non-linearity by splitting each sub-model into $H$ copies, i.e., $q_i^{(l,t)} \to \{q_{ih}^{(l,t)}, h \in [1, H]\}$. Unless specified, we initialize $H$ as $d^{-1} MN (M + N)^{-1}$. Next, we define the nonlinear relations between sub-models via the $L_p$-norm:

$$\varphi(q_i^{(l)}, q_j^{(l+1)}) = \sum_{h=1}^{H} \mu_h^{(l;l+1)} \, \| q_{ih}^{(l)} - q_{jh}^{(l+1)} \|_p \tag{5}$$

where $p \in \mathbb{N}^+$, and $\mu_h^{(l;l+1)} \in \mathbb{R}$ is a trainable shared coefficient related to $q_h^{(l)}$ and $q_h^{(l+1)}$. The total number of sub-models is flexible during training, as we can merge a set of sub-models with similar dynamical behaviors. For example, $q_{ih_x}^{(l_x,t)}$ and $q_{jh_y}^{(l_y,t)}$ can be merged at time-step $T$ if $\sum_{t=T-5}^{T} \| q_{ih_x}^{(l_x,t)} - q_{jh_y}^{(l_y,t)} \| \le 0.1$. Once adjacent sub-models are merged, their subsequent dynamical behaviors are synchronous and stored as a single sub-model. In the next section, we show how to find proper $P^{(l)}$ and $P^{(l+1)}$ with Eq. 4 and Eq. 5.
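The relation in Eq. 5 can be illustrated with a short sketch that recovers the implied weight between two sub-models from their copies. The function names (`lp_norm`, `path_integral`, `implied_weights`) and the nested-list layout, in which `q[h]` is the d-dimensional state of copy `h`, are assumptions made for illustration only.

```python
def lp_norm(u, v, p):
    """L_p distance between two d-dimensional states."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def path_integral(q_i, q_j, mu, p=2):
    """phi(q_i, q_j) = sum_h mu_h * ||q_ih - q_jh||_p (Eq. 5).

    q_i, q_j: lists of H copies, each a d-dimensional list; mu: H coefficients.
    """
    return sum(m * lp_norm(a, b, p) for m, a, b in zip(mu, q_i, q_j))


def implied_weights(P_in, P_out, mu, p=2):
    """M x N matrix of path integrals between two subsystems."""
    return [[path_integral(q_i, q_j, mu, p) for q_j in P_out] for q_i in P_in]


# One input and one output sub-model, H = 2 copies, d = 2.
q_in = [[0.0, 0.0], [1.0, 1.0]]
q_out = [[3.0, 4.0], [1.0, 1.0]]
mu = [0.5, 2.0]
w = path_integral(q_in, q_out, mu)  # 0.5 * 5.0 + 2.0 * 0.0 = 2.5
```

Because the coefficients $\mu_h$ may be negative, the weighted sum of distances can represent weights of either sign even though each norm is non-negative.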


Training a fixed FC layer with the DyN mechanism. We describe the DyN mechanism to learn $P^{(l)}$ and $P^{(l+1)}$ such that $R^{(l,t)} T \equiv E^{(l+1,t)}$ with the mappings $\Psi$ defined in Eq. 4 and Eq. 5. We denote $v_{ij;h} = q_{ih}^{(l)} - q_{jh}^{(l+1)}$, and the stress force $F_{ij} = T_{ij} - \varphi(q_i^{(l)}, q_j^{(l+1)})$ between sub-models under the target $T$. A sub-model moves in the direction of its stress force:

$$
\begin{aligned}
\frac{\partial q_{ih}^{(l)}}{\partial t} &= \mu_h^{(l;l+1)} \sum_{j=1}^{N} \frac{v_{ij;h}}{\| v_{ij;h} \|_p} F_{ij} \\
\frac{\partial q_{jh}^{(l+1)}}{\partial t} &= -\mu_h^{(l;l+1)} \sum_{i=1}^{M} \frac{v_{ij;h}}{\| v_{ij;h} \|_p} F_{ij} \\
\sum_{i=1}^{M} \sum_{j=1}^{N} \| v_{ij;h} \|_p F_{ij} &= 0
\end{aligned}
\tag{6}
$$

The last line of Eq. 6 follows the energy and momentum conservation laws. The mechanism in Eq. 6 can approximate arbitrary linear transformations and is consistent with back-propagation (Appendices F and G).
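The update in Eq. 6 can be sketched numerically as follows. This is a hedged illustration, not the released implementation: it assumes $p = 2$, explicit Euler steps with a hypothetical step size `lr`, a sign convention under which the move equals gradient descent on $\frac{1}{2}\sum_{ij} F_{ij}^2$, and a gradient step on $\mu_h$ whose fixed point is the last line of Eq. 6. The initial coordinates and the target matrix are arbitrary.

```python
import math


def stress_step(q_in, q_out, mu, target, lr=0.01):
    """One Euler step of the stress-force dynamics for p = 2.

    q_in[i][h] and q_out[j][h] are d-dimensional lists; target is M x N.
    Returns updated (q_in, q_out, mu) and the stress 0.5 * sum F_ij^2
    measured before the step.
    """
    M, N, H = len(q_in), len(q_out), len(mu)
    d = len(q_in[0][0])
    # v[i][j][h] = q_in[i][h] - q_out[j][h] and its Euclidean norm.
    v = [[[[a - b for a, b in zip(q_in[i][h], q_out[j][h])]
           for h in range(H)] for j in range(N)] for i in range(M)]
    norm = [[[math.sqrt(sum(c * c for c in v[i][j][h]))
              for h in range(H)] for j in range(N)] for i in range(M)]
    # Stress force F_ij = T_ij - phi(q_i, q_j), with phi from Eq. 5.
    F = [[target[i][j] - sum(mu[h] * norm[i][j][h] for h in range(H))
          for j in range(N)] for i in range(M)]
    loss = 0.5 * sum(F[i][j] ** 2 for i in range(M) for j in range(N))

    new_in = [[list(q) for q in copies] for copies in q_in]
    new_out = [[list(q) for q in copies] for copies in q_out]
    new_mu = list(mu)
    for i in range(M):
        for j in range(N):
            for h in range(H):
                n = norm[i][j][h]
                if n == 0.0:
                    continue  # direction undefined for coincident copies
                for c in range(d):
                    push = lr * mu[h] * F[i][j] * v[i][j][h][c] / n
                    new_in[i][h][c] += push   # input copy moves along +v
                    new_out[j][h][c] -= push  # output copy moves along -v
                new_mu[h] += lr * n * F[i][j]  # assumed update for mu_h
    return new_in, new_out, new_mu, loss


# Arbitrary 3 x 2 target and well-separated initial states (H = 2, d = 2).
target = [[0.5, -0.3], [0.2, 0.8], [-0.6, 0.1]]
q_in = [[[0.1 * (i + 1) + 0.3 * h, -0.2 * (i + 1)] for h in range(2)]
        for i in range(3)]
q_out = [[[1.0 + 0.2 * j, 0.5 * h + 0.1 * j] for h in range(2)]
         for j in range(2)]
mu = [1.0, -1.0]
history = []
for _ in range(200):
    q_in, q_out, mu, loss = stress_step(q_in, q_out, mu, target)
    history.append(loss)
```

Under these assumptions the recorded stress decreases over the iterations, which mirrors the statement that sub-models keep moving until the stress force vanishes.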

Training an MLP with DyN from scratch. Now we proceed to train an MLP from scratch, i.e., we only know the input $R^{(in)} \in \mathbb{R}^{N_{in}}$ and the target output $T^{(out)} \in \mathbb{R}^{N_{out}}$ of each training sample. Given an MLP that contains two randomly initialized weights $W^{(in;hid)} \in \mathbb{R}^{N_{in} \times N_{hid}}$ and $W^{(hid;out)} \in \mathbb{R}^{N_{hid} \times N_{out}}$, the signals transmitted along each layer are defined as $E^{(hid)} = \sigma(R^{(in)} W^{(in;hid)})$ and $E^{(out)} = \sigma(E^{(hid)} W^{(hid;out)})$. Following Eq. 6, we can represent this MLP as three subsystems: $P^{(in)} = \{q_i^{(in)}, i \in [1, N_{in}]\}$, $P^{(hid)} = \{q_k^{(hid)}, k \in [1, N_{hid}]\}$, and $P^{(out)} = \{q_j^{(out)}, j \in [1, N_{out}]\}$. The path integral between sub-models of $P^{(x)}$ and $P^{(y)}$ is denoted by $\varphi^{(xy)} \in \mathbb{R}^{N_x \times N_y}$. We denote $\Phi_j^{(xy)} = \sum_i \sigma(R_i^{(x)}) \, v_{ij}^{(xy)}$ as the signals received by $q_j^{(y)}$ from all the sub-models in $P^{(x)}$, where $v_{ij}^{(xy)} = q_i^{(x)} - q_j^{(y)}$. Our goal is to update the sub-models such that $T^{(out)} = \sigma(\sigma(R^{(in)} \varphi^{(in;hid)}) \varphi^{(hid;out)})$. Using back-propagation to replace $F_{ij}$ in Eq. 6 with the gradients with respect to the neural weights, we can update the sub-models as follows:

$$
\begin{aligned}
\frac{\partial q_{i;h}^{(in,t)}}{\partial t} &\propto \mu_h^{(in;hid)} \, \Phi_i^{(in)} \, \Phi_i^{(hid;in)} \\
\frac{\partial q_{j;h}^{(out,t)}}{\partial t} &\propto \mu_h^{(hid;out)} \, \Phi_j^{(out)} \, \Phi_j^{(hid;out)} \\
\frac{\partial q_{k;h}^{(hid)}}{\partial t} &\propto \mu_h^{(in;hid)} \, \Phi_k^{(in;hid)} + \mu_h^{(hid;out)} \, \Phi_k^{(out;hid)}
\end{aligned}
\tag{7}
$$

where $\Phi_i^{(in)} = \sigma(R_i^{(in)})$ and $\Phi_j^{(out)} = T_j^{(out)} - E_j^{(out)}$. The coefficients $\mu_h$ are updated recursively using Eq. 6, i.e., we first update the sub-models, then reconstruct the nonlinear $\varphi$ and obtain the gradients of $\| \Phi^{(out)} \|^2$ with respect to $\varphi$. The resulting gradients are the stress $F$ updating $\mu_h$ as in Eq. 6.

Converting general tensor data flow into signals. The previously introduced cases focus on constructing the DyN mechanism based on a fixed neural structure, i.e., prior knowledge telling each sub-model exactly which signals it should process has been provided. However, to build a DyN system from scratch without any reference neural structure, a sub-model should be able to distinguish the signals to be processed from all the received signals. Therefore, we need more signal components to store the positional features. For example, as presented in Fig. 3, a sub-model $q_{11}$ in a 2-dimensional neural state space emits a signal $E_{11}$ that contains the unique ID of $q_{11}$ by appending a positional encoding $[0, 0]$ to its original signal $[X_{11}]$; the other sub-models then know that $E_{11}$ is emitted by $q_{11}$ based on the positional features.

Figure 3: Converting a tensor into signals emitted by sub-models. Suppose we have a tensor input $[X_{ij}] \in \mathbb{R}^{3 \times 3}$; then there could be 9 sub-models receiving the tensor input. Each sub-model $q_{ij}$ receives signals $X_{ij}$ and emits signals $E_{ij}$ that contain the positional features.

Generally, for a tensor of rank $r$, i.e., $X \in \mathbb{R}^{d_1 \times ... \times d_r}$, we use at most $\prod_{k=1}^{r} d_k$ sub-models to receive the signals referring to $X$. A sub-model receives a signal that refers to an entry of $X$ and emits a signal in $\mathbb{R}^{1+r}$ that contains the r-dimensional positional feature. There are many techniques to reduce the number of sub-models necessary to represent the tensor input. Let us denote $V_x$ as the set of slices of $X$ obtained by specifying the index of its $x$-th dimension, e.g., for $X \in \mathbb{R}^{3 \times 3 \times 2}$, we have $V_2[y] = X[:, y, :]$. If we observe that the components of $V_x$ are similar (using a method like KL-divergence), then we can reduce the sub-models along the $x$-th dimension, i.e., we need only $\prod_{k \neq x} d_k$ sub-models.
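The conversion from a rank-$r$ tensor to position-tagged signals can be sketched as below. The helper name `tensor_to_signals` and the convention of placing the value first, followed by zero-based indices, are assumptions chosen to match the $[X_{11}]$ plus $[0, 0]$ example of Fig. 3.

```python
from itertools import product


def tensor_to_signals(x, shape):
    """Flatten a nested-list tensor into signals in R^{1+r}.

    Each signal is [value, idx_1, ..., idx_r], i.e. the original entry
    concatenated with its r-dimensional positional feature.
    """
    signals = []
    for index in product(*(range(n) for n in shape)):
        value = x
        for k in index:
            value = value[k]
        signals.append([value, *index])
    return signals


# A 3 x 3 input as in Fig. 3: nine sub-models, each emitting a 3-component signal.
X = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
signals = tensor_to_signals(X, (3, 3))
```

With this layout a receiving sub-model can identify the emitter of a signal from the trailing components alone, which is the role the positional features play in the text.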

For a neural structure with hidden layers, we might need a larger signal shape, e.g., $\mathbb{R}^{r+2}$, to store the layer-wise feature that indicates which hidden layer the signals come from. The formulation of positional features is not unique, and they are generally concatenated with the original signal vectors. Based on the trained positional features, the subsystem-related functions $\Psi$ can tell a sub-model which sub-models to receive signals from. The target label $T^{(out)} \in \mathbb{R}^{N_{out}}$ corresponds to $N_{out}$ sub-models, each emitting a signal $E_i^{(out)} \in \mathbb{R}^s$. Then we obtain the normalized feedback signal $\Phi_i^{(out)} = \| T_i^{(out)} - E_i^{(out)} \|_p$, where $s$ and $p$ are hyper-parameters introduced in previous sections.

Training arbitrary layers with DyN. Eq. 7 can be seen as dynamical back-propagation, which implicitly updates neural connections while minimizing the stress force between neuronal states (see Appendix F). To train an arbitrary layer, e.g., convolution or self-attention, with the DyN mechanism, we first obtain the dummy states by computing the path integrals amongst neurons, recovering the weight-based layers according to Table 1. Next, we compute the dummy stress force (gradient descent) and apply Eq. 6 to reduce the stress force. See Alg. 1 for the whole procedure.

Algorithm 1 DyN learning for general neural structure

Input: Neuronal dynamics Q, Desired Output T
repeat
    Dummy States A = Rel(Q, Q) via Eq. 5
    Stress force F = Grad(A, T) via Eq. 7
    Update Q = Reduce(F, Q) via Eq. 6
until Q reaches equilibrium

There are some practical tips to reduce computational complexity and memory. For example, one can cluster dynamically similar sub-models. The dynamical states of sub-models can be encoded as a sparse matrix via methods like vector quantization (see Eq. 12). The lossy dynamical states with an exact resolution $1/\delta$ can be implemented via a Cantor expansion $\mathbb{R}^d \mapsto \mathbb{N} \cap [1, \delta^{-d}]$, where $\delta^{-d}$ should not exceed the maximal value allowed in the current computing system, e.g., $1.8 \times 10^{308}$ for any floating-point number represented as a 64-bit double-precision value.
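One way to picture the mapping $\mathbb{R}^d \mapsto \mathbb{N} \cap [1, \delta^{-d}]$ is a positional (base $1/\delta$) encoding of the quantized coordinates. The sketch below uses this simple scheme as a stand-in for the expansion mentioned above; the function names, the assumption that coordinates lie in $[lo, hi]$, and decoding to cell centers are all illustrative choices.

```python
def encode_state(q, delta, lo=0.0, hi=1.0):
    """Map a d-dimensional state to an integer in [1, (1/delta)**d]."""
    levels = round(1 / delta)  # number of quantization levels per coordinate
    code = 0
    for x in q:
        level = int((x - lo) / (hi - lo) * levels)
        level = min(levels - 1, max(0, level))  # clamp to a valid level
        code = code * levels + level
    return code + 1


def decode_state(code, d, delta, lo=0.0, hi=1.0):
    """Recover the centers of the quantization cells addressed by code."""
    levels = round(1 / delta)
    code -= 1
    digits = []
    for _ in range(d):
        code, level = divmod(code, levels)
        digits.append(level)
    digits.reverse()
    return [lo + (level + 0.5) * (hi - lo) / levels for level in digits]


# d = 2 and resolution 1/delta = 10 give 100 distinct codes.
first = encode_state([0.0, 0.0], 0.1)
last = encode_state([0.95, 0.95], 0.1)
round_trip = encode_state(decode_state(42, 2, 0.1), 0.1)
```

Python integers do not overflow, so the bound on $\delta^{-d}$ matters only when the code has to fit a fixed-width machine type.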

Inference stage for a DyN system. The inference stage involves the specific matrix-vector query designed for distance matrices via faster linear algebra (Indyk & Silwal, 2022). Now we present how to feedforward an MLP on a DyN system with the $L_1$-norm. Consider an MLP with weights $W^{(in;hid)} \in \mathbb{R}^{N_{in} \times N_{hid}}$ and $W^{(hid;out)} \in \mathbb{R}^{N_{hid} \times N_{out}}$. Its DyN alternative has three subsystems: $P^{(in)} = \{q_i^{(in)}, i \in [1, N_{in}]\}$, $P^{(hid)} = \{q_k^{(hid)}, k \in [1, N_{hid}]\}$ and $P^{(out)} = \{q_j^{(out)}, j \in [1, N_{out}]\}$. The input signals $R^{(in)} \in \mathbb{R}^{N_{in}}$ are received by $P^{(in)}$, which emits $E^{(in)} = R^{(in)}$ to $P^{(hid)}$. Then $P^{(hid)}$ receives $R^{(hid)} \in \mathbb{R}^{N_{hid}}$ from $P^{(in)}$ and emits $E^{(hid)} = \sigma(R^{(hid)})$ to $P^{(out)}$. Finally, $P^{(out)}$ receives $R^{(out)} \in \mathbb{R}^{N_{out}}$ from $P^{(hid)}$ and emits $E^{(out)} = R^{(out)}$. Specifically, the inference stage that converts $E^{(in)}$ to $R^{(hid)}$ is as follows:

$$
\begin{aligned}
R^{(hid)}[x] &= \sum_{i=1}^{N_{in}} E_i^{(in)} \, \varphi(q_i^{(in)}, q_x^{(hid)}) \\
&= \sum_{h=1}^{H} \mu_h \sum_{y=1}^{d} \sum_{i=1}^{N_{in}} E_i^{(in)} \, \big| q_{ih}^{(in)}[y] - q_{xh}^{(hid)}[y] \big|
\end{aligned}
\tag{8}
$$

Let us denote $\pi_y^+$ as the set of $i$ such that $q_{ih}^{(in)}[y] \ge q_{xh}^{(hid)}[y]$, and likewise denote $\pi_y^-$ for the remaining indices. We now focus on the inner loop and rearrange it as follows:

$$
\begin{aligned}
&\sum_{i \in \pi_y^+} E_i^{(in)} \big( q_{ih}^{(in)}[y] - q_{xh}^{(hid)}[y] \big) + \sum_{i \in \pi_y^-} E_i^{(in)} \big( q_{xh}^{(hid)}[y] - q_{ih}^{(in)}[y] \big) \\
&\quad = q_x[y] \, (Z_{neg} - Z_{pos}) + \mathrm{pos} - \mathrm{neg}
\end{aligned}
\tag{9}
$$

where $Z_{neg} = \sum_{\pi_y^-} E_i^{(in)}$, $Z_{pos} = \sum_{\pi_y^+} E_i^{(in)}$, $\mathrm{pos} = \sum_{\pi_y^+} E_i^{(in)} q_{ih}^{(in)}[y]$, and $\mathrm{neg} = \sum_{\pi_y^-} E_i^{(in)} q_{ih}^{(in)}[y]$ are preprocessed values. This method reduces the computational complexity of the matrix-vector query from $O(N_{in} N_{hid})$ to $O(Hd \max(N_{in}, N_{hid}))$. Similarly, we can compute the inference stage of any tensor-based layer containing $n$ neurons on a DyN system with the $L_p$-norm using Eq. 8 and Eq. 9. This mechanism roughly reduces these tensor-based inference stages from $O(n^2)$ to $O(Hndp)$, where $H$ is the number of copies as presented in Eq. 5.
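The rearrangement of the inner loop can be checked with a small sketch that compares the direct evaluation of Eq. 8 with a version that precomputes the quantities $Z_{neg}$, $Z_{pos}$, pos and neg from sorted coordinates and prefix sums. The use of sorting plus binary search to obtain the split into $\pi_y^-$ and $\pi_y^+$ is an assumption about how the preprocessing could be organized (it adds a logarithmic factor), and the function names are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def naive_receive(E, q_in, q_hid, mu):
    """Direct evaluation of Eq. 8 with the L1-norm."""
    H, d = len(mu), len(q_in[0][0])
    return [
        sum(mu[h] * E[i] * abs(q_in[i][h][y] - q_hid[x][h][y])
            for h in range(H) for y in range(d) for i in range(len(E)))
        for x in range(len(q_hid))
    ]


def fast_receive(E, q_in, q_hid, mu):
    """Same result via the rearrangement of Eq. 9."""
    H, d = len(mu), len(q_in[0][0])
    out = [0.0] * len(q_hid)
    for h in range(H):
        for y in range(d):
            order = sorted(range(len(E)), key=lambda i: q_in[i][h][y])
            coords = [q_in[i][h][y] for i in order]
            # Prefix sums of E_i and E_i * q_ih[y] in sorted order.
            z = [0.0] + list(accumulate(E[i] for i in order))
            s = [0.0] + list(accumulate(E[i] * q_in[i][h][y] for i in order))
            for x in range(len(q_hid)):
                c = q_hid[x][h][y]
                k = bisect_left(coords, c)  # coords[:k] < c, coords[k:] >= c
                z_neg, neg = z[k], s[k]
                z_pos, pos = z[-1] - z_neg, s[-1] - neg
                out[x] += mu[h] * (c * (z_neg - z_pos) + pos - neg)
    return out


# Arbitrary small example: 3 input and 4 hidden sub-models, H = 2, d = 2.
E = [1.0, -2.0, 0.5]
q_in = [[[0.1 * i + 0.2 * h, 0.3 * i - 0.1 * h] for h in range(2)]
        for i in range(3)]
q_hid = [[[0.15 * x + 0.05 * h, 0.25 - 0.2 * x] for h in range(2)]
         for x in range(4)]
mu = [0.7, -0.4]
slow = naive_receive(E, q_in, q_hid, mu)
fast = fast_receive(E, q_in, q_hid, mu)
```

The prefix sums are shared by all hidden sub-models for a given copy and coordinate, so each query costs only a binary search and a constant number of arithmetic operations.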

Understanding the DyN mechanism as entropy reduction. Updating the dynamical states of the sub-models to approximate an arbitrary mapping in Eq. 7 is equivalent to reducing the structural entropy among the sub-models. Based on the Euler-Lagrange equation, we can conclude that (see Appendix G for details):

$$\frac{\partial q_i^{(t)}}{\partial t} = \eta_i \exp\left( \frac{\sum_{k \neq i} \partial_t L_{ik}^{(t)}}{\sum_{k \neq i} L_{ik}^{(t)}} \, t \right) \tag{10}$$

where $\eta_i \in \mathbb{R}$ is a trainable parameter related to $q_i^{(t)}$, and $L_{ik}^{(t)}$ is a Lagrangian that measures the energy flow for $q_k^{(t)}$. The form of $L_{ik}^{(t)}$ is not unique. For example, it can be:

$$L_{ik}^{(t)} = \frac{1}{2} m_i \frac{\partial^2 q_i^{(t)}}{\partial t^2} - U_i \, S_{ki}^{(t)}(q_i^{(t)}) \tag{11}$$

where $m_i \in \mathbb{R}$ and $U_i \in \mathbb{R}^{d \times s}$ are trainable parameters, referring to the mass and potential energy of a sub-model $i$. Intuitively, a sub-model tends to move toward the region with a lower structural entropy of the energy distribution, and the dynamical signals can be regarded as packets of energy (visualized in Figure 6 of Appendix H). This result supports the prospect that a well-formed DyN system can be updated via global dynamics.

Issues with activation functions. The input-to-output function of a sub-model is non-linear even without an activation function, because the path integral of DyN already induces non-linearity. Recall that when a signal is transmitted from one sub-model to another, it is multiplied by the path integral between these two sub-models, and the path integral is computed using the weighted sum of several linearities as in Eq. 5. However, a DyN model without an explicit activation function risks non-convergence and depends heavily on a good initialization of the learnable parameters. Thus, we apply the same activation function as the corresponding ANN to stabilize the implementation. When a sub-model receives signals from the other sub-models, it applies an activation function (Sigmoid, ReLU, or GELU), matching its ANN counterpart, to the summation of the currently received signals. Then it emits the activated summed signal to the others. The activation endows DyN with more non-linearity and larger model capacity.

Table 2: Evaluating weight-based neural models and their DyN forms on MNIST. We evaluate each model's original form on MNIST and convert its layers into DyN forms trained using DyN mechanisms. We also test the case of training LeNet-5 with Eq. 6 only, i.e., we simply convert the layers of a pre-trained LeNet-5 into DyN forms without further training. For weight-based rows, the single parameter count and the single accuracy span both sub-columns and are shown in the left one.

| Model | Layer type | No. copies (FC) | No. copies (Conv) | No. params (memory) | No. params (disk) | Test acc. (%), fixed (Eq. 6) | Test acc. (%), unfixed (Alg. 1) |
|---|---|---|---|---|---|---|---|
| 3-layered NN | FC | - | | 2,290K | | 97.89 ± 0.10 | |
| | DyN | 50 | - | 1360K | 160K | - | 98.32 ± 0.03 |
| | DyN | 75 | - | 2170K | 250K | - | 98.36 ± 0.02 |
| LeNet-5 | FC, Conv | - | | 61.8K | | 99.06 ± 0.10 | |
| | DyN | 2 | 3 | 14.50K | 2.03K | 81.44 | 99.13 ± 0.10 |
| | DyN | 2 | 5 | 16.48K | 2.25K | 84.95 | 99.15 ± 0.07 |
| | DyN | 3 | 6 | 23.01K | 2.98K | 96.28 | 99.21 ± 0.05 |
| | DyN | 5 | 8 | 36.04K | 4.44K | 98.10 | 99.21 ± 0.09 |
| | DyN | 7 | 7 | 46.11K | 5.56K | 98.83 | 99.23 ± 0.06 |

Table 3: Evaluating weight-based neural models and their DyN forms on ImageNet and WebVision. We convert the layers of the pre-trained neural models into DyN forms and fine-tune them using Alg. 1 on the training sets. All the pre-trained weight-based neural models come from torch.hub.

| Structure | Layer type | No. params (millions) | MACs (GFLOPs) | ImageNet (%), ideal | ImageNet (%), δ=1e-3 | WebVision (%), ideal | WebVision (%), δ=1e-3 |
|---|---|---|---|---|---|---|---|
| DenseNet-161 | FC, Conv | 28.68 | 7.82 | 75.254 | 71.336 | 68.973 | 61.429 |
| | DyN | 6.05 | 3.28 (0.089) | 75.314 | 75.246 | 69.033 | 68.984 |
| ResNet-152 | FC, Conv | 60.40 | 11.58 | 77.014 | 75.776 | 69.879 | 59.435 |
| | DyN | 6.51 | 5.25 (3.5e-3) | 77.203 | 76.604 | 70.005 | 69.998 |
| ViT-S-224 | FC, Conv, Attn | 36.38 | 1.11 | 80.108 | 80.038 | 72.665 | 72.509 |
| | DyN | 3.71 | 0.45 (0.75e-3) | 80.150 | 80.122 | 72.728 | 72.716 |
| SwinT-S-224 | FC, Conv, Attn | 49.94 | 8.52 | 82.634 | 82.070 | 72.755 | 72.604 |
| | DyN | 10.38 | 3.35 (0.024) | 82.646 | 82.604 | 72.802 | 72.740 |
| | DyN | 6.65 | 2.37 (0.018) | 82.688 | 82.660 | 72.934 | 72.842 |

4. Experiments

Datasets and compared approaches. We evaluate DyN on three visual classification datasets: MNIST (Deng, 2012), ImageNet (Deng et al., 2009) and WebVision (Li et al., 2017). We first conduct experiments on MNIST to compare a 3-layered feedforward neural network trained via back-propagation with its DyN-formed alternative trained from scratch via Eq. 7. Likewise, we compare LeNet-5 (LeCun et al., 2015) with its DyN alternative trained from scratch. Each convolutional layer is converted into DyN forms based on the policy presented in Table 1. We further validate our approach on ImageNet and WebVision by converting mainstream pre-trained neural models from torch.hub, built on the ImageNet training split, including DenseNet-161 (Huang et al., 2017), ResNet-152 (He et al., 2016), ViTs (Dosovitskiy et al., 2020) and Swin-Transformer (Liu et al., 2021), to their DyN forms. On all datasets, for a fair comparison, we set the model configuration of the original ANNs and their DyN alternatives (e.g., the number of hidden units, validation criterion, seed, etc.) to be the same. The training/testing splits of all the datasets follow the official settings. For WebVision results, we only use the testing split to report accuracy, while the fine-tuning is conducted on the training split of ImageNet instead of WebVision.


Evaluation metrics and implementation details. We report the top-1 accuracy, the number of parameters, and the computational complexity, which measures how many operations each model needs during the inference phase. We repeat the training procedure for each configuration 20 times to calculate its mean and standard deviation. In addition to the typical evaluation (e.g., the ideal columns in Table 3), we also conduct experiments under varying parameter resolution $1/\delta$, with $\delta > 0$, to measure a model's robustness during inference. We truncate a trainable parameter matrix $X$ into its quantized form $p_\delta(X)$ as:

$$p_\delta(X_{ij}) = \min(X) + \left\lfloor \frac{X_{ij} - \min(X)}{\delta J} \right\rfloor \delta J \tag{12}$$

where $J = \max(X) - \min(X)$. For an idealized weight-based or DyN model, we set $\delta = 0$ or $\delta = 10^{-6}$. When $\delta = 0$, $p_\delta(X_{ij}) = X_{ij}$. The trainable units in a weight-based ANN are the weights $\{W_{ij},\ i \in [1, a],\ j \in [1, b]\}$, while those in a DyN model are the input neuronal states $Q_i \in \mathbb{R}^{d \times a}$ and output neuronal states $Q_j \in \mathbb{R}^{d \times b}$. The computational complexities of a weight-based model and its DyN alternative are measured using Multiply-Accumulate units (MACs), i.e., the number of FLOPs (Patil & Kulkarni, 2018). For a DyN model, we also compute its computational complexity in a physically meaningful way (denoted in brackets next to the MACs), which assumes that the computation of the path integral amongst sub-models is executed instantaneously (a phenomenon that exists in spatiotemporal liquid crystal structures (Zhang et al., 2021)). The dimension $d$ defaults to 9 unless otherwise noted; $d$ can still be tuned to higher or lower values for numerical model optimization. We fine-tune the models with one NVIDIA RTX3090 24GB GPU on a cloud server. The inference stage is implemented on a laptop with 32GB memory.
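The truncation in Eq. 12 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `quantize` is hypothetical, and the floor-based rounding is an assumption read from the word "truncate" and from the analogous coordinate rule in Section 4.3.

```python
import numpy as np

def quantize(X, delta):
    """Truncate X onto a grid of step delta * J above min(X) (sketch of Eq. 12).

    Assumption: truncation means rounding down (floor) to the nearest grid point.
    """
    X = np.asarray(X, dtype=float)
    if delta == 0:
        # Idealized case stated in the text: p_delta(X_ij) = X_ij.
        return X.copy()
    lo = X.min()
    J = X.max() - lo              # J = max(X) - min(X)
    if J == 0:
        # A constant matrix has nothing to quantize.
        return X.copy()
    step = delta * J
    return lo + np.floor((X - lo) / step) * step

X = np.array([[0.0, 0.3], [0.7, 1.0]])
X_coarse = quantize(X, 0.5)       # two grid steps across the value range
```

A larger `delta` leaves fewer distinct parameter values, which is the sense in which a higher δ corresponds to a lower parameter resolution.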

4.1. Visual classification

The main results on three datasets are presented in Table 2 and Table 3. Compared against feedforward neural networks and LeNet-5, our randomly initialized DyN models trained from scratch via Alg. 1 demonstrate higher accuracy, lower computational complexity, and reduced parameter size. We then use several pre-trained models as backbone networks and convert their FC, convolution, and attention layers into DyN forms. The final dynamical states of the sub-models are determined by fine-tuning the transformed neural models on ImageNet's training set. This process continues until the stress force amongst sub-models is lower than a certain threshold, e.g., $10^{-3}$ of the normalized distances between sub-models. We observe that the DyN alternative of each neural model achieves significant improvement in accuracy, especially under lower parameter resolution, i.e., a higher $\delta$ value (see Figure 5 and Appendix I). Besides, a neural model with more neural blocks transformed via the DyN mechanism performs better than one with fewer DyN blocks, e.g., swin DyN in Table 3 and Figure 8a. These results show that the DyN mechanism preserves more information efficiently by encouraging all-dynamic neuron interaction.

4.2. Relation to the asymmetric convolutions

Given a convolutional layer $M_C \in \mathbb{R}^{k \times k \times N_{in} \times N_{out}}$, we have two policies to convert $M_C$ into its DyN form. The first is the $k^2$ policy: converting $M_C$ into $k^2$ subsystems $\{P^{(11)}(n), P^{(12)}(n), \dots, P^{(kk)}(n)\}$, where $n = N_{in} + N_{out}$, such that $\varphi(q_k^{(ij)}, q_l^{(ij)}) = M_C[i, j, k, l]$. The second is the $2k$ policy (Table 1): converting $M_C$ into $2k$ subsystems $\{P^{(in,1)}(n), \dots, P^{(in,k)}(n), P^{(out,1)}(n), \dots, P^{(out,k)}(n)\}$, making each subsystem correspond to an input or output channel, such that $\varphi(q_k^{(in,i)}, q_l^{(out,j)}) = M_C[i, j, k, l]$. Table 4 shows that, compared with the $k^2$ policy, the $2k$ policy significantly reduces the number of parameters while preserving the accuracy. The experiments therefore use the $2k$ policy to convert a convolutional layer.

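Table 4: Comparison of $k^2$ and $2k$ policies in convolution.

| Kernel policy | No. copies (FC) | No. copies (Conv) | No. params | Test acc. (%) |
|---|---|---|---|---|
| $k^2$ | 5 | 5 | 6.27 K | 99.15 ± 0.07 |
| $2k$ | 5 | 5 | 3.97 K | 99.16 ± 0.06 |
| $k^2$ | 5 | 8 | 7.99 K | 99.19 ± 0.06 |
| $2k$ | 5 | 8 | 4.32 K | 99.20 ± 0.08 |
| $k^2$ | 7 | 7 | 8.65 K | 99.24 ± 0.04 |
| $2k$ | 7 | 7 | 5.43 K | 99.24 ± 0.03 |
| $k^2$ | 8 | 5 | 8.11 K | 99.22 ± 0.05 |
| $2k$ | 8 | 5 | 5.81 K | 99.23 ± 0.04 |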

This idea is similar to the asymmetric convolution, which converts a $k \times k$ convolution into stacked $1 \times k$ and $k \times 1$ ones (Szegedy et al., 2016). It implies that the parameter-sharing mechanism also works in DyN and explores the neuronal covariants that induce a way to boost performance.
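To make the parameter savings in Table 4 concrete, the short sketch below computes the relative reduction of the 2k policy over the k² policy for each configuration. The numbers are copied from Table 4; the helper name `reduction_percent` and the tuple layout are illustrative choices, not part of the paper.

```python
# Each entry: (FC copies, Conv copies, k^2-policy params in K, 2k-policy params in K),
# copied from Table 4.
rows = [
    (5, 5, 6.27, 3.97),
    (5, 8, 7.99, 4.32),
    (7, 7, 8.65, 5.43),
    (8, 5, 8.11, 5.81),
]

def reduction_percent(params_k2, params_2k):
    """Relative parameter reduction of the 2k policy, in percent."""
    return 100.0 * (params_k2 - params_2k) / params_k2

reductions = [reduction_percent(p_k2, p_2k) for _, _, p_k2, p_2k in rows]

for (fc, conv, _, _), r in zip(rows, reductions):
    print(f"FC copies={fc}, Conv copies={conv}: {r:.1f}% fewer parameters")
```

Across the four configurations the reduction is roughly between one quarter and one half of the parameters, while the reported test accuracies stay within the stated standard deviations.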

4.3. Correlation with structural entropy

We examine DyN under different $\delta$ via Eq. 12. Although a larger $\delta$ leads to lower parameter precision, we observe a peak that corresponds to the optimal setting of $\delta$, at which a model achieves its best testing accuracy (Figure 7b in Appendix H). We postulate that the peak corresponds to some regularization effect that prevents overfitting, which is also related to the cross-entropy and the system's structural configuration. To validate our postulate, we first evaluate the new coordinates $q_i$ of sub-models under a varying $\delta$ by $q_i(\delta) = \delta \lfloor q_i / \delta \rfloor$, then we count the spatial distribution of the resulting sub-models in terms of coordinates, $\Pr(v_x, \delta) = |\{q_i \mid q_i(\delta) = v_x\}| / |\{q_i\}|$, where $v_x \in \mathbb{R}^d$. Moreover, we calculate the structural entropy

$$\psi(\delta) = -\sum_{v_x} \Pr(v_x, \delta) \log \Pr(v_x, \delta) \tag{13}$$

to measure the structural disorder in terms of the system's energy distribution. To connect the structural entropy with its physical meaning, we evaluate the Laplacian of curvature $\kappa$ of $\psi(\delta)$ (denoted by LapCurSE), which accounts for the energy of surface diffusion flow (Sethian & Chopp, 1999),

$$\mathrm{LapCurSE}(\delta) = \frac{\partial^2}{\partial \delta^2} \kappa(\psi(\delta)) \tag{14}$$

and we observe that an optimal structural setting always corresponds to a lower LapCurSE, whose expected value is negatively correlated with model performance (Figure 4a). This observation implies that optimal performance requires a stable structure instantiated as a minimal surface of energy distribution: $\delta_{optimal} = \arg\min_\delta \mathrm{LapCurSE}(\delta)$, which ensures that all sub-models find the dynamical states that make them the most stable with the lowest energy. We evaluate the LapCurSEs of the DyN forms of several mainstream models on ImageNet and observe that an optimal performance always corresponds to a lower LapCurSE (Figure 4b). We also observe that the total computational cost of neurons that follow the dynamical UAT is theoretically and experimentally conservative (Appendix H).

Figure 4: Scattered points and their expectations that represent model performances for a simple 2-layer DyN model with distinct resolution on MNIST (a), and for several mainstream neural models transformed via the DyN approach on ImageNet (b). Panels: (a) 2-layered DyN models on MNIST; (b) Deep DyN models on ImageNet.
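The structural entropy and its Laplacian of curvature can be estimated numerically, as in the rough sketch below. Several choices here are assumptions rather than details from the paper: the coordinates are snapped with a floor rule, the natural logarithm is used, the curvature of ψ(δ) is taken to be the planar-curve curvature ψ''/(1+ψ'²)^{3/2}, and all derivatives are finite differences via `numpy.gradient`. The function names are hypothetical.

```python
import numpy as np

def structural_entropy(Q, delta):
    """Entropy of the occupancy distribution of sub-model coordinates (cf. Eq. 13).

    Q has shape (n_submodels, d); coordinates are snapped to a grid of pitch delta.
    """
    Q = np.asarray(Q, dtype=float)
    Q_delta = delta * np.floor(Q / delta)                 # q_i(delta)
    _, counts = np.unique(Q_delta, axis=0, return_counts=True)
    p = counts / counts.sum()                             # Pr(v_x, delta)
    return float(-np.sum(p * np.log(p)))

def lap_cur_se(deltas, psi):
    """Second derivative w.r.t. delta of the curvature of psi(delta) (cf. Eq. 14).

    Assumption: kappa is the curvature of the planar curve (delta, psi(delta)).
    """
    deltas = np.asarray(deltas, dtype=float)
    psi = np.asarray(psi, dtype=float)
    d1 = np.gradient(psi, deltas)
    d2 = np.gradient(d1, deltas)
    kappa = d2 / (1.0 + d1 ** 2) ** 1.5
    return np.gradient(np.gradient(kappa, deltas), deltas)

# Toy usage: four 1-D sub-models falling into two grid cells of equal occupancy.
Q_toy = np.array([[0.1], [0.2], [1.1], [1.2]])
psi_toy = structural_entropy(Q_toy, delta=1.0)
```

In practice one would evaluate `structural_entropy` over a sweep of δ values for the trained sub-model coordinates and pass the resulting curve to `lap_cur_se`.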

Figure 5: On ImageNet, as the noise $\delta^{-1}$ increases, the accuracy of a neural model decreases, while that of its DyN alternative is almost retained.

5. Limitations

See Appendix J.

6. Conclusion

We propose a dynamics-inspired neuromorphic architecture that interprets neural representation and learning from the perspective of dynamics theory. It emphasizes the state representation of the neurons rather than the neural weights. In visual classification, our architecture fully exploits each neuronal parameter, demonstrating superiority in accuracy, parameter size, and computational complexity. Further investigation of the correlation between model performance and structural entropy reveals that learning via a structural mechanism is better than learning via a numerical mechanism in efficiency and explainability. Future work includes applying DyN to multimodal data and new tasks (e.g., retrieval and QA), and providing an in-depth physical interpretation of the neuronal state space.

Acknowledgements

This work was supported in part by the National Key R&D Program of China under Grant 2018AAA0102000, in part by National Natural Science Foundation of China: 62022083 and 62236008. We thank anonymous reviewers for helpful suggestions that improved this work.

References

Abarbanel, H. D. and Rouhi, A. Phase space density representation of inviscid fluid dynamics. The Physics of Fluids, 30(10):2952–2964, 1987.

Akrout, M., Wilson, C., Humphreys, P., Lillicrap, T., and Tweed, D. B. Deep learning without weight transport. Advances in Neural Information Processing Systems, 32, 2019.

Assunção, F., Lourenço, N., Machado, P., and Ribeiro, B. Evolving the topology of large scale deep neural networks. In European Conference on Genetic Programming, pp. 19–34. Springer, 2018.

Basegmez, E. The next generation neural networks: Deep learning and spiking neural networks. In Advanced Seminar in Technical University of Munich, pp. 1–40. Citeseer, 2014.

Braverman, M., Schneider, J., and Rojas, C. Space-bounded Church-Turing thesis and computational tractability of closed systems. Physical Review Letters, 115(9):098701, 2015.

Chklovskii, D. B., Mel, B., and Svoboda, K. Cortical rewiring and information storage. Nature, 431(7010):782–788, 2004.

Cho, R. W., Buhl, L. K., Volfson, D., Tran, A., Li, F., Akbergenova, Y., and Littleton, J. T. Phosphorylation of complexin by PKA regulates activity-dependent spontaneous neurotransmitter release and structural synaptic plasticity. Neuron, 88(4):749–761, 2015.

Cooper, S. J. Donald O. Hebb's synapse and learning rule: a history and commentary. Neuroscience & Biobehavioral Reviews, 28(8):851–874, 2005.

Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE, 2009.

Deng, L. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

Elsken, T., Metzen, J. H., and Hutter, F. Neural architecture search: A survey. The Journal of Machine Learning Research, 20(1):1997–2017, 2019.

Gaier, A. and Ha, D. Weight agnostic neural networks. Advances in Neural Information Processing Systems, 32, 2019.

Golubeva, A., Neyshabur, B., and Gur-Ari, G. Are wider nets better given the same number of parameters? arXiv preprint arXiv:2010.14495, 2020.

Han, Y., Huang, G., Song, S., Yang, L., Wang, H., and Wang, Y. Dynamic neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hinton, G. E., Osindero, S., and Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708, 2017.

Huynh, P. K., Varshika, M. L., Paul, A., Isik, M., Balaji, A., and Das, A. Implementing spiking neural networks on neuromorphic architectures: A review. arXiv preprint arXiv:2202.08897, 2022.

Indyk, P. and Silwal, S. Faster linear algebra for distance matrices. arXiv preprint arXiv:2210.15114, 2022.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. Commun. ACM, 60(6):84–90, May 2017. ISSN 0001-0782.

Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA, 2000. Morgan Kaufmann.

LeCun, Y. et al. LeNet-5, convolutional neural networks. URL: http://yann.lecun.com/exdb/lenet, 20(5):14, 2015.

Li, W., Wang, L., Li, W., Agustsson, E., and Van Gool, L. WebVision database: Visual learning and understanding from web data. arXiv preprint arXiv:1708.02862, 2017.

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022, 2021.


Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pp. 1310–1318. PMLR, 2013.

Patil, P. A. and Kulkarni, C. A survey on multiply accumulate unit. In 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), pp. 1–5. IEEE, 2018.

Ren, P., Xiao, Y., Chang, X., Huang, P.-Y., Li, Z., Chen, X., and Wang, X. A comprehensive survey of neural architecture search: Challenges and solutions. ACM Computing Surveys (CSUR), 54(4):1–34, 2021.

Roy Chowdhury, A., Sharma, P., and Learned-Miller, E. G. Reducing duplicate filters in deep neural networks. 2018.

Scarselli, F. and Tsoi, A. C. Universal approximation using feedforward neural networks: A survey of some existing methods, and some new results. Neural Networks, 11(1):15–37, 1998.

Sethian, J. A. and Chopp, D. Motion by intrinsic Laplacian of curvature. Interfaces and Free Boundaries, 1(1):107–123, 1999.

Stanley, K. O. and Miikkulainen, R. Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2):99–127, 2002.

Stork, J., Zaefferer, M., and Bartz-Beielstein, T. Improving neuroevolution efficiency by surrogate model-based optimization with phenotypic distance kernels. In International Conference on the Applications of Evolutionary Computation (Part of EvoStar), pp. 504–519. Springer, 2019.

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826, 2016.

Weinberg, S. The Quantum Theory of Fields, volume 2. Cambridge University Press, 1995.

Zhang, R., Redford, S. A., Ruijgrok, P. V., Kumar, N., Mozaffari, A., Zemsky, S., Dinner, A. R., Vitelli, V., Bryant, Z., Gardel, M. L., et al. Spatiotemporal control of liquid crystal structure and dynamics through activity patterning. Nature Materials, 20(6):875–882, 2021.


A. Symbol Glossary

The label on the right indicates where the symbol or notation is first defined or used. As usual, $\mathbb{R}$, $\mathbb{Z}$, $\mathbb{C}$, and $\mathbb{N}$ denote the reals, the integers, the complex numbers, and the natural numbers, respectively.

| Symbol | Meaning | First defined or used |
|---|---|---|
| $q_i^{(l,t)} \in \mathbb{R}^d$ | The dynamical states of a sub-model $i$ of subsystem $l$ at time-step $t$ | Sec. 2 |
| $E_i^{(l,t)}: \mathbb{R}^d \mapsto \mathbb{R}^s$ | Vector field for the signals emitted by $q_i^{(l,t)}$ | Figure 2 |
| $R_i^{(l,t)}: \mathbb{R}^d \mapsto \mathbb{R}^s$ | Vector field for the signals received by $q_i^{(l,t)}$ | Figure 2 |
| $R_i^{(l,t)}, E_i^{(l,t)}$ | The simplified notations for $R_i^{(l,t)}(0)$ and $E_i^{(l,t)}(0)$ | Figure 2 |
| $N_l$ | The number of sub-models in the subsystem $P^{(l)}$ | Sec. 3 |
| $R^{(l,t)}, E^{(l,t)}$ | The simplified notations for $[R_1^{(l,t)}, \dots, R_{N_l}^{(l,t)}]$ and $[E_1^{(l,t)}, \dots, E_{N_l}^{(l,t)}]$ | Sec. 3 |
| $S_{ij}^{(l,t,t')}(v) \in \mathbb{R}^s$ | The temporal signal at $v \in \mathbb{R}^d$ from $q_i^{(l,t)}$ to $q_j^{(l,t')}$ | Figure 2 |
| $P(x)$ | Subsystem containing $x$ sub-models | Table 1 |
| $A, B, C, D, E, F$ | The matrices used to present the principle of dynamic subsystems | Eq. 2 |
| $\varphi: \mathbb{R}^d \times \mathbb{R}^d \mapsto \mathbb{R}$ | A metric function that helps to compute the path integral between sub-models | Eq. 2 |
| $P^{(l)}$ | Subsystem with an index of $l$ | Sec. 2 |
| $G$ | A directed graph that describes the topological relations amongst the subsystems | Sec. 2 |
| $\Psi_R, \Psi_E, \Psi_Q$ | The subsystem-related mappings that describe the dynamic properties of a subsystem $l$ | Eq. 3 |
| $l'$ | The index of a subsystem such that there is a directed edge from $P^{(l')}$ to $P^{(l)}$ in graph $G$ | Eq. 3 |
| $F_{ij}$ | The stress force between sub-models $q_i$ and $q_j$ | Eq. 6 |
| $v_{ij}$ | Relative position between the current $q_i^{(l,t)}$ and its adjacent $q_j^{(l+1,t)}$ | Eq. 6 |
| $v_{ij}^{(xy)}$ | Relative position between arbitrary $q_i^{(x)}$ and another arbitrary $q_j^{(y)}$ | Eq. 7 |
| $\varphi^{(xy)}$ | Path integrals between each sub-model in $P^{(x)}$ and $P^{(y)}$ | Eq. 7 |
| $\Phi_j^{(xy)}$ | The signals received by $q_j^{(y)}$ from all the sub-models in $P^{(y)}$ | Eq. 7 |
| $\mu_h^{(x;y)}$ | The shared coefficient related to the sub-models $q_h^{(x)}$ and $q_h^{(y)}$ | Eq. 5 |
| $\sigma$ | An activation function, e.g., the sigmoid function | Eq. 1 |
| $\delta$ | The inverse resolution; a lower $\delta$ means a better parameter precision | Eq. 12 |
| $p_\delta$ | A function that truncates the model's parameters in terms of $\delta$ | Eq. 12 |
| $\varepsilon$ | An error threshold | Eq. 1 |
| $\epsilon$ | A variational unit that approaches zero | Eq. 2 |
| $\psi$ | The structural entropy | Eq. 4 |
| $H$ | The number of copies | Eq. 5 |
| $L$ | The Lagrangian function | Eq. 56 |
| $L$ | The loss function | Eq. 34 |
| $T^{(out)}$ | The target output for a neural model | Eq. 7 |
| $T$ | A target matrix that a DyN system approximates | Eq. 4 |


B. Terminology Comparison between DNN and DyN

Table 5: Intuitive comparison of DNN and DyN.

| Framework | DNN | DyN |
|---|---|---|
| Basic components | Artificial neurons | Sub-models |
| Structured components | Neural layer | Subsystem |
| Component interaction | Connection between neurons | Path integral between neuronal dynamics |
| Model update manner | Adjust neural weights | Adjust neuronal dynamics |
| Objective function | Classification loss function based on gradient descent | Entropy reduction based on stress force |
| Data flow | Layered representation | Signals |
| System propagation manner | Layer-by-layer | Time-by-time |
| End of training criteria | Convergence | Neuronal states reach equilibrium |

C. Some Principles and Proofs

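Theorem C.1. Principle of dynamic subsystems (the existence of global neuronal rules for the dynamic system as a universal approximator): For every $d, s, M, N \in \mathbb{N}$, given a system of sub-models with a set of time-variant coordinates $\{q_i^{(t)} \in \mathbb{R}^d,\ i \in [1, N]\}$ that receive and emit time-variant signals $R_i^{(t)} \in \mathbb{R}^s$ and $E_i^{(t)} \in \mathbb{R}^s$, then for an arbitrary sequential mapping $\mathbb{R}^{s \times M} \mapsto \mathbb{R}^{s \times M}$ that defines the nonlinear relations between $R_i^{(t)}$ and $E_i^{(t)}$ for any $i \in [1, N]$, there exists a set of matrices $A \in \mathbb{R}^{s \times s}$, $B \in \mathbb{R}^{s \times d}$, $C \in \mathbb{R}^{s \times d}$, $D \in \mathbb{R}^{d \times s}$, $E \in \mathbb{R}^{d \times s}$, and $F \in \mathbb{R}^{d \times d}$, such that $\forall t \in [1, M]$:

$$\begin{aligned}
R_i^{(t)} &= \sum_{j \neq i} E_j^{(t-\epsilon)}\, \varphi\big(q_j^{(t-\epsilon)}, q_i^{(t)}\big) \\
E_i^{(t)} &= A R_i^{(t)} + B q_i^{(t)} + C \frac{d}{dt} q_i^{(t)} \\
\frac{d}{dt} q_i^{(t)} &= D R_i^{(t)} + E E_i^{(t)} + F q_i^{(t)}
\end{aligned} \tag{15}$$

where the non-polynomial 2-form $\varphi: \mathbb{R}^d \times \mathbb{R}^d \mapsto \mathbb{R}$ is used to compute the path integral between sub-models.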
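One way to read the three coupled equations of Theorem C.1 is as an update rule that can be stepped forward in time. The sketch below is only an illustration under stated assumptions: φ is taken to be the Euclidean distance (the theorem only requires a non-polynomial 2-form), the time derivative is discretized with an explicit Euler step of size `dt`, and the names `dyn_step` and `E_mat` (the matrix E, renamed to avoid clashing with the emitted signals) are hypothetical.

```python
import numpy as np

def dyn_step(q_prev, E_prev, q, A, B, C, D, E_mat, F, dt=1e-2):
    """One explicit-Euler step of the coupled system in Theorem C.1 (sketch).

    q_prev, q : (N, d) coordinates at time t - eps and t
    E_prev    : (N, s) signals emitted at time t - eps
    Returns received signals R (N, s), emitted signals E (N, s), next coordinates.
    """
    s = E_prev.shape[1]
    # Assumed phi: phi[j, i] = ||q_j^(t-eps) - q_i^(t)||_2
    phi = np.linalg.norm(q_prev[:, None, :] - q[None, :, :], axis=-1)
    np.fill_diagonal(phi, 0.0)                 # the sum excludes j == i
    R = phi.T @ E_prev                         # R_i = sum_{j != i} E_j * phi(q_j, q_i)

    # E = A R + B q + C dq/dt and dq/dt = D R + E_mat E + F q are coupled.
    # Substituting the second into the first gives a linear system for E:
    # (I - C E_mat) E = (A + C D) R + (B + C F) q
    lhs = np.eye(s) - C @ E_mat
    rhs = (A + C @ D) @ R.T + (B + C @ F) @ q.T
    E_now = np.linalg.solve(lhs, rhs).T

    dq = (D @ R.T + E_mat @ E_now.T + F @ q.T).T
    return R, E_now, q + dt * dq
```

The substitution step is a design choice of this sketch: because the emitted signal depends on the velocity and the velocity depends on the emitted signal, solving the small linear system keeps both equations satisfied exactly at each step.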

Proof. According to Lemma C.2 and Lemma C.3, an arbitrary linear transformation requires finitely many distinct subsystems, and an arbitrary nonlinear transformation $T_S(R_i^{(t)}) = E_i^{(t+\epsilon)}$ can be achieved by a universal rule of dynamics $\hat{\Psi}^{(S)}$,

$$T_S(R_i^{(t)}),\ q_i^{(t+\epsilon)} = \hat{\Psi}^{(S)}\big(R_i^{(t)}, q_i^{(t)}\big) \tag{16}$$

where $E_i^{(t)}$ is the signal emitted from the sub-model $q_i^{(t)}$ itself, and $R_i^{(t)}$ is the resultant signal received by $q_i^{(t)}$ from all the other sub-models. Eq. 16 can be regarded as the static form of Theorem C.1. Likewise, Lemma C.3 reveals that we can also approximate an arbitrary nonlinear transformation $T_Q(q_i^{(t)}) = q_i^{(t+\epsilon)}$ by a universal rule of dynamics $\hat{\Psi}^{(Q)}$,

$$E_i^{(t+\epsilon)},\ T_Q(q_i^{(t)}) = \hat{\Psi}^{(Q)}\Big(R_i^{(t)},\ T_{LQ}\big(q_i^{(t)}, \tfrac{\partial}{\partial t} q_i^{(t)}\big)\Big) \tag{17}$$

where $T_{LQ}$ is a linear transformation that interprets the continuous transformation of the dynamical states. Thus, Eq. 17 can be reinterpreted via a rule of dynamics in the linear form $\hat{\Psi}^{(LQ)}$,

$$E_i^{(t+\epsilon)},\ T_Q(q_i^{(t)}) = \hat{\Psi}^{(LQ)}\Big(R_i^{(t)},\ q_i^{(t)},\ \tfrac{\partial}{\partial t} q_i^{(t)}\Big) \tag{18}$$

Then the dynamical form Eq. 15 can be proved by induction. Specifically, given a set of signals $E^{(t)} = \{E_i^{(t)}\}$ and $R^{(t)} = \{R_i^{(t)}\}$ accompanied by $Q^{(t)} = \{q_i^{(t)}\}$, we first reach arbitrary expected states of signals $\{R^{(t)}, E^{(t+\epsilon)}\}$ via Eq. 16, then we reach arbitrary expected states of sub-models $Q^{(t+\epsilon)}$ via Eq. 18, and subsequently reach arbitrary targeted signals $\{R^{(t+\epsilon)}, E^{(t+2\epsilon)}\}$. This recursive step finally constructs a complete dynamic form of Lemma C.3. This fact implies that sub-models with specific dynamical states can approximate an arbitrary instant nonlinear transformation. Such specified dynamical states can be obtained from a specific predecessor state, under the constraint that the received signals are provided.

Lemma C.2. The weighted sum of $H$ distinct distance matrices is sufficient to approximate any matrix $T \in \mathbb{R}^{m \times n}$ to any degree of precision. Specifically, the upper bound of an optimized $H$ is given by

$$H_{optimized} \leq \frac{mn}{d\,(m + n)} \tag{19}$$

where $d$ refers to the dimension of the units whose L2-norms are computed to generate those distance matrices.

Proof. First, we count the number of distinguished values an arbitrary quantized matrix can have. The matrix elements $T_{ij}$ range from 0 to 1 with a resolution $1/\delta$ that divides the domain into $1/\delta$ partitions. Then the total number of distinguished values is $\Omega(T) = \delta^{-mn}$. Likewise, the permutations of $m$ units $Q_{row} \in \mathbb{R}^{m \times d}$ and $n$ units $Q_{col} \in \mathbb{R}^{n \times d}$ are $\Omega(Q_{row}) = \delta^{-md}$ and $\Omega(Q_{col}) = \delta^{-nd}$, implying that the number of combinations of $Q_{row}$ and $Q_{col}$ is $\Omega([Q_{row}; Q_{col}]) = \Omega(Q) = \delta^{-d(m+n)}$. Then we need to eliminate the duplicate states of $Q$, which can be categorized into three cases, i.e., self-permutation (SP), translational invariance (TI), and rotational invariance (RI). Specifically, SP refers to the case in which the identities of sub-models are exchanged and the resulting distance matrix remains unchanged. TI and RI mean that when the sub-models of $Q_{row}$ and $Q_{col}$ move together in a translational or rotational way, the resulting distance remains unchanged. The numbers of the three cases are approximated as follows:

ΩSP (Q) = (m! n!)d

ΩT I(Q) = δ 1

k=1 (1 Lrow k Lcol k )

ΩRI = Sf (d)(max(Lrow k Lcol k ) δ 1)

where $L_k^{row} = \max(Q_{row}[:, k]) - \min(Q_{row}[:, k])$ measures the maximum range of $Q_{row}$ in the $k$-th dimension, and $S_f(d)$ is the spherical area of the $d$-sphere. Therefore, each pair of $Q_{row}$ and $Q_{col}$ covers $\hat{\Omega}(Q)$ non-duplicate states:

$$\hat{\Omega}(Q) = \Omega(Q) - \Omega_{SP}(Q) - \Omega_{TI}(Q) - \Omega_{RI}(Q) \tag{21}$$

Then the total number of non-duplicate states achieved by $H$ subsystems is

$$\Omega([Q^{(1)}; Q^{(2)}; \dots; Q^{(H)}]) = \Omega_H(Q) = \big(\hat{\Omega}(Q)\big)^H \tag{22}$$

Our goal is to find a proper $H$ such that $\Omega_H(Q) = \Omega(T)$, yielding

$$\lim_{\delta \to 0} H = \frac{mn}{d\,(m + n)} \tag{23}$$

which is consistent with Eq. 19.
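As a quick sanity check on the bound mn / (d(m + n)), the sketch below evaluates it and compares parameter counts. The function names are hypothetical, and the count H·d·(m + n) assumes that each of the H copies stores one d-dimensional state per row unit and per column unit, which is how the counting argument above is set up.

```python
def copies_upper_bound(m, n, d):
    """Upper bound on the number of copies H for an m x n target matrix."""
    return (m * n) / (d * (m + n))

def dyn_param_count(m, n, d, H):
    """Parameters of H copies, each holding d-dimensional states for m + n units."""
    return H * d * (m + n)

# Example: a 100 x 100 weight matrix with d = 10.
H_max = copies_upper_bound(100, 100, 10)
params_at_bound = dyn_param_count(100, 100, 10, H_max)
```

At the bound, the state-based representation uses exactly as many parameters as the dense matrix (m·n), so any H below the bound corresponds to a net reduction in parameter size.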

Lemma C.3. Existence of a universal rule for static subsystems: given a DyN system composed of sub-models whose rules of dynamics determine their successive states as they interact with emitted or received signals, for any continuous function that maps the received signals to the emitted signals of all the sub-models, there exists a universal rule of dynamics followed by every sub-model, ensuring that the relation between the received signals and the emitted signals of all the sub-models approximates the provided continuous function. The universal rule of dynamics is mathematically equivalent to a set of linear transformations.


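Proof. Suppose there exists an expected universal rule $\Psi$, and a fake sub-model $q^{(t)}$ that complements the signals received by each real sub-model. Each real sub-model $q_i^{(t)}$ is equipped with an exclusive rule of dynamics $\Psi_i$ that might be different from $\Psi$, so that

$$E_i^{(t+\epsilon)},\ q_i^{(t+\epsilon)} = \Psi_i\big(R_i^{(t)}, q_i^{(t)}\big) = \Psi\big(R_i^{(t)} + S_i^{(t)}, q_i^{(t)}\big) \tag{24}$$

Taking the total derivatives over the time-step on the middle and right sides of Eq. 24 gives

$$\frac{\partial \Psi_i}{\partial q_i^{(t)}} \frac{\partial q_i^{(t)}}{\partial t} + \frac{\partial \Psi_i}{\partial R_i^{(t)}} \frac{\partial R_i^{(t)}}{\partial t} = \frac{\partial \Psi}{\partial q_i^{(t)}} \frac{\partial q_i^{(t)}}{\partial t} + \frac{\partial \Psi}{\partial R_i^{(t)}} \frac{\partial R_i^{(t)}}{\partial t} + \frac{\partial \Psi}{\partial S_i^{(t)}} \frac{\partial S_i^{(t)}}{\partial t} \tag{25}$$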

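The signal emitted from the fake sub-model is expected to be a zero constant; therefore,

$$\frac{\partial \Psi_i}{\partial q_i^{(t)}} \frac{\partial q_i^{(t)}}{\partial t} - \frac{\partial \Psi}{\partial q_i^{(t)}} \frac{\partial q_i^{(t)}}{\partial t} = \frac{\partial \Psi}{\partial R_i^{(t)}} \frac{\partial R_i^{(t)}}{\partial t} - \frac{\partial \Psi_i}{\partial R_i^{(t)}} \frac{\partial R_i^{(t)}}{\partial t} \tag{26}$$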

since q(t) i and R(t) i cannot be constant variables, the Eq. 26 is satisfied for all cases if and only if

Ψi q(t) i Ψ

R(t) i R(t) i t = 0 (27)

which implies that

$$\Psi\big(R_i^{(t)}, q_i^{(t)}\big) = W_q\, q_i^{(t)} + W_R\, R_i^{(t)} + b = \Psi_i\big(R_i^{(t)}, q_i^{(t)}\big) + b' \tag{28}$$

where $W_q$, $W_R$, $b$, and $b'$ are proper matrices. Hence, if each sub-model's rule of dynamics is a complete linear transformation $\Psi_i$, then there exists a universal linear transformation that can substitute for each specific $\Psi_i$ such that the expected nonlinear transformation of the DyN system remains unchanged, based on Lemma C.4.

Lemma C.4. If each sub-model's dynamics rule is a linear transformation, the configured system can approximate any continuous function to the expected precision.

Proof. According to Lemma C.2, the configured system can approximate any matrix corresponding to the tensor-based neural weights by manipulating the sub-models. Therefore, the combination of each sub-model's linear transformation and the overall weights between each pair of sub-models is equivalent to a feedforward neural network with arbitrary width, as illustrated in the universal approximation theorem (Scarselli & Tsoi, 1998).

D. Path integral formulation for neuromorphic system

Recall that the emitting signals are denoted by $E_i^{(t)} \in \mathbb{R}^d$, and the receiving signals are denoted by $R_i^{(t)}$. We also introduce the dynamical signal $S_{ij}^{(t',t)}(v) \in \mathbb{R}^d$ to analyze the temporal state of the signal from $q_i^{(t')}$ to $q_j^{(t)}$. As presented in Eq. 3, we can represent the relation between $S_{ij}^{(t',t)}(q_i^{(t')})$ and $S_{ij}^{(t',t)}(q_j^{(t)})$ via a nonlinear subsystem-related function $\Psi: \mathbb{R}^s \times \mathbb{R}^d \mapsto \mathbb{R}^s$ as follows:

$$S_{ij}^{(t',t)}\big(q_j^{(t)}\big) = \Psi\Big(S_{ij}^{(t',t)}\big(q_i^{(t')}\big),\ q_j^{(t)} - q_i^{(t')}\Big) \tag{29}$$

This equation is also mathematically equivalent to the path integral between $q_i^{(t')}$ and $q_j^{(t)}$ as follows:

$$S_{ij}^{(t',t)}\big(q_j^{(t)}\big) = \int_{q_i^{(t')}}^{q_j^{(t)}} \hat{\Psi}\Big(S_{ij}^{(t',t)}(v),\ v\Big)\, dv \tag{30}$$

where $\hat{\Psi}: \mathbb{R}^s \times \mathbb{R}^d \mapsto \mathbb{R}^s$ refers to the global field consisting of several $\Psi^{(l)}$ corresponding to distinct subsystems $P^{(l)}$. In fact, we can conclude that the relation between $\hat{\Psi}$ and the $\Psi^{(l)}$ is

$$\hat{\Psi}(S, u) = \sum_l W^{(l)}\, \Psi^{(l)}(S, u) \tag{31}$$

where the $W^{(l)}$ are trainable linear transformations. Under this setting, we can assume that the emitting time $t'$ is irrelevant because the biases induced by the time delay can be learned by the $W^{(l)}$. Specifically, the bias induced by each subsystem is polynomial in the resultant bias induced by all the subsystems. Thus, the relation between the biases can be approximated via a weighted sum of linear transformations.


E. Principle of hierarchical structures

Let us define $S_{ij}^{(k,t)}$ as the signals emitted from $q_i^{(l,t')}$ at an unspecified time-step $t'$ and received by $q_j^{(k,t)}$ at a specified time-step $t$. Then the directed edge from $P^{(l)}$ to $P^{(k)}$ means that there exists a linear mapping $T^{(lk)}: E_i^{(l,t)} \times q_i^{(l,t)} \times q_j^{(k,t+1)} \mapsto S_{ij}^{(k,t+1)}$ and a linear mapping $T^{(k)}: q_j^{(k,t)} \times S_{ij}^{(k,t)} \mapsto q_j^{(k,t+1)}$ such that

$$E_j^{(k,t+1)} = \sum_{i \neq j} T^{(lk)}\Big(E_i^{(l,t)},\ q_i^{(l,t)},\ T^{(k)}\big(q_j^{(k,t)},\ T^{(lk)}(E_i^{(l,t-1)}, q_i^{(l,t-1)}, q_j^{(k,t)})\big)\Big) \tag{32}$$

According to Lie algebra homomorphism, there exists a nonlinear mapping $T^{(lk)}$ such that

$$\frac{\partial}{\partial t} E_j^{(k,t+1)},\ \frac{\partial}{\partial t} q_j^{(k,t)} = \sum_{i \neq j} T^{(lk)}\Big(\frac{\partial}{\partial t} E_i^{(l,t-1)},\ \frac{\partial}{\partial t} q_i^{(l,t)}\Big) \tag{33}$$

Eq. 33 is the principle of hierarchical structures (PoHS). Suppose a tree-based structure describes the recursive relations between the root system and its subsystems, sub-subsystems, etc. Then PoHS states that this tree-based structure is equivalent to a linearly hierarchical structure containing a set of subsystems. Furthermore, Eq. 33 reveals that a well-formed neuromorphic system does not require a specified set of discrete trainable units isolated from each other.

F. Consistency with back-propagation

First, let us deduce Eq. 6 using the approach applied in back-propagation. This equation is initially derived from a dynamics-inspired view in the main paper. Specifically, we deduce Eq. 6 by computing the gradient of the loss function for each trainable parameter via the chain rule. The loss function between two arbitrary sub-models $q_i^{(l)}$ and $q_j^{(l+1)}$ of distinct subsystems is defined as

$$L_{ij} = \big(A_{ij} - \varphi(q_i^{(l)}, q_j^{(l+1)})\big)^2, \qquad L_i = \sum_{j=1}^{N} L_{ij} \tag{34}$$

where $A \in \mathbb{R}^{M \times N}$ and $[\varphi(q_i^{(l)}, q_j^{(l+1)})]_{M,N} \in \mathbb{R}^{M \times N}$ are, respectively, the target matrix and the weighted distance between sub-models (defined in Eq. 5). Then we calculate the partial derivative of the loss function $L_{ij}$ for a sub-model $q_{i;h}^{(l)}$ as

q(l) i;h = 2Fij µ(l;l+1) h vij;h vij;h p (35)

where $v_{ij;h} = q_{i;h}^{(l)} - q_{j;h}^{(l+1)}$, and the stress force between sub-models under a target $A$ is denoted by $F_{ij} = A_{ij} - \varphi(q_i^{(l)}, q_j^{(l+1)})$. Since the collective loss function for a sub-model $q_{i;h}^{(l)}$ is $L_i$, we have

q(l) i;h t =

q(l) i;h = µ(l;l+1) h

vij;h vij;h p Fij (36)

which is consistent with Eq. 6. Similarly, the update rules for $q_{j;h}^{(l+1)}$ and $\mu_h^{(l;l+1)}$ are obtained by computing the gradient of the relevant loss function. These facts guarantee that the dynamics-inspired update rules are consistent with the rules derived by gradient descent on a specified loss function, as in back-propagation. Therefore, we can extend Eq. 6 to a detailed formulation by applying back-propagation to the stress force $F_{ij}$, which is replaced with the gradient of the loss function for the sub-models rather than the neural weights.
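The stress-force view of the update can be illustrated with a small numerical sketch. It makes simplifying assumptions that are not taken from the paper: φ is the plain Euclidean distance with a single copy and unit coefficient (the weighted form of Eq. 5 is not reproduced here), the states of both subsystems are updated by plain gradient descent on the summed squared stress, and the names `pairwise_dist`, `stress_step`, and `fit_distance_matrix` are hypothetical.

```python
import numpy as np

def pairwise_dist(Qa, Qb, eps=1e-12):
    """Relative positions v_ij (M, N, d) and distances phi_ij (M, N)."""
    diff = Qa[:, None, :] - Qb[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=-1) + eps)   # eps avoids division by zero
    return diff, dist

def stress_step(Qa, Qb, A, lr=0.01):
    """One gradient-descent step on sum_ij (A_ij - phi_ij)^2 w.r.t. the states."""
    diff, dist = pairwise_dist(Qa, Qb)
    F = A - dist                                       # stress force F_ij
    unit = diff / dist[..., None]                      # v_ij / ||v_ij||
    grad_a = -2.0 * np.sum(F[..., None] * unit, axis=1)
    grad_b = 2.0 * np.sum(F[..., None] * unit, axis=0)
    loss = float(np.sum(F ** 2))
    return Qa - lr * grad_a, Qb - lr * grad_b, loss

def fit_distance_matrix(A, d=3, steps=300, lr=0.01, seed=1):
    """Move randomly initialized sub-models until their distances approach A."""
    rng = np.random.default_rng(seed)
    Qa = rng.normal(size=(A.shape[0], d))
    Qb = rng.normal(size=(A.shape[1], d))
    losses = []
    for _ in range(steps):
        Qa, Qb, loss = stress_step(Qa, Qb, A, lr=lr)
        losses.append(loss)
    return Qa, Qb, losses
```

When the stress force on a pair is positive (the target exceeds the current distance) the two sub-models are pushed apart, and when it is negative they are pulled together, which matches the mechanical reading of the feedback signal.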

Consider a multilayer perceptron that contains two tensor-formed weights $W^{(ih)} \in \mathbb{R}^{N_{in} \times N_h}$ and $W^{(ho)} \in \mathbb{R}^{N_h \times N_{out}}$. The signals transmitted along each layer are defined using the sequence $[R^{(in)},\ E^{(in)} = R^{(in)},\ R^{(h)},\ E^{(h)} = \sigma(R^{(h)}),\ R^{(out)},\ E^{(out)} = R^{(out)}]$. Our goal is to make $E^{(out)} \approx T^{(out)}$. First, we define a computation-friendly formulation of the path integral between two arbitrary sub-models $q_i^{(x)}$ and $q_j^{(y)}$ as follows.

I(xy) ij = 1

2(q(x) i q(y) j ) 2 = 1

h=1 µh v(xy) ij;h


where H is the number of shared coefficients required to convert non-linearity into a good linearity set. The number of H is discussed in Lemma C.2. Recall that the signals along with each layer are computed as follows

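$$R_j^{(h)} = \sum_i \sigma\big(R_i^{(in)}\big)\, I_{ij}^{(in;h)}, \qquad R_k^{(out)} = \sum_j \sigma\big(R_j^{(h)}\big)\, I_{jk}^{(h;out)} \tag{38}$$

The loss function is defined by

$$L = \sum_k \frac{1}{2}\big(R_k^{(out)} - T_k^{(out)}\big)^2 = \sum_k \frac{1}{2}\varepsilon_k^2 \tag{39}$$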
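The forward pass of Eq. 38 and the loss of Eq. 39 can be sketched as follows. This is an illustration under assumptions: the path integral is taken as half the squared Euclidean distance between sub-model states with the shared coefficients omitted, σ is the sigmoid, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def path_integral(Qx, Qy):
    """I_ij = 0.5 * ||q_i^(x) - q_j^(y)||^2 (assumed form, shared coefficients omitted)."""
    diff = Qx[:, None, :] - Qy[None, :, :]
    return 0.5 * np.sum(diff ** 2, axis=-1)

def forward(R_in, Q_in, Q_h, Q_out):
    """Signals received by the hidden and output sub-models (cf. Eq. 38)."""
    R_h = sigmoid(R_in) @ path_integral(Q_in, Q_h)
    R_out = sigmoid(R_h) @ path_integral(Q_h, Q_out)
    return R_h, R_out

def loss(R_out, T_out):
    """Half the summed squared error between output signals and targets (cf. Eq. 39)."""
    return 0.5 * float(np.sum((R_out - T_out) ** 2))
```

Note that no weight matrix is stored: the quantities that play the role of weights are recomputed from the sub-model states `Q_in`, `Q_h`, and `Q_out`, which are the only trainable objects in this sketch.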

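The resultant signals received by different sub-models are defined by

$$\Phi_j^{(xy)} = \sum_i E_{ij}^{(xy)} = \sum_i \sigma\big(S_i^{(x)}\big)\, v_{ij}^{(xy)}, \qquad \Phi_i^{(in)} = \sigma\big(S_i^{(in)}\big), \qquad \Phi_k^{(out)} = T_k^{(out)} - S_k^{(out)} \tag{40}$$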

Instead of computing the gradient of the loss function for the weights (path integral Iij), we compute the gradients for the dynamical states of a sub-model as q(out) k

q(out) k t L

q(out) k = X

I(h;out) jk

I(h;out) jk q(out) k

j σ(S(h) j ) v(h;out) jk

= Φ(out) k Φ(h;out) k

A more detailed update rule regarding µh is as follows:

q(out) k;h t L

q(out) k;h = Φ(out) k Φ(h;out) k µh (42)

Similarly, we compute the gradient of the loss function for q(in) i ,

q(in) i t = L

q(in) i = X

I(in;h) ij q(in) i

S(out) k S(h) j

S(h) j I(in;h) ij

I(in;h) ij q(in) i

k εk I(h;out) jk σ(S(h) j )

S(h) j σ(S(in) i ) v(in;h) ij

= σ(S(in) i ) X

j v(in;h) ij σ(S(h) j )

k εk I(h;out) jk

= Φ(in) i X

j v(h;in) ji σ(S(h) j ) (1 σ(S(h) j )) Φ(out;h) j

In the equilibrium state (meaning that the feedback signal $\Phi_j^{(out;h)}$ is extremely weak and stable), the term $(1 - \sigma(S_j^{(h)}))\, \Phi_j^{(out;h)}$ degenerates to a specific constant independent of the index $j$, so that Eq. 43 can be approximated as

$$\frac{\partial q_i^{(in)}}{\partial t} \propto \Phi_i^{(in)} \sum_j v_{ji}^{(h;in)}\, \sigma\big(S_j^{(h)}\big) = \Phi_i^{(in)}\, \Phi_i^{(h;in)} \tag{44}$$


Eq. 44 is obviously consistent with Eq. 7. Now we have the dynamical forms of update rules for sub-models toward a specific loss function. In other words, we can approximate arbitrary nonlinear functions via training the sub-models rather than the neural weights connecting them.

G. Generalized rules of dynamics in DyN systems

The neuromorphic dynamics are derived from Hamilton's principle and the Euler-Lagrange equation:

$$\frac{d}{dt} \frac{\partial L_i^{(l,t)}}{\partial \dot{q}_i^{(l,t)}} - \frac{\partial L_i^{(l,t)}}{\partial q_i^{(l,t)}} = 0 \tag{45}$$

where the Lagrangian L(l,t) i = S(l,t) i ψ(l,t) i measures the energy distribution of signals S(l,t) i and structural entropy ψ(l,t) i . According to Lagrangian mechanics described in Eq. 45, where the non-relativistic Lagrangian L for sub-models in a specific subsystem is defined by

$$L = T - V = T = \frac{1}{2} m_0\, \dot{r}^2 \tag{46}$$

where r represents the dynamical state of a sub-model. Thus

Then substitute Eq. 47 into Eq. 45, obtaining

Summing both sides of Eq. 49

r(t) xk t 2r(t) xk t2 = m0

L(t) xk t (49)

To satisfy the conservation of momentum and Newton s third law, we have

Then the middle term of Eq. 49 can be simplified to

= m0 r(t) xx t 2r(t) xx t2 (51)

Summing both sides of Eq. 46

k =x L(t) xk = m0

Then according to Eq. 50, Eq. 52 can be simplified to

k =x L(t) xk = m0

Then we substitute Eq. 53 into Eq. 51 to eliminate m0, obtaining

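$$
\frac{\partial^2 r^{(t)}_{xx}}{\partial t^2} = -\frac{1}{2} \, \frac{\sum_{k \neq x} \partial_t L^{(t)}_{xk}}{\sum_{k \neq x} L^{(t)}_{xk}} \, \frac{\partial r^{(t)}_{xx}}{\partial t} = -\frac{\Lambda^{(t)}_x}{2} \, \frac{\partial r^{(t)}_{xx}}{\partial t} \tag{54}
$$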


Figure 6: Equivalence between neuromorphic learning and entropy reduction, shown at (a) Frame-id=1, (b) Frame-id=5, and (c) Frame-id=10. As presented by Eq. 56, the sub-models tend to move toward the region with lower structural entropy, which is visualized by the colored spatial distribution.

Note that $L$ can be approximated as time-invariant when $t \to 0$, since $L$ varies with the combination of all sub-models and signals, whose overall dynamics are relatively static for a particular sub-model. Treating $\Lambda^{(t)}_x$ as constant over this short interval, we solve the differential equation Eq. 54, which, after absorbing the constant factor $1/2$ into the time scale, yields

$$
\frac{\partial r^{(t)}_{xx}}{\partial t} = \eta_x \exp\left( -\Lambda^{(t)}_x t \right) \tag{55}
$$

where the entropy indicator $\Lambda^{(t)}_x$ measures the structural entropy of $L$ over the system of sub-models (it can be evaluated via methods similar to Eq. 13), and $\eta_x$ is a constant related to its corresponding sub-model $q^{(t)}_x$.
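As a sanity check on the step from the second-order relation to the exponential solution, the sketch below integrates a velocity equation numerically and compares it with the closed form. It assumes that $\Lambda$ is constant over the interval and that the velocity decays, i.e. $dv/dt = -(\Lambda/2)\,v$, which matches the decaying exponentials used later for the workload. The names `integrate_velocity`, `eta`, and `lam` are illustrative and not from the paper.

```python
import math


def integrate_velocity(eta: float, lam: float, t_end: float, steps: int = 20000) -> float:
    """Integrate dv/dt = -(lam / 2) * v from v(0) = eta with classical RK4.

    Assumption: lam (the entropy indicator) is held constant in time.
    """
    dt = t_end / steps
    rate = -0.5 * lam
    v = eta
    for _ in range(steps):
        k1 = rate * v
        k2 = rate * (v + 0.5 * dt * k1)
        k3 = rate * (v + 0.5 * dt * k2)
        k4 = rate * (v + dt * k3)
        v += dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return v


eta, lam, t_end = 1.5, 0.8, 3.0
numeric = integrate_velocity(eta, lam, t_end)
# Closed-form solution of the same ODE: v(t) = eta * exp(-lam * t / 2).
closed_form = eta * math.exp(-0.5 * lam * t_end)
print(f"numeric={numeric:.10f}  closed_form={closed_form:.10f}")
```

Rescaling time by the constant factor 1/2 turns the closed form into $\eta \exp(-\Lambda t)$, so the qualitative behavior (velocity decaying at a rate set by the entropy indicator) does not depend on that factor.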

The comprehensive form of Eq. 55 is as follows:

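$$
\frac{\partial r^{(l,t)}_i}{\partial t} = \eta_i \exp\left( -\frac{\sum_{k \neq i} \partial_t L^{(l,t)}_{ik}}{\sum_{k \neq i} L^{(l,t)}_{ik}} \, t \right) = \eta_i \exp\left( -\Lambda^{(l,t)}_i t \right) \tag{56}
$$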

where $r^{(l,t)}_i$ is the positional vector of a sub-model, and $\eta_i$ is a constant related to $q^{(l,t)}_i$. This equation leaves the Lagrangian $L$ unspecified; for instance, $L^{(l,t)}_{ik} = S^{(l,t)}_{ik} \Phi^{(l,t)}_i$, where $S^{(l,t)}_{ik}$ is emitted from $q^{(l,t)}_i$ and received by $q^{(l,t)}_k$, and is influenced by the resultant potential field $\Phi^{(l,t)}_i$ around $q^{(l,t)}_i$. The signals $S^{(l,t)}_{ik}$ refer to the feedback control correlated with the loss function for the current task, and the potential field $\Phi^{(l,t)}_i$ is a trainable parameter related to distinct sub-models. Note that we can simplify $\Phi^{(l,t)}_i$ to a constant field by adding the shared coefficients applied in Eq. 5.

H. Conservation of workload and computational complexity with increasing number of sub-models

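By Hamilton's principle, the variables in Eq. 54 and Eq. 55 should satisfy several restrictions, including

$$
\sum_{x=1}^{N} \eta_x = C_\eta, \qquad \sum_{k \neq x} L^{(t)}_{xk} = C_L, \qquad \sum_{k \neq x} \partial_t L^{(t)}_{xk} = C_{\partial L} \tag{57}
$$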

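where $C_\eta$, $C_L$ and $C_{\partial L}$ are time-invariant constants. Therefore, the summation of the entropy indicator $\Lambda^{(t)}_x$ can also be approximated as a time-invariant constant

$$
\sum_{x=1}^{N} \Lambda^{(t)}_x \approx \frac{C_{\partial L}}{C_L} = N \bar{\Lambda} \tag{58}
$$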


Figure 7: (a) The ratio of output Cv to input Cv as the number of layers increases, for different numbers of sub-models in the hidden layer (width). (b) The horizontal axis shows the logarithm of the ratio of parameter size to resolution (LRPR); the vertical axis shows the testing accuracy of the truncated models corresponding to each specific LRPR.

where $N$ is defined as the total number of sub-models, and $\bar{\Lambda}$ is defined as a constant referring to the averaged entropy indicator. Therefore, the total path length of all sub-models can be approximated in terms of Eq. 54 as a time-variant function

$$
I(t) = \sum_{x=1}^{N} \eta_x \exp\left( -\Lambda^{(t)}_x t \right) \approx C_\eta \exp\left( -\bar{\Lambda} t \right) \tag{59}
$$

The time step $t$ is a small value since the sub-models of a DyN system generate signals almost instantaneously. The total workload $W(T)$, which is linearly correlated with the computational complexity, is evaluated as

$$
W(T) = \int_0^T I(t)\,dt = \frac{N C_\eta C_L}{C_{\partial L}} \left( 1 - e^{-\bar{\Lambda} T} \right) \tag{60}
$$

where $T$ is the total time required to reach an equilibrium state. Based on Eq. 60, as the required number of sub-models grows to handle an increasingly complicated computational task, the total workload and computational complexity do not grow accordingly but gradually approach a specific value. This also suggests a neurobiological parallel: once a concept has formed in the brain, the power of the cortical regions related to that concept remains unchanged after further learning events that supply more specific knowledge.
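The saturation claim can be illustrated numerically. The sketch below evaluates the closed-form workload for a growing number of sub-models, assuming the averaged entropy indicator is $\bar{\Lambda} = C_{\partial L} / (N C_L)$ and that $C_\eta$, $C_L$, $C_{\partial L}$ and $T$ stay fixed as $N$ grows. The constant values and the function names are hypothetical choices for illustration only.

```python
import math


def average_entropy_indicator(n: int, c_l: float, c_dl: float) -> float:
    """Averaged entropy indicator, assuming the sum over sub-models is c_dl / c_l."""
    return c_dl / (n * c_l)


def workload(n: int, c_eta: float, c_l: float, c_dl: float, horizon: float) -> float:
    """Closed-form integral of c_eta * exp(-lam_bar * t) over [0, horizon]."""
    lam_bar = average_entropy_indicator(n, c_l, c_dl)
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for small x.
    return (c_eta / lam_bar) * (-math.expm1(-lam_bar * horizon))


# Hypothetical constants, chosen only to make the trend visible.
C_ETA, C_L, C_DL, HORIZON = 2.0, 1.0, 5.0, 3.0
sizes = [1, 10, 100, 1000, 10000]
workloads = [workload(n, C_ETA, C_L, C_DL, HORIZON) for n in sizes]
limit = C_ETA * HORIZON  # value approached as the number of sub-models grows

for n, w in zip(sizes, workloads):
    print(f"N={n:>6d}  W(T)={w:.4f}  (limit {limit:.4f})")
```

Under these assumptions the workload rises with $N$ but stays below $C_\eta T$, which is the bounded behavior described above.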

I. Learning with more subsystems and more sub-models

We apply an interactive mechanism to the current architecture to boost the computational ability of a DyN system without explosive growth of computational complexity. The principle of hierarchical structures (Eq. 33) implies that a well-formed neuromorphic system does not require a specified set of discrete trainable units isolated from each other. Besides, inspired by the phase space density representation (Abarbanel & Rouhi, 1987), a dynamic system represented by infinite interactive particles can be treated as a linear combination of many shallow layers, each of which is interpreted as an isolated dynamic system of a different density. Specifically, each layer of density $\rho_i$ refers to a subsystem with a specific shared coefficient $h_i$, which disassembles the overall neuromorphic system into several partially independent subsystems. Therefore, a discrete sub-model $q^{(l,t)}_{i;k_i}$ is equivalent to an interactive region with a density of $h_{k_i}$. The variational density $g_{ij} = \rho_{k_i} \rho_{k_j} = h_{k_i} h_{k_j}$ measures the potential energy generated by the interaction (receiving or emitting signals) between sub-models:

$$
m_i \, \partial_t q^{(l,t)}_{i;k_i} = \int_{q^{(l,t)}_{j;k_j} \in P^{(l)}} \psi\left( v^{(l,t)}_{ij}, f^{(l,t)}_{ij}, g^{(l,t)}_{ij} \right) dP^{(l)} \tag{61}
$$

where $m_i$ is a constant related to the density $h_{k_i}$, and $\psi$ is a linear transformation that concatenates the sub-models' variational dynamics. The notation here is consistent with Eq. 6. Therefore, a DyN system with infinite sub-models can be approximated by one with finitely many subsystems, in which the dynamical states of sub-models are interactively correlated with other sub-models from all subsystems. The increased number of computing units adds some burden to the software implementation but does not increase the overall computational complexity. This fact is validated experimentally and mathematically (Appendix H).

Figure 8: (a) As the number of parameters transformed and instantiated via the physical view increases, the DyN models outperform their original forms by a growing accuracy margin. The presented models are distinguished by the inverse of resolution δ and the stress threshold that determines when the DyN process ends. (b) SwinDyN refers to the pre-trained SwinT-S-224 model whose neural layers are converted into DyN forms. As δ increases, the original model's performance degrades greatly, while the DyN alternatives remain almost unchanged. The interactive one (Eq. 61) outperforms the isolated one (Eq. 6). This phenomenon is attributed to the global dynamics of sub-models that are moving and interacting with others in a fully stabilized manner.

J. Limitations

Initialization and convergence. In most cases, with a good initialization and proper hyper-parameter settings, DyN can converge within a few steps. However, this behavior is not stable when DyN is initialized with extreme settings. Even though the overall convergence rate of DyN is better than that of an ANN, further investigation is needed before reaching a validated conclusion.

Simultaneity and time delay. In our simulation and experiments, we update the sub-models similarly to an ANN (epoch by epoch). Within an epoch, the feedback signals produce stress forces amongst the sub-models, and the stress forces cause the sub-models to change their dynamical states. This setting works in the software implementation, and we do not need to guarantee that the neurons are updated strictly simultaneously. However, in the idealized setting where the physical properties matter, we need to consider the issue of simultaneity.

Memory usage and parallelism. In common cases, the memory DyN uses at the inference stage is much lower than that of an ANN. For model training, the memory consumed by the trainable parameters (neural states in sub-models) is far less than that of an ANN. Nevertheless, we need extra memory to expand the dynamical states to compute the path integrals. The upper bound of the total memory, including the trainable parameters and the calculation of path integrals, would not exceed that of the original ANN (Lemma C.2). However, there is still room to further reduce the memory consumed by model training.
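To make the parameter-memory comparison concrete, the sketch below counts trainable values for one fully connected layer in two ways: one weight per connection, versus one state vector per sub-model. It assumes each sub-model stores a state vector of a fixed dimension `state_dim`, which is a hypothetical quantity introduced here for illustration, and it counts only trainable parameters, not the extra memory needed to expand states when computing path integrals.

```python
def dense_weight_count(n_in: int, n_out: int) -> int:
    """Trainable values in a weight-based layer: one per connection."""
    return n_in * n_out


def submodel_state_count(n_in: int, n_out: int, state_dim: int) -> int:
    """Trainable values when each sub-model keeps a state vector of size state_dim.

    Assumption: state_dim is a hypothetical per-sub-model dimension.
    """
    return (n_in + n_out) * state_dim


def break_even_state_dim(n_in: int, n_out: int) -> float:
    """State dimension at which both counts coincide."""
    return (n_in * n_out) / (n_in + n_out)


# Illustrative layer sizes, not taken from the paper's experiments.
n_in, n_out, state_dim = 784, 256, 8
dense = dense_weight_count(n_in, n_out)
states = submodel_state_count(n_in, n_out, state_dim)
ratio = states / dense
print(f"dense={dense}  sub-model states={states}  ratio={ratio:.4f}")
print(f"break-even state_dim ~ {break_even_state_dim(n_in, n_out):.1f}")
```

The state-based count grows with the sum of the layer widths rather than their product, so it stays smaller as long as the state dimension is below the break-even value.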