# Deep Signature: Characterization of Large-Scale Molecular Dynamics

Published as a conference paper at ICLR 2025

Tiexin Qin1, Mengxu Zhu1, Chunyang Li2, Terry Lyons3, Hong Yan1, Haoliang Li1

City University of Hong Kong1 & Chengdu Institute of Biological Products Co. Ltd2 & University of Oxford3

{tiexinqin,mengxuzhu}2-c@my.cityu.edu.hk, lichunyang@sinopharm.com, tlyons@maths.ox.ac.uk, {ityan,haoliang.li}@cityu.edu.hk

Understanding protein dynamics is essential for deciphering protein functional mechanisms and developing molecular therapies. However, the complex high-dimensional dynamics and interatomic interactions of biological processes pose significant challenges for existing computational techniques. In this paper, we approach this problem for the first time by introducing Deep Signature, a novel, computationally tractable framework that characterizes complex dynamics and interatomic interactions based on their evolving trajectories. Specifically, our approach incorporates soft spectral clustering, which locally aggregates cooperative dynamics to reduce the size of the system, and a signature transform, which collects iterated integrals to provide a global characterization of the non-smooth interactive dynamics. Theoretical analysis demonstrates that Deep Signature exhibits several desirable properties, including invariance to translation, near invariance to rotation, equivariance to permutation of atomic coordinates, and invariance under time reparameterization. Furthermore, experimental results on three benchmarks of biological processes verify that our approach achieves superior performance compared to baseline methods.

## 1 INTRODUCTION

Biological processes are fundamentally driven by the dynamical changes of macromolecules, particularly proteins and enzymes, within their respective functional conformation spaces.
Typical examples of such processes include protein-ligand binding, molecule transport, and enzymatic reactions, and modern computational biologists investigate their underlying functional mechanisms with molecular dynamics (MD) simulations (Dror et al., 2012; Lewandowski et al., 2015). Built upon density functional theory (Car & Parrinello, 1985), MD has demonstrated a remarkable capability to provide accurate atomic trajectories in three-dimensional (3D) conformational space, in consistent agreement with experimental observations (Frenkel & Smit, 2023).

The computational analysis of MD data has been a subject of extensive research for decades, with the goal of characterizing systems from trajectory information. However, because of the intricate interatomic interactions in large-scale systems across inconsistent timescales, many existing works resort to oversimplified setups that incorporate biophysical priors to analyze only certain aspects of the dynamics, such as protein fluctuations, relaxation time, stability, and state transitions (Law et al., 2017; Qiu et al., 2023). More recently, empowered by the parallel processing ability of GPUs, machine learning, and deep learning in particular, has shed new light on this field, as it can discretize macromolecules as particles distributed in a 3D voxel grid and automatically learn their relations in a data-driven fashion (Li et al., 2020a; Rogers et al., 2023). In parallel, surface modeling-based approaches have emerged, which first use mathematical models to restore protein surfaces and then apply deep learning to analyze the chemical and geometrical features of surface regions around binding sites (Gainza et al., 2020b; Zhu et al., 2021). Despite the great potential of these approaches in automatic drug discovery, their computational complexity increases linearly with the number of time stamps when processing MD data, which makes them difficult to apply to long-time simulations. Moreover, these methods commonly rely on coarse-grained dynamics to accelerate computation, yet selecting an optimal coarse-graining mapping strategy that effectively simplifies the representation of the system while preserving its essential features remains an open research problem (Jin et al., 2022; Majewski et al., 2023).

Another limitation of current MD analysis methods is their insufficient use of structural bioinformatics, owing to the greatly increased difficulty of handling high-order interatomic interactions during dynamic processes. Yet such structural bioinformatics, manifested in various covalent and non-covalent bonds, plays a pivotal role in molecular design because it can propagate local perturbations to facilitate conformational dynamics and alter biological function (Tsai et al., 2009; Otten et al., 2018). An illustrative example is dihydrofolate reductase, which has been widely studied as an important antitumor and antibacterial target for treating tuberculosis and malaria. Four common mutations confer drug resistance to antibiotics, proceeding in a stepwise fashion. Among them, the P21L mutation acts in a dynamical loop region associated with long-range structural vibrations of the protein backbone, rather than directly on the active sites as the other mutations do (Toprak et al., 2012). Therefore, ignoring the interatomic interactive dynamics facilitated by the molecular structure of a critical protein, and counting only the effects of active sites, can result in biased assessments of designed drugs.
Nonetheless, because integrating structural bioinformatics into MD analysis introduces at least quadratic complexity in system size, existing works have not yet investigated this crucial aspect, which leaves a critical gap in our ability to comprehensively analyze and predict molecular behavior in drug design and resistance studies.

To this end, we aim to develop a computationally efficient framework that combines structural bioinformatics with coarse-graining mapping to automatically analyze protein trajectory dynamics. In particular, we first introduce a graph clustering module that learns to extract coarse-grained dynamics by approximating soft spectral clustering. Because the clustering assignment function is implemented by a graph neural network whose parameters are learned automatically, we circumvent the need to select a coarse-graining mapping manually. Subsequently, we introduce a path signature transform module that serves as a feature extractor to characterize the interatomic interactive dynamics after coarse graining. The path signature is a mathematically principled concept that uses iterated integrals to describe geometric rough paths in a compact yet rich manner (Lyons, 2014), and is therefore suitable for our tasks, where molecular trajectories are highly sampled and non-smooth. By attaching a task-specific differentiable classifier or regressor, we obtain an end-to-end framework, named Deep Signature, for efficiently characterizing complex protein dynamics. Notably, because simulated trajectories contain considerable random fluctuation, ideal features ought to respect the symmetries of certain geometric transformations. We provide a theoretical analysis showing that our extracted features exhibit invariance to translation, near invariance to rotation, equivariance to permutation of atomic coordinates, and invariance under time reparameterization of paths.
Finally, we focus on predicting functional properties of proteins from MD data, a fundamental task for developing novel drug therapies. We consider three benchmarks for performance evaluation: gene regulatory dynamics, epidermal growth factor receptor (EGFR) mutation dynamics, and G protein-coupled receptor (GPCR) dynamics. The contributions of our paper are as follows:

- We develop Deep Signature, the first computationally efficient framework that characterizes the complex interatomic interactive dynamics of large-scale molecules.
- We theoretically demonstrate that our approach preserves symmetry under several geometrical transforms of atomic coordinates in 3D conformational space. Additionally, our method remains invariant under time reparameterization.
- We provide empirical results showing that our Deep Signature model achieves superior performance compared to other baseline methods on gene regulatory dynamics, EGFR mutation dynamics, and GPCR dynamics.

## 2 RELATED WORKS

Molecular Representation. Encoding essential structural characteristics and biochemical properties into molecular representations is a long-standing research field in molecular biology, with wide applications in drug discovery processes including virtual screening, similarity-based compound searches, target molecule ranking, and drug toxicity prediction (Li et al., 2024). One of the most widely used categories of molecular representation is two-dimensional (2D) fingerprints, which extract substructures, topological routes, and circles solely from molecular connection tables. Owing to their ease of generation and use, these 2D fingerprints are still extensively used as input for machine learning algorithms in modern drug discovery applications (Gao et al., 2020).
In recent years, there has been a surge in the development of 3D structure-based molecular representations, which characterize the interatomic interplay in 3D conformational space in a much more fine-grained way. Surface modeling-based approaches leverage mathematical models to restore protein surfaces and encode the geometrical and chemical features of surface regions around binding sites into representations, exhibiting great potential in automatic drug discovery (Gainza et al., 2020a; Zhu et al., 2021). In addition to surface-based representations, significant efforts have been devoted to learning accurate deep learning force fields for accelerating MD simulation (Schütt et al., 2017a; Batatia et al., 2022; Batzner et al., 2022). While these methods can be adapted for molecular property prediction, they are limited to generating a representation for a static frame, and are thus not suitable for our tasks, where inter-frame interactions across timescales are crucial for understanding protein function.

The current investigation into characterizing molecular dynamics, especially interatomic interaction dynamics, remains limited, with only a few studies close to ours. Among them, the methods of Endo et al. (2019), Yasuda et al. (2022), and Mustali et al. (2023) are unsupervised: they build local dynamics ensembles for pre-specified atoms and inspect each atom's contribution independently. Li et al. (2022) convert MD conformations into images and apply convolutional neural networks to identify diverse active states. Sun et al. (2023) model protein dynamics by representing protein surfaces with implicit neural networks, without requiring explicit surface representations. Nevertheless, these approaches cannot capture subtle interatomic interactions along atomic pathways for dissecting protein function, and the complexity of their representations is not independent of the number of time stamps.

Coarse Graining (CG).
CG is a widely adopted technique that aims to preserve the crucial characteristics and dynamics of a molecular system. It does so by grouping sets of atoms into CG beads, thereby enabling high-throughput MD simulations over larger time and length scales. Existing CG methods can be broadly categorized into two types: approaches based on chemical and physical intuition, and machine learning-based approaches. Methods of the first type construct the CG mapping by incorporating various biochemical properties derived from expert knowledge, for example, mapping each elaborately constructed cluster of four heavy (non-hydrogen) atoms into a single CG bead (Marrink & Tieleman, 2013), or simply assigning one CG bead centered at the α-carbon of each amino acid (Ingólfsson et al., 2014). Machine learning-based methods can rapidly learn accurate potential energy functions for reduced structures of MD by training on large databases. Recent advancements in this area include multiscale coarse graining that maximizes a variational force-matching score (Wang & Gómez-Bombarelli, 2019), relative entropy minimization (Thaler et al., 2022), and spectral graph approaches that account for the structural topologies of proteins (Webb et al., 2018; Li et al., 2020b). However, their transferability to unseen molecules remains questionable, and their capability to represent complex macromolecular systems without increasing dimension and complexity is still underexplored (Khot et al., 2019).

## 3 METHODOLOGY

### 3.1 PROBLEM FORMULATION

Consider various molecular systems $S^{(k)}$ that are distinct in their molecular behaviors. An MD simulation trajectory on $S^{(k)}$ provides the trajectories of all $N_k$ atoms constituting the molecules, and can be represented as a sequence of snapshots $X^{(k)}_{1:T_k} = \{X^{(k)}_1, X^{(k)}_2, \dots, X^{(k)}_{T_k}\}$. Here, $X^{(k)}_t \in \mathbb{R}^{N_k \times 3}$ indicates the atomic positions in 3D conformational space at time step $t$ for $t \in \{1, \dots, T_k\}$, and $T_k$ is the total number of frames. To describe the structure of the molecules within $S^{(k)}$, we define a molecular graph $G^{(k)} = \{E^{(k)}, V^{(k)}\}$ with $|V^{(k)}| = N_k$, where $V^{(k)}$ is the node set corresponding to the atoms and $E^{(k)}$ is the edge set corresponding to chemical bonds. The adjacency matrix of $G^{(k)}$ is denoted by $A^{(k)} \in \mathbb{R}^{N_k \times N_k}$, with $A^{(k)}_{i,j} = 1$ if $v_i, v_j \in V^{(k)}$ and $(v_i, v_j) \in E^{(k)}$. For the molecular property prediction task, we have access to MD trajectories from $K$ molecular systems, each endowed with a property label, denoted as $\{(X^{(k)}_{1:T_k}, y^{(k)})\}_{k=1}^{K}$. The objective is to train algorithms that accurately predict the property label when provided with a previously unseen MD trajectory $X_{1:T_k}$.

### 3.2 DEEP SIGNATURE

Our proposed method, referred to as Deep Signature, consists of a deep spectral clustering module that uses GNNs to extract coarse-grained dynamics from raw MD trajectories, a path signature transform module that collects iterated integrals to characterize interatomic interactions along pathways, and a classifier to enable property prediction. The overall architecture is illustrated in Fig. 1.

Figure 1: An overview of our proposed Deep Signature method. (Panel labels: 3D conformational space; Deep Spectral Clustering Module; CG conformational space; Path Signature Transform Module; path signature features.)

Deep spectral clustering with GNNs. Given the MD trajectory $X_{1:T}$ and molecular graph $G$ of a molecular system $S$, we start by extracting the reduced trajectory $\tilde{X}_{1:T}$ by coarse graining $X_{1:T}$ using deep spectral clustering. Specifically, we first obtain node representations via GNN layers as

$$H^{l} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{l-1} W^{l-1}_{\mathrm{GNN}}\right), \tag{1}$$

where $H^{l}$ denotes the node feature matrix at the $l$-th layer, $H^{0} = X_{1:T}$, $\tilde{A} = A + I$ is the adjacency matrix $A$ plus the identity matrix $I$, $\tilde{D}$ is the degree matrix of $\tilde{A}$, $W^{l-1}_{\mathrm{GNN}}$ are the learnable parameters of the GNN, and $\sigma$ is a nonlinear activation function.
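To make the propagation rule in Eq. (1) concrete, the following is a minimal NumPy sketch of a single layer, not the authors' implementation. It assumes ReLU as the activation $\sigma$, a single frame of 3D coordinates as the input features, and randomly initialized weights; the helper names `normalized_adjacency` and `gcn_layer` and the 4-atom chain are hypothetical.

```python
import numpy as np

def normalized_adjacency(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a symmetric 0/1 adjacency matrix A."""
    A_tilde = A + np.eye(A.shape[0])      # add self-loops: A~ = A + I
    deg = A_tilde.sum(axis=1)             # degrees of A~ (at least 1 due to self-loops)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, W):
    """One propagation step of Eq. (1), with ReLU assumed as the activation."""
    return np.maximum(A_hat @ H @ W, 0.0)

# Toy data (hypothetical): a chain of 4 atoms, one frame of 3D coordinates.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 3))               # H^0: atomic positions
W0 = rng.normal(size=(3, 8))              # stand-in for learnable GNN weights

A_hat = normalized_adjacency(A)
H1 = gcn_layer(A_hat, X, W0)              # node features after one layer, shape (4, 8)
```

Because of the added self-loops, every node has nonzero degree in $\tilde{A}$, so the inverse square root is always well defined.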
We then compute the cluster assignment matrix $Q$ for the nodes using a multi-layer perceptron (MLP) with a softmax on the output layer,

$$Q = \mathrm{Softmax}(W_{\mathrm{MLP}} H^{l} + b), \tag{2}$$

where $W_{\mathrm{MLP}}$ and $b$ are the trainable parameters of the MLP. For the assignment matrix $Q \in \mathbb{R}^{N \times M}$, where $M < N$ specifies the number of clusters, each row gives the corresponding node's probabilities of belonging to each cluster. We employ the normalized-cut relaxation (Bianchi et al., 2020) as the clustering objective to be minimized,

$$\mathcal{L}_u = -\frac{\mathrm{Tr}(Q^\top \tilde{A} Q)}{\mathrm{Tr}(Q^\top \tilde{D} Q)} + \left\lVert \frac{Q^\top Q}{\lVert Q^\top Q \rVert_F} - \frac{I_M}{\sqrt{M}} \right\rVert_F, \tag{3}$$

where the first term promotes strongly connected components to be clustered together, while the second term encourages the cluster assignments to be orthogonal and of similar size. Using $Q$ from Eq. (2) for clustering, the corresponding reduced feature embedding matrix $H'$ and adjacency matrix $A'$ can be derived as

$$H' = Q^\top H; \quad A' = Q^\top \tilde{A} Q. \tag{4}$$

Since our model takes the sequence $X_{1:T}$ as input, $H'$ inherently maintains the temporal order in the form of $H'_{1:T}$. We use another MLP with parameters $W'_{\mathrm{MLP}}$ and $b'$ to map $H'_{1:T}$ back into a reduced conformational space. The resulting dynamics and the ground-truth dynamics are expressed as

$$\tilde{X}_{1:T} = W'_{\mathrm{MLP}} H'_{1:T} + b'; \quad X^{\mathrm{pool}}_{1:T} = Q^\top X_{1:T}. \tag{5}$$

To ensure the fidelity of the coarse-grained dynamics to the original high-dimensional system, we further introduce a temporal consistency constraint, defined through a mean absolute error loss of the form

$$\frac{1}{T} \sum_{i=1}^{T} \left\lVert \tilde{X}_i - X^{\mathrm{pool}}_i \right\rVert_1. \tag{6}$$

Path signature transform. We now adopt the path signature method to characterize the interatomic temporal interactions among the coarse-grained dynamics $\tilde{X}_{1:T} \in \mathbb{R}^{T \times 3M}$. The basic idea of the path signature is that, for a multidimensional continuous path, we can construct an ordered set consisting of all possible path integrals, and combinations of path integrals across the individual dimensions, as a comprehensive representation of this path (Lyons, 2014).
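As an illustration of what "collecting iterated integrals" means in practice, the sketch below computes the first two signature levels of a piecewise-linear path with plain NumPy. It is a minimal example under stated assumptions rather than the authors' implementation: the function name `signature_depth2` and the random toy path are hypothetical, and only depth 2 is covered. It relies on two standard facts: a single linear segment with increment $\delta$ has level-1 term $\delta$ and level-2 term $\delta \otimes \delta / 2$, and Chen's identity describes how signatures combine when segments are concatenated.

```python
import numpy as np

def signature_depth2(path):
    """Depth-2 signature of the piecewise-linear path through the rows of `path`.

    path: array of shape (T, d). Returns (level1, level2) with shapes (d,) and (d, d).
    """
    increments = np.diff(path, axis=0)    # (T-1, d) increments of the linear segments
    d = path.shape[1]
    level1 = np.zeros(d)                  # first-order integrals: total displacement
    level2 = np.zeros((d, d))             # second-order iterated integrals
    for delta in increments:
        # Chen's identity for appending one linear segment: the cross term is
        # outer(previous level 1, delta), and the segment itself adds outer(delta, delta) / 2.
        level2 += np.outer(level1, delta) + 0.5 * np.outer(delta, delta)
        level1 += delta
    return level1, level2

# Toy coarse-grained trajectory (hypothetical): T = 6 frames, 3 coordinates.
rng = np.random.default_rng(0)
path = rng.normal(size=(6, 3)).cumsum(axis=0)
s1, s2 = signature_depth2(path)
```

Since the computation only uses increments, shifting the whole path by a constant leaves both levels unchanged, and inserting extra sample points along an existing segment does not alter the result either. These are small-scale instances of the translation and time-reparameterization invariances discussed in the paper.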
For a more precise definition, consider our coarse-grained trajectory $\tilde{X}_{1:T}$ with $\tilde{X}_t = (\tilde{X}^1_t, \tilde{X}^2_t, \dots, \tilde{X}^{3M}_t)$ for $t \in \{1, \dots, T\}$. Let $\widehat{X} : [1, T] \to \mathbb{R}^{3M}$ be a piecewise linear interpolation of $\tilde{X}_{1:T}$ such that $\widehat{X}_t = \tilde{X}_t$ for any $t \in \{1, \dots, T\}$, and let $[r_i, r_{i+1}]$ be a sub-time interval corresponding to a time partition of $[1, T]$ with $1 = r_1 < r_2 < \cdots < r_\tau = T$. The depth-$D$ signature transform of $\tilde{X}$ over the interval $[r_i, r_{i+1}]$ is the vector defined as

$$\mathrm{Sig}^{D}_{r_i, r_{i+1}}(\tilde{X}) = \left(1, \left\{S(\tilde{X})^{j}_{r_i, r_{i+1}}\right\}_{j=1}^{3M}, \dots, \left\{S(\tilde{X})^{j_1, \dots, j_d}_{r_i, r_{i+1}}\right\}_{j_1, \dots, j_d = 1}^{3M}\right),$$

where for any $(j_1, \dots, j_d) \in \{1, \dots, 3M\}^{d}$, S( X)j1,...,jd ri,ri+1 = Z Z 1