# Deep Generative Markov State Models

Hao Wu 1,2, Andreas Mardt 1, Luca Pasquali 1, and Frank Noé 1

1 Dept. of Mathematics and Computer Science, Freie Universität Berlin, 14195 Berlin, Germany
2 School of Mathematical Sciences, Tongji University, Shanghai, 200092, P.R. China

H. Wu, A. Mardt and L. Pasquali contributed equally to this work. Author to whom correspondence should be addressed: frank.noe@fu-berlin.de.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

We propose a deep generative Markov State Model (Deep Gen MSM) learning framework for inference of metastable dynamical systems and prediction of trajectories. After unsupervised training on time-series data, the model contains (i) a probabilistic encoder that maps from the high-dimensional configuration space to a small-sized vector indicating the membership to metastable (long-lived) states, (ii) a Markov chain that governs the transitions between metastable states and facilitates analysis of the long-time dynamics, and (iii) a generative part that samples the conditional distribution of configurations in the next time step. The model can be operated in a recursive fashion to generate trajectories, to predict the system evolution from a defined starting state, and to propose new configurations. The Deep Gen MSM is demonstrated to provide accurate estimates of the long-time kinetics and to generate valid distributions for molecular dynamics (MD) benchmark systems. Remarkably, we show that Deep Gen MSMs are able to make long time steps in molecular configuration space and generate physically realistic structures in regions that were not seen in the training data.

1 Introduction

Complex dynamical systems that exhibit events on vastly different timescales are ubiquitous in science and engineering. For example, molecular dynamics (MD) of biomolecules involve fast vibrations on timescales of $10^{-15}$ seconds, while their biological function is often related to rare switching events between long-lived states on timescales of $10^{-3}$ seconds or longer. In weather and climate systems, local fluctuations in temperature and pressure fields occur within minutes or hours, while global changes are often subject to periodic motion and drift over years or decades. Primary goals in the analysis of complex dynamical systems include:

1. Deriving an interpretable model of the essential long-time dynamical properties of these systems, such as the stationary behavior or lifetimes/cycle times of slow processes.
2. Simulating the dynamical system, e.g., to predict the system's future evolution or to sample previously unobserved system configurations.

A state-of-the-art approach for the first goal is to learn a Markovian model from time-series data, which is theoretically justified by the fact that physical systems are inherently Markovian. In practice, the long-time behavior of dynamical systems can be accurately described by a Markovian model when suitable features or variables are used, and when the time resolution of the model is sufficiently coarse such that the time evolution can be represented with a manageable number of dynamical modes [24, 11]. In stochastic dynamical systems, such as MD simulation, variants of Markov state models (MSMs) are commonly used [3, 25, 22]. In MSMs, the configuration space is discretized, e.g., using a clustering method, and the dynamics between clusters are then described by a matrix of transition probabilities [22].
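For readers unfamiliar with this classical pipeline, a minimal sketch (our own illustration, with placeholder state count and lag time, not the estimators used later in this paper) is:

```python
import numpy as np
from sklearn.cluster import KMeans

def estimate_msm(traj, n_states=4, lag=5):
    """Crude standard MSM: cluster configurations (traj has shape
    (n_frames, n_features)), count transitions at lag tau, and
    row-normalize the count matrix into a transition matrix."""
    labels = KMeans(n_clusters=n_states, n_init=10).fit_predict(traj)
    counts = np.zeros((n_states, n_states))
    for s, s_lag in zip(labels[:-lag], labels[lag:]):
        counts[s, s_lag] += 1
    # each row sums to one (assuming every state is visited at least once)
    return counts / counts.sum(axis=1, keepdims=True)
```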
The analogous approach for deterministic dynamical systems such as complex fluid flows is called Koopman analysis, where time propagation is approximated by a linear model in a suitable function-space transformation of the flow variables [16, 26, 29, 4]. The recently proposed VAMPnets learn an optimal feature transformation from the full configuration space to a low-dimensional latent space in which the Markovian model is built by variational optimization of a neural network [15]. When the VAMPnet has a probabilistic output (e.g., a SoftMax layer), the Markovian model conserves probability, but it is not guaranteed to be a valid transition probability matrix with nonnegative elements. A related work for deterministic dynamical systems is extended dynamic mode decomposition with dictionary learning [13]. All of these methods are purely analytic, i.e., they learn a reduced model of the dynamical system underlying the observed time series, but they lack a generative part that could be used to sample new time series in the high-dimensional configuration space.

Recently, several learning frameworks for dynamical systems have been proposed that partially address the second goal by including a decoder from the latent space back to the space of input features. Most of these methods primarily aim at obtaining a low-dimensional latent space that encodes the long-time behavior of the system, and the decoder takes the role of defining or regularizing the learning problem [30, 8, 14, 19, 23]. In particular, none of these models have demonstrated the ability to generate viable structures in the high-dimensional configuration space, such as a molecular structure with realistic atom positions in 3D. Finally, some of these models learn a linear model of the long-timescale dynamics [14, 19], but none of them provide a probabilistic dynamical model that can be employed in a Bayesian framework. Learning the correct long-time dynamical behavior with a generative dynamical model is difficult, as demonstrated in [8].

Here, we address these gaps by providing a deep learning framework that learns, based on time-series data, the following components:

1. Probabilistic encodings of the input configuration to a low-dimensional latent space by neural networks, $\mathbf{x}_t \mapsto \boldsymbol{\chi}(\mathbf{x}_t)$ (a minimal sketch of such an encoder follows this list).
2. A true transition probability matrix $\mathbf{K}$ describing the system dynamics in latent space for a fixed time lag $\tau$: $\mathbb{E}\left[\boldsymbol{\chi}(\mathbf{x}_{t+\tau})\right] = \mathbb{E}\left[\mathbf{K}^{\top}(\tau)\,\boldsymbol{\chi}(\mathbf{x}_t)\right]$. The probabilistic nature of the method allows us to train it with likelihood maximization and to embed it into a Bayesian framework. In our benchmarks, the transition probability matrix approximates the long-time behavior of the underlying dynamical system with high accuracy.
3. A generative model from latent vectors back to configurations, allowing us to sample the transition density $P(\mathbf{x}_{t+\tau} \mid \mathbf{x}_t)$ and thus propagate the model in configuration space. We show for the first time that this allows us to sample genuinely new and valid molecular structures that have not been included in the training data. This makes the method promising for performing active learning in MD [2, 21], and for predicting the future evolution of the system in other contexts.
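Component 1 can be realized by a small network whose SoftMax output yields the state memberships; the following is a minimal sketch, with layer sizes and activations chosen for illustration rather than taken from the paper:

```python
import torch
import torch.nn as nn

class MembershipEncoder(nn.Module):
    """Maps a configuration x in R^d to membership probabilities chi(x) in R^m
    (nonnegative, summing to one) via a SoftMax output layer."""
    def __init__(self, dim_in, n_states, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_in, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, n_states),
        )

    def forward(self, x):
        return torch.softmax(self.net(x), dim=-1)
```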
2 Deep Generative Markov State Models

Given two configurations $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$, where $\mathbb{R}^d$ is a potentially high-dimensional space of system configurations (e.g., the positions of atoms in a molecular system), Markovian dynamics are defined by the transition density $P(\mathbf{x}_{t+\tau} = \mathbf{y} \mid \mathbf{x}_t = \mathbf{x})$. Here we represent the transition density between $m$ states in the following form (Fig. 1):

$$P(\mathbf{x}_{t+\tau} = \mathbf{y} \mid \mathbf{x}_t = \mathbf{x}) = \boldsymbol{\chi}(\mathbf{x})^{\top}\mathbf{q}(\mathbf{y};\tau) = \sum_{i=1}^{m}\chi_i(\mathbf{x})\,q_i(\mathbf{y};\tau). \qquad (1)$$

Here, $\boldsymbol{\chi}(\mathbf{x}) = [\chi_1(\mathbf{x}), \ldots, \chi_m(\mathbf{x})]$ represents the probabilities of configuration $\mathbf{x}$ to be in a metastable (long-lived) state $i$:

$$\chi_i(\mathbf{x}) = P(\mathbf{x}_t \in \text{state } i \mid \mathbf{x}_t = \mathbf{x}).$$

Consequently, these functions are nonnegative ($\chi_i(\mathbf{x}) \ge 0 \;\forall\, \mathbf{x}$) and sum up to one ($\sum_i \chi_i(\mathbf{x}) = 1 \;\forall\, \mathbf{x}$). The functions $\boldsymbol{\chi}(\mathbf{x})$ can, e.g., be represented by a neural network mapping from $\mathbb{R}^d$ to $\mathbb{R}^m$ with a SoftMax output layer.

Figure 1: Schematic of Deep Generative Markov State Models (Deep Gen MSMs) and the rewiring trick. The function χ, here represented by neural networks, maps the time-lagged input configurations to metastable states whose dynamics are governed by a transition probability matrix K. The generator samples the distribution x_{t+τ} ~ q by employing a generative network that can produce novel configurations (or by resampling x_{t+τ} in Deep Resample MSMs). The rewiring trick consists of reconnecting the probabilistic networks q and χ such that the time propagation in latent space can be sampled: from the latent state χ(x_t), we generate a time-lagged configuration x_{t+τ} using q, and then transform it back to the latent space, χ(x_{t+τ}). Each application of the rewired network samples the latent-space transitions, thus providing the statistics to estimate the Markov model transition matrix K(τ), which is needed for analysis. This trick allows K(τ) to be estimated with desired constraints, such as detailed balance.

Additionally, we have the probability densities

$$q_i(\mathbf{y};\tau) = P(\mathbf{x}_{t+\tau} = \mathbf{y} \mid \mathbf{x}_t \in \text{state } i)$$

that define the probability density of the system to land at configuration $\mathbf{y}$ after making one time step. We thus briefly call them "landing densities".

2.1 Kinetics

Before addressing how to estimate $\boldsymbol{\chi}$ and $\mathbf{q}$ from data, we describe how to perform the standard calculations and analyses that are common in the Markov modeling field for a model of the form (1). In Markov modeling, one is typically interested in the kinetics of the system, i.e., the long-time behavior of the dynamics. This is captured by the elements of the transition matrix $\mathbf{K} = [k_{ij}]$ between metastable states. $\mathbf{K}$ can be computed as follows: $k_{ij}$ is the product of the probability density to jump from metastable state $i$ to a configuration $\mathbf{y}$ and the probability that this configuration belongs to metastable state $j$, integrated over the whole configuration space:

$$k_{ij} = \int q_i(\mathbf{y};\tau)\,\chi_j(\mathbf{y})\,\mathrm{d}\mathbf{y}. \qquad (2)$$

Practically, this calculation is implemented via the rewiring trick shown in Fig. 1, where the configuration-space integral is approximated by drawing samples from the generator. The estimated probabilistic functions $\mathbf{q}$ and $\boldsymbol{\chi}$ define, by construction, a valid transition probability matrix $\mathbf{K}$, i.e., $k_{ij} \ge 0$ and $\sum_j k_{ij} = 1$. As a result, the proposed models have a structural advantage over other high-accuracy Markov state modeling approaches that define metastable states in a fuzzy or probabilistic manner but do not guarantee a valid transition matrix [12, 15] (see Supplementary Material for more details). The stationary (equilibrium) probabilities of the metastable states are given by the vector $\boldsymbol{\pi} = [\pi_i]$ that solves the eigenvalue problem with eigenvalue $\lambda_1 = 1$:

$$\boldsymbol{\pi} = \mathbf{K}^{\top}\boldsymbol{\pi}, \qquad (3)$$

and the stationary (equilibrium) distribution in configuration space is given by:

$$\mu(\mathbf{y}) = \sum_i \pi_i\,q_i(\mathbf{y};\tau) = \boldsymbol{\pi}^{\top}\mathbf{q}(\mathbf{y};\tau). \qquad (4)$$
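A minimal sketch of these analysis steps, where `chi` and `sample_q` are hypothetical callables standing in for a trained membership network and a sampler of the landing densities $q_i$, could look as follows:

```python
import numpy as np

def estimate_K(chi, sample_q, n_states, n_samples=10_000):
    """Rewiring trick (Eq. 2): k_ij = E_{y ~ q_i}[chi_j(y)], approximated
    by sampling the landing density of each metastable state i."""
    K = np.zeros((n_states, n_states))
    for i in range(n_states):
        y = sample_q(i, n_samples)      # draws from q_i(y; tau)
        K[i] = chi(y).mean(axis=0)      # average membership of the samples
    return K

def stationary_distribution(K):
    """Solve pi = K^T pi (Eq. 3): left eigenvector of K for eigenvalue 1."""
    evals, evecs = np.linalg.eig(K.T)
    pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return pi / pi.sum()
```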
Finally, for a fixed definition of states via $\boldsymbol{\chi}$, the self-consistency of Markov models may be tested using the Chapman-Kolmogorov equation

$$\mathbf{K}^{n}(\tau) \approx \mathbf{K}(n\tau), \qquad (5)$$

which involves estimating the functions $\mathbf{q}(\mathbf{y}; n\tau)$ at different lag times $n\tau$ and comparing the resulting transition matrices with the $n$th power of the transition matrix obtained at lag time $\tau$. A consequence of Eq. (5) is that the relaxation timescales

$$t_i(\tau) = -\frac{\tau}{\log|\lambda_i(\tau)|} \qquad (6)$$

are independent of the lag time $\tau$ at which $\mathbf{K}$ is estimated [27]. Here, $\lambda_i$ with $i = 2, \ldots, m$ are the nontrivial eigenvalues of $\mathbf{K}$.

2.2 Maximum Likelihood (ML) learning of Deep Resample MSMs

Given trajectories $\{\mathbf{x}_t\}_{t=1,\ldots,T}$, how do we estimate the membership probabilities $\boldsymbol{\chi}(\mathbf{x})$, and how do we learn and sample the landing densities $\mathbf{q}(\mathbf{y})$? We start with a model in which the $q_i(\mathbf{y})$ are directly derived from the observed (empirical) data, i.e., they are point densities on the input configurations $\{\mathbf{x}_t\}$, given by:

$$q_i(\mathbf{y}) = \hat{\gamma}_i^{-1}\,\gamma_i(\mathbf{y})\,\rho(\mathbf{y}). \qquad (7)$$

Here, $\rho(\mathbf{y})$ is the empirical distribution, which in the case of finite sample size is simply $\rho(\mathbf{y}) = \frac{1}{T-\tau}\sum_{t=1}^{T-\tau}\delta(\mathbf{y}-\mathbf{x}_{t+\tau})$, and $\gamma_i(\mathbf{y})$ is a trainable weighting function. The normalization factor $\hat{\gamma}_i = \frac{1}{T-\tau}\sum_{t=1}^{T-\tau}\gamma_i(\mathbf{x}_{t+\tau}) = \mathbb{E}_{\mathbf{y}\sim\rho}\left[\gamma_i(\mathbf{y})\right]$ ensures $\int q_i(\mathbf{y})\,\mathrm{d}\mathbf{y} = 1$. Now we can optimize $\chi_i$ and $\gamma_i$ by maximizing the likelihood (ML) of generating the pairs $(\mathbf{x}_t, \mathbf{x}_{t+\tau})$ observed in the data. Up to an additive constant, the log-likelihood is given by

$$LL = \sum_{t=1}^{T-\tau}\log\left(\sum_{i=1}^{m}\chi_i(\mathbf{x}_t)\,\hat{\gamma}_i^{-1}\,\gamma_i(\mathbf{x}_{t+\tau})\right) \qquad (8)$$

and is maximized to train a deep MSM with the structure shown in Fig. 1. Alternatively, we can optimize $\chi_i$ and $\gamma_i$ using the Variational Approach for Markov Processes (VAMP) [31]. However, we found the ML approach to perform significantly better in our tests, and we thus include the VAMP training approach only in the Supplementary Material without elaborating on it further. Given the networks $\boldsymbol{\chi}$ and $\boldsymbol{\gamma}$, we compute $\mathbf{q}$ from Eq. (7). Employing the rewiring trick shown in Fig. 1 then amounts to computing the transition matrix by a simple average over all configurations:

$$k_{ij} = \frac{1}{T-\tau}\sum_{t=1}^{T-\tau}\hat{\gamma}_i^{-1}\,\gamma_i(\mathbf{x}_{t+\tau})\,\chi_j(\mathbf{x}_{t+\tau}). \qquad (9)$$

The deep MSMs described in this section are neural-network generalizations of traditional MSMs: they learn a mapping from configurations to metastable states, aiming at a good approximation of the kinetics of the underlying dynamical system by means of the transition matrix $\mathbf{K}$. However, since the landing distribution $\mathbf{q}$ in these methods is derived from the empirical distribution (7), any generated trajectory will only resample configurations from the input data. To highlight this property, we will refer to the deep MSMs trained with the present methodology as Deep Resample MSMs.

2.3 Energy Distance learning of Deep Gen MSMs

In contrast to Deep Resample MSMs, we now want to learn deep generative MSMs (Deep Gen MSMs), which can be used to generate trajectories that do not only resample from the input data, but can produce genuinely new configurations. To this end, we train a generative model to mimic the empirical distribution $q_i(\mathbf{y})$:

$$\mathbf{y} = G(\mathbf{e}_i, \boldsymbol{\epsilon}), \qquad (10)$$

where the vector $\mathbf{e}_i \in \mathbb{R}^m$ is a one-hot encoding of the metastable state, and $\boldsymbol{\epsilon}$ is an i.i.d. random vector whose components are sampled from a standard normal distribution. We train the generator $G$ by minimizing the conditional Energy Distance (ED), whose choice is motivated in the Supplementary Material. The standard ED, introduced in [28], is a metric between the distributions of random vectors, defined as

$$D_E\left(P(\mathbf{x}), P(\mathbf{y})\right) = \mathbb{E}\left[2\,\|\mathbf{x}-\mathbf{y}\| - \|\mathbf{x}-\mathbf{x}'\| - \|\mathbf{y}-\mathbf{y}'\|\right] \qquad (11)$$

for two real-valued random vectors $\mathbf{x}$ and $\mathbf{y}$, where $\mathbf{x}, \mathbf{x}'$ and $\mathbf{y}, \mathbf{y}'$ are independently distributed according to the distributions of $\mathbf{x}$ and $\mathbf{y}$, respectively.
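A minimal empirical estimator of Eq. (11) for two batches of samples might look as follows (our own sketch; the batched, conditional loss actually used for training is given next):

```python
import torch

def energy_distance(x, y):
    """Empirical energy distance between two sample batches x, y of shape
    (batch, dim): E[2||x - y|| - ||x - x'|| - ||y - y'||] (Eq. 11).
    The zero self-distances on the diagonal of the x-x and y-y terms
    introduce a small finite-sample bias that shrinks with batch size."""
    d_xy = torch.cdist(x, y).mean()
    d_xx = torch.cdist(x, x).mean()
    d_yy = torch.cdist(y, y).mean()
    return 2.0 * d_xy - d_xx - d_yy
```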
Based on this metric, we introduce the conditional energy distance between the transition density of the system and that of the generative model:

$$\mathcal{D} \equiv \mathbb{E}\left[D_E\left(P(\mathbf{x}_{t+\tau} \mid \mathbf{x}_t),\, P(\hat{\mathbf{x}}_{t+\tau} \mid \mathbf{x}_t)\right) \,\middle|\, \mathbf{x}_t\right] = \mathbb{E}\left[2\,\|\hat{\mathbf{x}}_{t+\tau}-\mathbf{x}_{t+\tau}\| - \|\hat{\mathbf{x}}_{t+\tau}-\hat{\mathbf{x}}'_{t+\tau}\| - \|\mathbf{x}_{t+\tau}-\mathbf{x}'_{t+\tau}\|\right]. \qquad (12)$$

Here $\mathbf{x}_{t+\tau}$ and $\mathbf{x}'_{t+\tau}$ are distributed according to the transition density for given $\mathbf{x}_t$, and $\hat{\mathbf{x}}_{t+\tau}, \hat{\mathbf{x}}'_{t+\tau}$ are independent outputs of the generative model conditioned on $\mathbf{x}_t$. Implementing the expectation value with an empirical average results in an estimate for $\mathcal{D}$ that is unbiased, up to an additive constant. We train $G$ to minimize $\mathcal{D}$; see the Supplementary Material for detailed derivations and the training algorithm used. After training, the transition matrix can be obtained via the rewiring trick (Fig. 1), where the configuration-space integral is approximated by drawing samples from the generator:

$$[\mathbf{K}]_{ij} = \mathbb{E}_{\boldsymbol{\epsilon}}\left[\chi_j\left(G(\mathbf{e}_i, \boldsymbol{\epsilon})\right)\right]. \qquad (13)$$

Below we establish our framework by applying it to two well-defined benchmark systems that exhibit metastable stochastic dynamics. We validate the stationary distribution and kinetics by computing $\boldsymbol{\chi}(\mathbf{x})$, $\mathbf{q}(\mathbf{y})$, the stationary distribution $\mu(\mathbf{y})$, and the relaxation timescales $t_i(\tau)$, and comparing them with reference solutions. We also test the ability of Deep Gen MSMs to generate physically valid molecular configurations. The networks were implemented using PyTorch [20] and TensorFlow [6]. For the full code and all details about the neural network architecture, hyper-parameters and training algorithm, please refer to https://github.com/markovmodel/deep_gen_msm.

3.1 Diffusion in Prinz potential

We first apply our framework to the time-discretized diffusion process

$$x_{t+\Delta t} = x_t - \Delta t\,\nabla V(x_t) + \sqrt{2\Delta t}\,\eta_t$$

with $\Delta t = 0.01$ in the Prinz potential $V(x)$ introduced in [22] (Fig. 2a). For this system we know exact results for benchmarking: the stationary distribution and relaxation timescales (black lines in Fig. 2b,c) and the transition density (Fig. 2d). We simulate trajectories of lengths 250,000 and 125,000 time steps for training and validation, respectively. For all methods, we repeat the data generation and model estimation process 10 times and compute means and standard deviations of all quantities of interest, which thus represent the mean and variance of the estimators. The functions $\boldsymbol{\chi}$, $\boldsymbol{\gamma}$ and $G$ are represented by densely connected neural networks. Details of the architecture and the training procedure can be found in the Supplementary Information.

We compare Deep Resample MSMs and Deep Gen MSMs with standard MSMs using four or ten states obtained with k-means clustering. Note that standard MSMs do not directly operate on configuration space. When using an MSM, the transition density (Eq. 1) is thus simulated as $\mathbf{x}_t \rightarrow \boldsymbol{\chi}(\mathbf{x}_t) \rightarrow i \rightarrow j \sim K_{i,\cdot} \rightarrow \mathbf{x}_{t+\tau} \sim \rho_j(\mathbf{y})$, i.e., we find the cluster $i$ associated with a configuration $\mathbf{x}_t$ (which is deterministic for regular MSMs), then sample the cluster $j$ at the next time step, and sample from the conditional distribution of configurations in cluster $j$ to generate $\mathbf{x}_{t+\tau}$.
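A minimal sketch of this resampling scheme for a standard MSM (our own illustration; the cluster centers, transition matrix K, and per-cluster configuration pools are assumed to be available from the estimation step):

```python
import numpy as np

def msm_step(x_t, cluster_centers, K, cluster_pool, rng=None):
    """One time-lagged step with a standard MSM: assign x_t to its cluster i,
    draw the next cluster j from the transition-matrix row K[i], and resample
    a stored configuration from cluster j as x_{t+tau}."""
    rng = rng or np.random.default_rng()
    i = np.argmin(np.linalg.norm(cluster_centers - x_t, axis=1))   # deterministic assignment
    j = rng.choice(len(K), p=K[i])                                 # j ~ K_{i,.}
    return cluster_pool[j][rng.integers(len(cluster_pool[j]))]     # x_{t+tau} ~ rho_j
```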
Both Deep Resample MSMs trained with the ML method and standard MSMs reproduce the stationary distribution within statistical uncertainty (Fig. 2b). For long lag times $\tau$, all methods converge from below to the correct relaxation timescales (Fig. 2c), as expected from theory [22, 18]. When using equally many states (here: four), the Deep Resample MSM has a much lower bias in the relaxation timescales than the standard MSM. This is expected from approximation theory, as the Deep Resample MSM represents the four metastable states with meaningful, smooth membership functions $\chi(\mathbf{x})$, while the four-state MSM cuts the memberships hard at boundaries with low sample density (Supplementary Fig. 1). When increasing the number of metastable states, the bias of all estimators is reduced; an MSM with ten states is needed to perform approximately on par with a four-state Deep Resample MSM (Fig. 2c). All subsequent analyses use a lag time of $\tau = 5$.

Figure 2: Performance of deep versus standard MSMs for diffusion in the Prinz potential. (a) Potential energy as a function of position x. (b) Stationary distribution estimates of all methods with the exact distribution (black). (c) Implied timescales of the Prinz potential compared to the true ones (black line). (d) True transition density and approximations using the maximum likelihood (ML) Deep Resample MSM and four- and ten-state MSMs. (e) KL-divergence of the stationary and transition distributions with respect to the true ones for all presented methods (also Deep Gen MSM).

Figure 3: Performance of Deep Gen MSMs for diffusion in the Prinz potential. Comparison between the exact reference (black) and Deep Gen MSMs estimated using only the energy distance (ED) or combined ML-ED training. (a) Stationary distribution. (b-d) Transition densities. (e) Relaxation timescales.

The Deep Resample MSM generates a transition density that is very similar to the exact density, while the MSM transition densities are coarse-grained by virtue of the fact that $\chi(\mathbf{x})$ performs a hard clustering in an MSM (Fig. 2d). This impression is confirmed when computing the Kullback-Leibler divergence of the distributions (Fig. 2e).

Encouraged by the accurate results of Deep Resample MSMs, we now train Deep Gen MSMs, either by training both the $\chi$ and $\mathbf{q}$ networks by minimizing the energy distance (ED), or by taking $\chi$ from an ML-trained Deep Resample MSM and only training the $\mathbf{q}$ network by minimizing the energy distance (ML-ED). The stationary densities, relaxation timescales and transition densities can still be approximated in these settings, although the Deep Gen MSMs exhibit larger statistical fluctuations than the resampling MSMs (Fig. 3). ML-ED appears to perform slightly better than ED alone, likely because reusing $\chi$ from the ML training makes the problem of training the generator easier. For a one-dimensional example like the Prinz potential, learning a generative model does not provide any added value, as the distributions can be well approximated by the empirical distributions. The fact that we still obtain approximately correct stationary, kinetic and dynamical properties encourages us to apply Deep Gen MSMs to a higher-dimensional example, where the generation of configurations is a hard problem.

3.2 Alanine dipeptide

We use explicit-solvent MD simulations of Alanine dipeptide as a second example. Our aim is to learn stationary and kinetic properties, but especially to learn a generative model that produces genuinely novel but physically meaningful configurations. One 250 ns trajectory with a storage interval of 1 ps is used and split 80%/20% for training and validation; see [15] for details of the simulation setup. We characterize all structures by the three-dimensional Cartesian coordinates of the heavy atoms, resulting in a 30-dimensional configuration space.
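Extracting this 30-dimensional heavy-atom feature vector from an MD trajectory could be sketched with MDTraj as follows; the file names are placeholders and this snippet is not part of the paper's published code:

```python
import mdtraj as md

traj = md.load("alanine_dipeptide.xtc", top="alanine_dipeptide.pdb")  # placeholder files
heavy = [a.index for a in traj.topology.atoms if a.element.symbol != "H"]
# Cartesian coordinates of the 10 heavy atoms -> (n_frames, 30) feature matrix
features = traj.xyz[:, heavy, :].reshape(len(traj), -1)
```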
While we do not have exact results for Alanine dipeptide, the system is small enough and well enough sampled that high-quality estimates of stationary and kinetic properties can be obtained from a very fine MSM [22]. We therefore define an MSM built on 400 equally sized grid areas in the $(\phi, \psi)$-plane as a reference, at a lag time of $\tau = 25$ ps, which has been validated by established methods [22]. Neural network and training details are again found in the git repository and in the Supplementary Information. For comparison with deep MSMs, we build two standard MSMs following a state-of-the-art protocol: we transform input configurations with a kinetic map preserving 95% of the cumulative kinetic variance [17], followed by k-means clustering, where k = 6 and k = 100 are used.

Figure 4: Performance of Deep Resample MSMs and Deep Gen MSMs versus standard MSMs on the Alanine dipeptide simulation trajectory. (a) Data distribution and stationary distributions from the reference MSM, Deep Resample MSM, and Deep Gen MSM. (b) State classification by the Deep Resample MSM. (c) Relaxation timescales.

Deep Resample MSMs trained with the ML method approximate the stationary distribution very well (Fig. 4a). The reference MSM assigns a slightly lower weight to the lowest-populated state 6, but otherwise the data, the reference distribution and the deep MSM distribution are visually indistinguishable. The relaxation timescales estimated by a six-state Deep Resample MSM are significantly better than those of six-state standard MSMs. MSMs with 100 states perform similarly to the deep MSMs, but this comes at the cost of a model with a much larger latent space.

Finally, we test Deep Gen MSMs for Alanine dipeptide, where $\chi$ is trained with the ML method and the generator is then trained using ED (ML-ED). Simulating the Deep Gen MSM recursively results in a stationary distribution that is very similar to the reference distribution in states 1-4 with small $\phi$ values (Fig. 4a). States 5 and 6 with large $\phi$ values are captured, but their shapes and weights are somewhat distorted (Fig. 4a). The one-step transition densities predicted by the generator are of high quality for all states (Suppl. Fig. 2); thus the differences observed in the stationary distribution must come from small errors in the transitions between metastable states, which are observed only very rarely for states 5 and 6. These rare events result in poor training data for the generator. Nonetheless, the Deep Gen MSM approximates the kinetics well within the uncertainty, which is mostly due to estimator variance (Fig. 4c).

Now we ask whether Deep Gen MSMs can sample valid structures in the 30-dimensional configuration space, i.e., whether the placement of atoms is physically meaningful. As we generate configurations in Cartesian space, we first check whether the internal coordinates are physically viable by comparing all bond lengths and angles between real MD data and generated trajectories (Fig. 5). The true bond lengths and angles are almost perfectly Gaussian distributed, and we thus normalize them by shifting each distribution to a mean of 0 and scaling it to a standard deviation of 1, which causes all reference distributions to collapse onto a standard normal distribution (Fig. 5a,c). We normalize the generated distributions with the mean and standard deviation of the true data. Although there are clear differences (Fig. 5b,d), these distributions are very encouraging.
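This validation can be sketched as follows; the helper functions and atom index lists are our own illustration (not part of the paper's code), assuming Cartesian frames of shape (n_frames, n_atoms, 3):

```python
import numpy as np

def bond_lengths(xyz, pairs):
    """Bond lengths for all frames, for a list of bonded atom index pairs."""
    i, j = np.array(pairs).T
    return np.linalg.norm(xyz[:, i] - xyz[:, j], axis=-1)

def bond_angles(xyz, triplets):
    """Angles (radians) for atom triplets (i, j, k), with j the central atom."""
    i, j, k = np.array(triplets).T
    a, b = xyz[:, i] - xyz[:, j], xyz[:, k] - xyz[:, j]
    cos = (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def normalize_like_reference(generated, reference):
    """Shift/scale by mean and std of the true MD data, as in Fig. 5."""
    mu, sigma = reference.mean(axis=0), reference.std(axis=0)
    return (generated - mu) / sigma
```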
Bonds and angles are very stiff degrees of freedom, and the fact that most differences in mean and standard deviation are small compared to the true fluctuation width means that the generated structures are close to physically accurate and could be refined with little additional MD simulation effort.

Figure 5: Normalized bond (a,b) and angle (c,d) distributions of Alanine dipeptide compared to a Gaussian normal distribution (black). (a,c) True MD data. (b,d) Data from trajectories generated by Deep Gen MSMs.

Finally, we perform an experiment to test whether the Deep Gen MSM is able to generate genuinely new configurations that do exist for Alanine dipeptide but have not been seen in the training data. In other words, can the generator extrapolate in a meaningful way? This is a fundamental question, because simulating MD is exorbitantly expensive: each simulation time step is computationally costly but advances time only on the order of $10^{-15}$ seconds, while total simulation timescales of $10^{-3}$ seconds or longer are often needed. A Deep Gen MSM that makes leaps of length $\tau$ orders of magnitude larger than the MD simulation time step and has even a small chance of generating new and meaningful structures would be extremely valuable to discover new states and thereby accelerate MD sampling.

To test this ability, we conduct six experiments, in each of which we remove all data belonging to one of the six metastable states of Alanine dipeptide (Fig. 6a). We train a Deep Gen MSM on each of these datasets separately and simulate it to predict the stationary distribution (Fig. 6b). While the generated stationary distributions are skewed and the shape of the distribution in the $(\phi, \psi)$ range with missing data is not quantitatively predicted, the Deep Gen MSMs do indeed predict configurations where no training data was present (Fig. 6b). Surprisingly, the quality of most of these configurations is high (Fig. 6c). While the structures of the two low-populated states 5-6 do not look realistic, each of the metastable states 1-4 is generated with high quality, as shown by the overlap of a real MD structure and the 100 most similar generated structures (Fig. 6c).

Figure 6: Deep Gen MSMs can generate physically realistic structures in areas that were not included in the training data. (a) Distribution of training data. (b) Generated stationary distribution. (c) Representative real molecular configuration (from MD simulation) in each of the metastable states (sticks and balls), and the 100 closest configurations generated by the Deep Gen MSM (lines).

In conclusion, deep MSMs provide high-quality models of the stationary and kinetic properties of stochastic dynamical systems such as MD simulations. In contrast to other high-quality models such as VAMPnets, the resulting model is truly probabilistic and can thus be physically interpreted and used in a Bayesian framework. For the first time, it was shown that generating dynamical trajectories in a 30-dimensional molecular configuration space results in the sampling of physically realistic molecular structures. While Alanine dipeptide is a small system compared to proteins and other macromolecules of biological interest, our results demonstrate that efficient sampling of new molecular structures is possible with generative dynamical models, and improved methods can be built upon this.
Future methods will especially need to address the difficulty of generating valid configurations in low-probability regimes, and it is likely that the energy distance used here for generator training needs to be revisited to achieve this goal.

Acknowledgements

This work was funded by the European Research Council (ERC CoG "ScaleCell"), Deutsche Forschungsgemeinschaft (CRC 1114/A04, Transregio 186/A12, NO 825/4-1, Dynlon P8), and the "1000-Talent Program of Young Scientists in China".

References

[1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214-223, 2017.
[2] G. R. Bowman, D. L. Ensign, and V. S. Pande. Enhanced modeling via network theory: Adaptive sampling of Markov state models. J. Chem. Theory Comput., 6(3):787-794, 2010.
[3] G. R. Bowman, V. S. Pande, and F. Noé, editors. An Introduction to Markov State Models and Their Application to Long Timescale Molecular Simulation, volume 797 of Advances in Experimental Medicine and Biology. Springer Heidelberg, 2014.
[4] S. L. Brunton, J. L. Proctor, and J. N. Kutz. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proc. Natl. Acad. Sci. USA, 113:3932-3937, 2016.
[5] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.
[6] Martín Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[8] C. X. Hernández, H. K. Wayment-Steele, M. M. Sultan, B. E. Husic, and V. S. Pande. Variational encoding of complex dynamics. arXiv:1711.08576, 2017.
[9] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[10] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[11] Milan Korda and Igor Mezić. On convergence of extended dynamic mode decomposition to the Koopman operator. J. Nonlinear Sci., 28:687-710, 2017.
[12] S. Kube and M. Weber. A coarse graining method for the identification of transition rates between molecular conformations. J. Chem. Phys., 126:024103, 2007.
[13] Q. Li, F. Dietrich, E. M. Bollt, and I. G. Kevrekidis. Extended dynamic mode decomposition with dictionary learning: A data-driven adaptive spectral decomposition of the Koopman operator. Chaos, 27:103111, 2017.
[14] B. Lusch, J. N. Kutz, and S. L. Brunton. Deep learning for universal linear embeddings of nonlinear dynamics. arXiv:1712.09707, 2017.
[15] A. Mardt, L. Pasquali, H. Wu, and F. Noé. VAMPnets: Deep learning of molecular kinetics. Nat. Commun., 9:5, 2018.
[16] I. Mezić. Spectral properties of dynamical systems, model reduction and decompositions. Nonlinear Dynam., 41:309-325, 2005.
[17] F. Noé and C. Clementi. Kinetic distance and kinetic maps from molecular dynamics simulation. J. Chem. Theory Comput., 11:5002-5011, 2015.
[18] F. Noé and F. Nüske. A variational approach to modeling slow processes in stochastic dynamical systems. Multiscale Model. Simul., 11:635-655, 2013.
[19] S. E. Otto and C. W. Rowley. Linearly-recurrent autoencoder networks for learning dynamics. arXiv:1712.01378, 2017.
[20] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
[21] N. Plattner, S. Doerr, G. De Fabritiis, and F. Noé. Protein-protein association and binding mechanism resolved in atomic detail. Nat. Chem., 9:1005-1011, 2017.
[22] J.-H. Prinz, H. Wu, M. Sarich, B. G. Keller, M. Senne, M. Held, J. D. Chodera, C. Schütte, and F. Noé. Markov models of molecular kinetics: Generation and validation. J. Chem. Phys., 134:174105, 2011.
[23] João Marcelo Lamim Ribeiro, Pablo Bravo, Yihang Wang, and Pratyush Tiwary. Reweighted autoencoded variational Bayes for enhanced sampling (RAVE). J. Chem. Phys., 149:072301, 2018.
[24] M. Sarich, F. Noé, and C. Schütte. On the approximation quality of Markov state models. Multiscale Model. Simul., 8:1154-1177, 2010.
[25] M. Sarich and C. Schütte. Metastability and Markov State Models in Molecular Dynamics. Courant Lecture Notes. American Mathematical Society, 2013.
[26] P. J. Schmid and J. Sesterhenn. Dynamic mode decomposition of numerical and experimental data. In 61st Annual Meeting of the APS Division of Fluid Dynamics. American Physical Society, 2008.
[27] W. C. Swope, J. W. Pitera, and F. Suits. Describing protein folding kinetics by molecular dynamics simulations: 1. Theory. J. Phys. Chem. B, 108:6571-6581, 2004.
[28] G. Székely and M. Rizzo. Testing for equal distributions in high dimension. InterStat, 5, 2004.
[29] J. H. Tu, C. W. Rowley, D. M. Luchtenburg, S. L. Brunton, and J. N. Kutz. On dynamic mode decomposition: Theory and applications. J. Comput. Dyn., 1(2):391-421, 2014.
[30] C. Wehmeyer and F. Noé. Time-lagged autoencoders: Deep learning of slow collective variables for molecular kinetics. arXiv:1710.11239, 2017.
[31] H. Wu and F. Noé. Variational approach for learning Markov processes from time series data. arXiv:1707.04659, 2017.