{"title": "Deep Generative Markov State Models", "book": "Advances in Neural Information Processing Systems", "page_first": 3975, "page_last": 3984, "abstract": "We propose a deep generative Markov State Model (DeepGenMSM) learning framework for inference of metastable dynamical systems and prediction of trajectories. After unsupervised training on time series data, the model contains (i) a probabilistic encoder that maps from high-dimensional configuration space to a small-sized vector indicating the membership to metastable (long-lived) states, (ii) a Markov chain that governs the transitions between metastable states and facilitates analysis of the long-time dynamics, and (iii) a generative part that samples the conditional distribution of configurations in the next time step. The model can be operated in a recursive fashion to generate trajectories to predict the system evolution from a defined starting state and propose new configurations. The DeepGenMSM is demonstrated to provide accurate estimates of the long-time kinetics and generate valid distributions for molecular dynamics (MD) benchmark systems. Remarkably, we show that DeepGenMSMs are able to make long time-steps in molecular configuration space and generate physically realistic structures in regions that were not seen in training data.", "full_text": "Deep Generative Markov State Models\n\nHao Wu1,2,\u2217, Andreas Mardt1,\u2217, Luca Pasquali1,\u2217, and Frank Noe1,\u2020\n\n1Dept. of Mathematics and Computer Science, Freie Universit\u00e4t Berlin, 14195 Berlin, Germany\n\n2School of Mathematical Sciences, Tongji University, Shanghai, 200092, P.R. China\n\nAbstract\n\nWe propose a deep generative Markov State Model (DeepGenMSM) learning\nframework for inference of metastable dynamical systems and prediction of tra-\njectories. 
After unsupervised training on time series data, the model contains (i) a probabilistic encoder that maps from high-dimensional configuration space to a small-sized vector indicating the membership to metastable (long-lived) states, (ii) a Markov chain that governs the transitions between metastable states and facilitates analysis of the long-time dynamics, and (iii) a generative part that samples the conditional distribution of configurations in the next time step. The model can be operated in a recursive fashion to generate trajectories to predict the system evolution from a defined starting state and propose new configurations. The DeepGenMSM is demonstrated to provide accurate estimates of the long-time kinetics and generate valid distributions for molecular dynamics (MD) benchmark systems. Remarkably, we show that DeepGenMSMs are able to make long time-steps in molecular configuration space and generate physically realistic structures in regions that were not seen in training data.

1 Introduction

Complex dynamical systems that exhibit events on vastly different timescales are ubiquitous in science and engineering. For example, molecular dynamics (MD) of biomolecules involve fast vibrations on timescales of 10^-15 seconds, while their biological function is often related to the rare switching events between long-lived states on timescales of 10^-3 seconds or longer. In weather and climate systems, local fluctuations in temperature and pressure fields occur within minutes or hours, while global changes are often subject to periodic motion and drift over years or decades. Primary goals in the analysis of complex dynamical systems include:

1. Deriving an interpretable model of the essential long-time dynamical properties of these systems, such as the stationary behavior or lifetimes/cycle times of slow processes.

2. Simulating the dynamical system, e.g., to predict the system's future evolution or to sample previously unobserved system configurations.

∗H. Wu, A. Mardt and L. Pasquali equally contributed to this work.
†Author to whom correspondence should be addressed. Electronic mail: frank.noe@fu-berlin.de.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

A state-of-the-art approach for the first goal is to learn a Markovian model from time-series data, which is theoretically justified by the fact that physical systems are inherently Markovian. In practice, the long-time behavior of dynamical systems can be accurately described in a Markovian model when suitable features or variables are used, and when the time resolution of the model is sufficiently coarse such that the time-evolution can be represented with a manageable number of dynamical modes [24, 11]. In stochastic dynamical systems, such as MD simulation, variants of Markov state models (MSMs) are commonly used [3, 25, 22]. In MSMs, the configuration space is discretized, e.g., using a clustering method, and the dynamics between clusters are then described by a matrix of transition probabilities [22]. The analogous approach for deterministic dynamical systems such as complex fluid flows is called Koopman analysis, where time propagation is approximated by a linear model in a suitable function space transformation of the flow variables [16, 26, 29, 4]. The recently proposed VAMPnets learn an optimal feature transformation from full configuration space to a low-dimensional latent space in which the Markovian model is built by variational optimization of a neural network [15]. When the VAMPnet has a probabilistic output (e.g. SoftMax layer), the Markovian model conserves probability, but is not guaranteed to be a valid transition probability matrix with nonnegative elements. 
A related work for deterministic dynamical systems is Extended Dynamic Mode Decomposition with dictionary learning [13]. All of these methods are purely analytic, i.e. they learn a reduced model of the dynamical system underlying the observed time series, but they miss a generative part that could be used to sample new time series in the high-dimensional configuration space.

Recently, several learning frameworks for dynamical systems have been proposed that partially address the second goal by including a decoder from the latent space back to the space of input features. Most of these methods primarily aim at obtaining a low-dimensional latent space that encodes the long-time behavior of the system, and the decoder takes the role of defining or regularizing the learning problem [30, 8, 14, 19, 23]. In particular, none of these models have demonstrated the ability to generate viable structures in the high-dimensional configuration space, such as a molecular structure with realistic atom positions in 3D. Finally, some of these models learn a linear model of the long-timescale dynamics [14, 19], but none of them provide a probabilistic dynamical model that can be employed in a Bayesian framework. Learning the correct long-time dynamical behavior with a generative dynamical model is difficult, as demonstrated in [8].

Here, we address these aforementioned gaps by providing a deep learning framework that learns, based on time-series data, the following components:

1. Probabilistic encodings of the input configuration to a low-dimensional latent space by neural networks, xt → χ(xt).

2. A true transition probability matrix K describing the system dynamics in latent space for a fixed time-lag τ:

E[χ(xt+τ)] = E[K⊤(τ) χ(xt)].

The probabilistic nature of the method allows us to train it with likelihood maximization and embed it into a Bayesian framework. 
In our benchmarks, the transition probability matrix approximates the long-time behavior of the underlying dynamical system with high accuracy.

3. A generative model from latent vectors back to configurations, allowing us to sample the transition density P(xt+τ | xt) and thus propagate the model in configuration space. We show for the first time that this allows us to sample genuinely new and valid molecular structures that have not been included in the training data. This makes the method promising for performing active learning in MD [2, 21], and to predict the future evolution of the system in other contexts.

2 Deep Generative Markov State Models

Given two configurations x, y ∈ R^d, where R^d is a potentially high-dimensional space of system configurations (e.g. the positions of atoms in a molecular system), Markovian dynamics are defined by the transition density P(xt+τ = y | xt = x). Here we represent the transition density between m states in the following form (Fig. 1):

P(xt+τ = y | xt = x) = χ(x)⊤ q(y; τ) = Σ_{i=1}^m χi(x) qi(y; τ).   (1)

Here, χ(x)⊤ = [χ1(x), ..., χm(x)] represents the probability of configuration x to be in a metastable (long-lived) state i:

χi(x) = P(xt ∈ state i | xt = x).

Consequently, these functions are nonnegative (χi(x) ≥ 0 ∀x) and sum up to one (Σ_i χi(x) = 1 ∀x). The functions χ(x) can, e.g., be represented by a neural network mapping from R^d to R^m with a SoftMax output layer.

Figure 1: Schematic of Deep Generative Markov State Models (DeepGenMSMs) and the rewiring trick. The function χ, here represented by neural networks, maps the time-lagged input configurations to metastable states whose dynamics are governed by a transition probability matrix K. The generator samples the distribution xt+τ ∼ q by employing a generative network that can produce novel configurations (or by resampling xt+τ in DeepResampleMSMs). The rewiring trick consists of reconnecting the probabilistic networks q and χ such that the time propagation in latent space can be sampled: From the latent state χ(xt), we generate a time-lagged configuration xt+τ using q, and then transform it back to the latent space, χ(xt+τ). Each application of the rewired network samples the latent space transitions, thus providing the statistics to estimate the Markov model transition matrix K(τ), which is needed for analysis. This trick allows K(τ) to be estimated with desired constraints, such as detailed balance.

Additionally, we have the probability densities

qi(y; τ) = P(xt+τ = y | xt ∈ state i)

that define the probability density of the system to "land" at configuration y after making one time-step. We thus briefly call them "landing densities".

2.1 Kinetics

Before addressing how to estimate χ and q from data, we describe how to perform the standard calculations and analyses that are common in the Markov modeling field for a model of the form (1).

In Markov modeling, one is typically interested in the kinetics of the system, i.e. the long-time behavior of the dynamics. This is captured by the elements of the transition matrix K = [kij] between metastable states. K can be computed as follows: the product of the probability density to jump from metastable state i to a configuration y and the probability that this configuration belongs to metastable state j, integrated over the whole configuration space:

kij(τ) = ∫ qi(y; τ) χj(y) dy.   (2)

Practically, this calculation is implemented via the "rewiring trick" shown in Fig. 
1, where the configuration space integral is approximated by drawing samples from the generator. The estimated probabilistic functions q and χ define, by construction, a valid transition probability matrix K, i.e. kij ≥ 0 and Σ_j kij = 1. As a result, the proposed models have a structural advantage over other high-accuracy Markov state modeling approaches that define metastable states in a fuzzy or probabilistic manner but do not guarantee a valid transition matrix [12, 15] (see Supplementary Material for more details).

The stationary (equilibrium) probabilities of the metastable states are given by the vector π = [πi] that solves the eigenvalue problem with eigenvalue λ1 = 1:

π = K⊤ π,   (3)

and the stationary (equilibrium) distribution in configuration space is given by:

µ(y) = Σ_i πi qi(y; τ) = π⊤ q(y; τ).   (4)

Finally, for a fixed definition of states via χ, the self-consistency of Markov models may be tested using the Chapman-Kolmogorov equation

K^n(τ) ≈ K(nτ),   (5)

which involves estimating the functions q(y; nτ) at different lag times nτ and comparing the resulting transition matrices with the nth power of the transition matrix obtained at lag time τ. A consequence of Eq. (5) is that the relaxation times

ti(τ) = − τ / log |λi(τ)|   (6)

are independent of the lag time τ at which K is estimated [27]. Here, λi with i = 2, ..., m are the nontrivial eigenvalues of K.

2.2 Maximum Likelihood (ML) learning of DeepResampleMSM

Given trajectories {xt}_{t=1,...,T}, how do we estimate the membership probabilities χ(x), and how do we learn and sample the landing densities q(y)? We start with a model where the q(y) are directly derived from the observed (empirical) distribution, i.e. they are point densities on the input configurations {xt}, given by:

qi(y) = (1/γ̄i) γi(y) ρ(y).   (7)

Here, ρ(y) is the empirical distribution, which in the case of finite sample size is simply ρ(y) = (1/(T−τ)) Σ_{t=1}^{T−τ} δ(y − xt+τ), and γi(y) is a trainable weighting function. The normalization factor γ̄i = (1/(T−τ)) Σ_{t=1}^{T−τ} γi(xt+τ) = E_{y∼ρ1}[γi(y)] ensures ∫ qi(y) dy = 1.

Now we can optimize χi and γi by maximizing the likelihood (ML) of generating the pairs (xt, xt+τ) observed in the data. The log-likelihood is given by:

LL = Σ_{t=1}^{T−τ} ln( Σ_{i=1}^m χi(xt) γ̄i^{−1} γi(xt+τ) ),   (8)

and is maximized to train a deep MSM with the structure shown in Fig. 1.

Alternatively, we can optimize χi and γi using the Variational Approach for Markov Processes (VAMP) [31]. However, we found the ML approach to perform significantly better in our tests, and we thus include the VAMP training approach only in the Supplementary Material without elaborating on it further.

Given the networks χ and γ, we compute q from Eq. (7). Employing the rewiring trick shown in Fig. 
1 results in computing the transition matrix by a simple average over all configurations:

K = (1/N) Σ_{t=τ}^{T−τ} q(xt+τ) χ(xt+τ)⊤.   (9)

The deep MSMs described in this section are neural network generalizations of traditional MSMs: they learn a mapping from configurations to metastable states, where they aim at obtaining a good approximation of the kinetics of the underlying dynamical system, by means of the transition matrix K. However, since the landing distribution q in these methods is derived from the empirical distribution (7), any generated trajectory will only resample configurations from the input data. To highlight this property, we will refer to the deep MSMs with the present methodology as DeepResampleMSMs.

2.3 Energy Distance learning of DeepGenMSM

In contrast to DeepResampleMSMs, we now want to learn deep generative MSMs (DeepGenMSMs), which can be used to generate trajectories that do not only resample from input data, but can produce genuinely new configurations. To this end, we train a generative model to mimic the empirical distribution qi(y):

y = G(ei, ε),   (10)

where the vector ei ∈ R^m is a one-hot encoding of the metastable state, and ε is an i.i.d. random vector whose components are sampled from a standard normal distribution.

Here we train the generator G by minimizing the conditional Energy Distance (ED), whose choice is motivated in the Supplementary Material. The standard ED, introduced in [28], is a metric between the distributions of random vectors, defined as

DE(P(x), P(y)) = E[2‖x − y‖ − ‖x − x′‖ − ‖y − y′‖]   (11)

for two real-valued random variables x and y, where x, x′, y, y′ are independently distributed according to the distributions of x, y. 
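As an illustration, the standard energy distance above can be estimated directly from two sets of samples. The following sketch (our own helper, not part of the paper's published code) computes the sample estimate in NumPy, excluding the diagonal in the within-set terms so that x ≠ x′ and y ≠ y′:

```python
import numpy as np

def energy_distance(x, y):
    """Sample estimate of D_E = E[2||x - y|| - ||x - x'|| - ||y - y'||]
    for two point clouds x of shape (n, d) and y of shape (m, d)."""
    d_xy = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    d_xx = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    d_yy = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)
    n, m = len(x), len(y)
    # within-set terms average over distinct pairs only; the diagonals
    # are zero, so summing and dividing by n*(n-1) drops them correctly
    return (2.0 * d_xy.mean()
            - d_xx.sum() / (n * (n - 1))
            - d_yy.sum() / (m * (m - 1)))

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(500, 2))
b = rng.normal(0.0, 1.0, size=(500, 2))  # same law as a: distance near zero
c = rng.normal(3.0, 1.0, size=(500, 2))  # shifted law: clearly positive
print(energy_distance(a, b), energy_distance(a, c))
```

For samples drawn from the same distribution the estimate fluctuates around zero, while for the shifted distribution it is clearly positive, which is the property the generator training exploits.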
Based on this metric, we introduce the conditional energy distance between the transition density of the system and that of the generative model:

D := E[ DE( P(xt+τ | xt), P(x̂t+τ | xt) ) | xt ]
   = E[ 2‖x̂t+τ − xt+τ‖ − ‖x̂t+τ − x̂′t+τ‖ − ‖xt+τ − x′t+τ‖ ].   (12)

Here xt+τ and x′t+τ are distributed according to the transition density for given xt, and x̂t+τ, x̂′t+τ are independent outputs of the generative model conditioned on xt. Implementing the expectation value with an empirical average results in an estimate for D that is unbiased, up to an additive constant. We train G to minimize D. See the Supplementary Material for detailed derivations and the training algorithm used.

After training, the transition matrix can be obtained by using the rewiring trick (Fig. 1), where the configuration space integral is sampled by generating samples from the generator:

[K]ij = Eε[ χj(G(ei, ε)) ].   (13)

3 Results

Below we establish our framework by applying it to two well-defined benchmark systems that exhibit metastable stochastic dynamics. We validate the stationary distribution and kinetics by computing χ(x), q(y), the stationary distribution µ(y) and the relaxation times ti(τ), and comparing them with reference solutions. We will also test the abilities of DeepGenMSMs to generate physically valid molecular configurations.

The networks were implemented using PyTorch [20] and Tensorflow [6]. 
For the full code and all details about the neural network architecture, hyper-parameters and training algorithm, please refer to https://github.com/markovmodel/deep_gen_msm.

3.1 Diffusion in Prinz potential

We first apply our framework to the time-discretized diffusion process xt+Δt = xt − Δt ∇V(xt) + √(2Δt) ηt with Δt = 0.01 and Gaussian white noise ηt, in the Prinz potential V(x) introduced in [22] (Fig. 2a). For this system we know exact results for benchmarking: the stationary distribution and relaxation timescales (black lines in Fig. 2b,c) and the transition density (Fig. 2d). We simulate trajectories of lengths 250,000 and 125,000 time steps for training and validation, respectively. For all methods, we repeat the data generation and model estimation process 10 times and compute means and standard deviations for all quantities of interest, which thus represent the mean and variance of the estimators.

The functions χ, γ and G are represented with densely connected neural networks. The details of the architecture and the training procedure can be found in the Supplementary Information.

We compare DeepResampleMSMs and DeepGenMSMs with standard MSMs using four or ten states obtained with k-means clustering. Note that standard MSMs do not directly operate on configuration space. When using an MSM, the transition density (Eq. 1) is thus simulated by:

xt → i (via χ(xt)),   i → j (sampling j ∼ Ki,∗),   j → xt+τ (sampling xt+τ ∼ ρj(y)),

i.e., we find the cluster i associated with a configuration xt, which is deterministic for regular MSMs, then sample the cluster j at the next time-step, and sample from the conditional distribution of configurations in cluster j to generate xt+τ.

Both DeepResampleMSMs trained with the ML method and standard MSMs can reproduce the stationary distribution within statistical uncertainty (Fig. 2b). 
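The Euler-Maruyama time-stepping used to generate such benchmark trajectories can be sketched as follows. Note that we substitute a simple double-well potential V(x) = (x² − 1)² for the Prinz potential (whose exact form is given in [22]), so the potential in this sketch is an illustrative assumption only; the update rule matches the scheme above:

```python
import numpy as np

def grad_V(x):
    # gradient of the illustrative double-well V(x) = (x**2 - 1)**2;
    # the actual benchmark uses the Prinz potential from ref. [22]
    return 4.0 * x * (x**2 - 1.0)

def simulate(n_steps, dt=0.01, x0=0.0, seed=0):
    """Euler-Maruyama: x_{t+dt} = x_t - dt * V'(x_t) + sqrt(2*dt) * eta_t."""
    rng = np.random.default_rng(seed)
    noise = np.sqrt(2.0 * dt) * rng.normal(size=n_steps)
    traj = np.empty(n_steps)
    x = x0
    for t in range(n_steps):
        x = x - dt * grad_V(x) + noise[t]
        traj[t] = x
    return traj

traj = simulate(250_000)
# the process is metastable: it dwells near the two minima at x = -1 and x = +1
print(np.mean(np.abs(traj) > 0.5))
```

Such a trajectory spends most of its time near the potential minima with rare barrier crossings, which is exactly the metastable behavior the MSMs above are designed to capture.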
For long lag times τ, all methods converge from below to the correct relaxation timescales (Fig. 2c), as expected from theory [22, 18]. When using equally many states (here: four), the DeepResampleMSM has a much lower bias in the relaxation timescales than the standard MSM. This is expected from approximation theory, as the DeepResampleMSM represents the four metastable states with meaningful, smooth membership functions χ(xt), while the four-state MSM cuts the memberships hard at boundaries with low sample density (Supplementary Fig. 1). When increasing the number of metastable states, the bias of all estimators will reduce. An MSM with ten states is needed to perform approximately equal to a four-state DeepResampleMSM (Fig. 2c). All subsequent analyses use a lag time of τ = 5.

Figure 2: Performance of deep versus standard MSMs for diffusion in the Prinz Potential. (a) Potential energy as a function of position x. (b) Stationary distribution estimates of all methods with the exact distribution (black). (c) Implied timescales of the Prinz potential compared to the real ones (black line). (d) True transition density and approximations using maximum likelihood (ML) DeepResampleMSM, four and ten state MSMs. (e) KL-divergence of the stationary and transition distributions with respect to the true ones for all presented methods (also DeepGenMSM).

Figure 3: Performance of DeepGenMSMs for diffusion in the Prinz Potential. Comparison between exact reference (black), DeepGenMSMs estimated using only energy distance (ED) or combined ML-ED training. (a) Stationary distribution. (b-d) Transition densities. (e) Relaxation timescales.

The DeepResampleMSM generates a transition density that is very similar to the exact density, while the MSM transition densities are coarse-grained by virtue of the fact that χ(xt) performs a hard clustering in an MSM (Fig. 2d). 
This impression is confirmed when computing the Kullback-Leibler divergence of the distributions (Fig. 2e).

Encouraged by the accurate results of DeepResampleMSMs, we now train DeepGenMSMs, either by training both the χ and q networks by minimizing the energy distance (ED), or by taking χ from a ML-trained DeepResampleMSM and only training the q network by minimizing the energy distance (ML-ED). The stationary densities, relaxation timescales and transition densities can still be approximated in these settings, although the DeepGenMSMs exhibit larger statistical fluctuations than the resampling MSMs (Fig. 3). ML-ED appears to perform slightly better than ED alone, likely because reusing χ from the ML training makes the problem of training the generator easier.

For a one-dimensional example like the Prinz potential, learning a generative model does not provide any added value, as the distributions can be well approximated by the empirical distributions. The fact that we can still get approximately correct results for stationary, kinetic and dynamical properties encourages us to use DeepGenMSMs for a higher-dimensional example, where the generation of configurations is a hard problem.

3.2 Alanine dipeptide

We use explicit-solvent MD simulations of Alanine dipeptide as a second example. Our aim is to learn stationary and kinetic properties, but especially to learn a generative model that generates genuinely novel but physically meaningful configurations. One 250 ns trajectory with a storage interval of 1 ps is used and split 80%/20% for training and validation; see [15] for details of the simulation setup. We characterize all structures by the three-dimensional Cartesian coordinates of the heavy atoms, resulting in a 30-dimensional configuration space. 
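As a minimal sketch of this featurization (the array shapes are our own illustration, and a random array stands in for the actual MD trajectory), the heavy-atom coordinates of each frame can be flattened into 30-dimensional vectors and paired at a chosen lag:

```python
import numpy as np

# Alanine dipeptide has 10 heavy atoms, so each trajectory frame becomes a
# 30-dimensional vector of Cartesian coordinates; training then operates on
# time-lagged pairs (x_t, x_{t+tau}). Random data stands in for real MD frames.
rng = np.random.default_rng(0)
n_frames, n_heavy = 1_000, 10
traj = rng.normal(size=(n_frames, n_heavy, 3))  # (frames, atoms, xyz)

X = traj.reshape(n_frames, -1)    # (1000, 30) configuration vectors
tau = 25                          # lag in storage intervals of 1 ps, i.e. 25 ps
x_t, x_t_tau = X[:-tau], X[tau:]  # aligned time-lagged training pairs

print(X.shape, x_t.shape, x_t_tau.shape)
```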
While we do not have exact results for Alanine dipeptide, the system is small enough and well enough sampled that high-quality estimates of stationary and kinetic properties can be obtained from a very fine MSM [22]. We therefore define an MSM built on 400 equally sized grid areas in the (φ, ψ)-plane, at a lag time of τ = 25 ps, as a reference that has been validated by established methods [22].

Figure 4: Performance of DeepResampleMSM and DeepGenMSMs versus standard MSMs on the Alanine dipeptide simulation trajectory. (a) Data distribution and stationary distributions from reference MSM, DeepResampleMSM, and DeepGenMSM. (b) State classification by DeepResampleMSM. (c) Relaxation timescales.

Neural network and training details are again found at the git repository and in the Supplementary Information.

For comparison with deep MSMs, we build two standard MSMs following a state-of-the-art protocol: we transform input configurations with a kinetic map preserving 95% of the cumulative kinetic variance [17], followed by k-means clustering, where k = 6 and k = 100 are used.

DeepResampleMSMs trained with the ML method approximate the stationary distribution very well (Fig. 4a). The reference MSM assigns a slightly lower weight to the lowest-populated state 6, but otherwise the data, reference distribution and deep MSM distribution are visually indistinguishable. The relaxation timescales estimated by a six-state DeepResampleMSM are significantly better than with six-state standard MSMs. MSMs with 100 states have a performance similar to the deep MSMs, but this comes at the cost of a model with a much larger latent space.

Finally, we test DeepGenMSMs for Alanine dipeptide, where χ is trained with the ML method and the generator is then trained using ED (ML-ED). 
Simulating the DeepGenMSM recursively results in a stationary distribution that is very similar to the reference distribution in states 1-4 with small φ values (Fig. 4a). States 5 and 6 with large φ values are captured, but their shapes and weights are somewhat distorted (Fig. 4a). The one-step transition densities predicted by the generator are of high quality for all states (Suppl. Fig. 2); thus the differences observed for the stationary distribution must come from small errors made in the transitions between metastable states, which are very rarely observed for states 5 and 6. These rare events result in poor training data for the generator. However, the DeepGenMSMs approximate the kinetics well within the uncertainty, which is mostly due to estimator variance (Fig. 4c).

Now we ask whether DeepGenMSMs can sample valid structures in the 30-dimensional configuration space, i.e., whether the placement of atoms is physically meaningful. As we generate configurations in Cartesian space, we first check if the internal coordinates are physically viable by comparing all bond lengths and angles between real MD data and generated trajectories (Fig. 5). The true bond lengths and angles are almost perfectly Gaussian distributed, and we thus normalize them by shifting each distribution to a mean of 0 and scaling it to a standard deviation of 1, which causes all reference distributions to collapse to a standard normal distribution (Fig. 5a,c). We normalize the generated distributions with the mean and standard deviation of the true data. Although there are clear differences (Fig. 5b,d), these distributions are very encouraging. 
Bonds and angles are very stiff degrees of freedom, and the fact that most differences in mean and standard deviation are small when compared to the true fluctuation width means that the generated structures are close to physically accurate and could be refined with little additional MD simulation effort.

Figure 5: Normalized bond (a,b) and angle (c,d) distributions of Alanine dipeptide compared to a Gaussian normal distribution (black). (a,c) True MD data. (b,d) Data from trajectories generated by DeepGenMSMs.

Figure 6: DeepGenMSMs can generate physically realistic structures in areas that were not included in the training data. (a) Distribution of training data. (b) Generated stationary distribution. (c) Representative "real" molecular configuration (from MD simulation) in each of the metastable states (sticks and balls), and the 100 closest configurations generated by the DeepGenMSM (lines).

Finally, we perform an experiment to test whether the DeepGenMSM is able to generate genuinely new configurations that do exist for Alanine dipeptide but have not been seen in the training data. In other words, can the generator "extrapolate" in a meaningful way? This is a fundamental question, because simulating MD is exorbitantly expensive: each simulation time step is computationally expensive but progresses time only on the order of 10^-15 seconds, while total simulation timescales of 10^-3 seconds or longer are often needed. A DeepGenMSM that makes leaps of length τ, orders of magnitude larger than the MD simulation time-step, and has even a small chance of generating new and meaningful structures would be extremely valuable to discover new states and thereby accelerate MD sampling.

To test this ability, we conduct six experiments, in each of which we remove all data belonging to one of the six metastable states of Alanine dipeptide (Fig. 6a). 
We train a DeepGenMSM with each of these datasets separately, and simulate it to predict the stationary distribution (Fig. 6b). While the generated stationary distributions are skewed and the shape of the distribution in the (φ, ψ) range with missing data is not quantitatively predicted, the DeepGenMSMs do indeed predict configurations where no training data was present (Fig. 6b). Surprisingly, the quality of most of these configurations is high (Fig. 6c). While the structures of the two low-populated states 5-6 do not look realistic, each of the metastable states 1-4 is generated with high quality, as shown by the overlap of a real MD structure and the 100 most similar generated structures (Fig. 6c).

In conclusion, deep MSMs provide high-quality models of the stationary and kinetic properties for stochastic dynamical systems such as MD simulations. In contrast to other high-quality models such as VAMPnets, the resulting model is truly probabilistic and can thus be physically interpreted and used in a Bayesian framework. For the first time, it was shown that generating dynamical trajectories in a 30-dimensional molecular configuration space results in the sampling of physically realistic molecular structures. While Alanine dipeptide is a small system compared to proteins and other macromolecules that are of biological interest, our results demonstrate that efficient sampling of new molecular structures is possible with generative dynamic models, and improved methods can be built upon this. 
Future methods will especially need to address the difficulties of generating valid configurations in low-probability regimes, and it is likely that the energy distance used here for generator training needs to be revisited to achieve this goal.

Acknowledgements This work was funded by the European Research Commission (ERC CoG "ScaleCell"), Deutsche Forschungsgemeinschaft (CRC 1114/A04, Transregio 186/A12, NO 825/4-1, Dynlon P8), and the "1000-Talent Program of Young Scientists in China".

References

[1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.

[2] G. R. Bowman, D. L. Ensign, and V. S. Pande. Enhanced Modeling via Network Theory: Adaptive Sampling of Markov State Models. J. Chem. Theory Comput., 6(3):787–794, 2010.

[3] G. R. Bowman, V. S. Pande, and F. Noé, editors. An Introduction to Markov State Models and Their Application to Long Timescale Molecular Simulation, volume 797 of Advances in Experimental Medicine and Biology. Springer Heidelberg, 2014.

[4] S. L. Brunton, J. L. Proctor, and J. N. Kutz. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proc. Natl. Acad. Sci. USA, 113:3932–3937, 2016.

[5] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.

[6] Martín Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[8] C. X. Hernández, H. K. Wayment-Steele, M. M. Sultan, B. E. Husic, and V. S. Pande. Variational encoding of complex dynamics. arXiv:1711.08576, 2017.

[9] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[10] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

[11] Milan Korda and Igor Mezic. On convergence of extended dynamic mode decomposition to the koopman operator. J. Nonlinear Sci., 28:687–710, 2017.

[12] S. Kube and M. Weber. A coarse graining method for the identification of transition rates between molecular conformations. J. Chem. Phys., 126:024103, 2007.

[13] Q. Li, F. Dietrich, E. M. Bollt, and I. G. Kevrekidis. Extended dynamic mode decomposition with dictionary learning: a data-driven adaptive spectral decomposition of the koopman operator. Chaos, 27:103111, 2017.

[14] B. Lusch, S. L. Brunton, and J. N. Kutz. Deep learning for universal linear embeddings of nonlinear dynamics. arXiv:1712.09707, 2017.

[15] A. Mardt, L. Pasquali, H. Wu, and F. Noé. Vampnets: Deep learning of molecular kinetics. Nat. Commun., 9:5, 2018.

[16] I. Mezić. Spectral properties of dynamical systems, model reduction and decompositions. Nonlinear Dynam., 41:309–325, 2005.

[17] F. Noé and C. Clementi. Kinetic distance and kinetic maps from molecular dynamics simulation. J. Chem. Theory Comput., 11:5002–5011, 2015.

[18] F. Noé and F. Nüske. A variational approach to modeling slow processes in stochastic dynamical systems. Multiscale Model. Simul., 11:635–655, 2013.

[19] S. E. Otto and C. W. Rowley. Linearly-recurrent autoencoder networks for learning dynamics. arXiv:1712.01378, 2017.

[20] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.

[21] N. Plattner, S. Doerr, G. De Fabritiis, and F. Noé. Protein-protein association and binding mechanism resolved in atomic detail. Nat. Chem., 9:1005–1011, 2017.

[22] J.-H. Prinz, H. Wu, M. Sarich, B. G. Keller, M. Senne, M. Held, J. D. Chodera, C. Schütte, and F. Noé. Markov models of molecular kinetics: Generation and validation. J. Chem. Phys., 134:174105, 2011.

[23] João Marcelo Lamim Ribeiro, Pablo Bravo, Yihang Wang, and Pratyush Tiwary. Reweighted autoencoded variational bayes for enhanced sampling (rave). J. Chem. Phys., 149:072301, 2018.

[24] M. Sarich, F. Noé, and C. Schütte. On the approximation quality of markov state models. Multiscale Model. Simul., 8:1154–1177, 2010.

[25] M. Sarich and C. Schütte. Metastability and Markov State Models in Molecular Dynamics. Courant Lecture Notes. American Mathematical Society, 2013.

[26] P. J. Schmid and J. Sesterhenn. Dynamic mode decomposition of numerical and experimental data. In 61st Annual Meeting of the APS Division of Fluid Dynamics. American Physical Society, 2008.

[27] W. C. Swope, J. W. Pitera, and F. Suits. Describing protein folding kinetics by molecular dynamics simulations: 1. Theory. J. Phys. Chem. B, 108:6571–6581, 2004.

[28] G. Székely and M. Rizzo. Testing for equal distributions in high dimension. InterStat, 5, 2004.

[29] J. H. Tu, C. W. Rowley, D. M. Luchtenburg, S. L. Brunton, and J. N. Kutz. On dynamic mode decomposition: Theory and applications. J. Comput. Dyn., 1(2):391–421, 2014.

[30] C. Wehmeyer and F. Noé. 
Time-lagged autoencoders: Deep learning of slow collective vari-\n\nables for molecular kinetics. arXiv:1710.11239, 2017.\n\n[31] H. Wu and F. No\u00e9. Variational approach for learning markov processes from time series data.\n\narXiv:1707.04659, 2017.\n\n10\n\n\f", "award": [], "sourceid": 1973, "authors": [{"given_name": "Hao", "family_name": "Wu", "institution": "Freie Universit\u00e4t Berlin"}, {"given_name": "Andreas", "family_name": "Mardt", "institution": null}, {"given_name": "Luca", "family_name": "Pasquali", "institution": "Freie Universitat Berlin"}, {"given_name": "Frank", "family_name": "Noe", "institution": "FU Berlin"}]}