{"title": "ODE2VAE: Deep generative second order ODEs with Bayesian neural networks", "book": "Advances in Neural Information Processing Systems", "page_first": 13412, "page_last": 13421, "abstract": "We present Ordinary Differential Equation Variational Auto-Encoder (ODE2VAE), a latent second order ODE model for high-dimensional sequential data. Leveraging the advances in deep generative models, ODE2VAE can simultaneously learn the embedding of high dimensional trajectories and infer arbitrarily complex continuous-time latent dynamics. Our model explicitly decomposes the latent space into momentum and position components and solves a second order ODE system, which is in contrast to recurrent neural network (RNN) based time series models and recently proposed black-box ODE techniques. In order to account for uncertainty, we propose probabilistic latent ODE dynamics parameterized by deep Bayesian neural networks. We demonstrate our approach on motion capture, image rotation, and bouncing balls datasets. We achieve state-of-the-art performance in long term motion prediction and imputation tasks.", "full_text": "ODE2VAE: Deep generative second order ODEs\n\nwith Bayesian neural networks\n\nC\u00b8 a\u02d8gatay Y\u0131ld\u0131z1, Markus Heinonen1,2, Harri L\u00a8ahdesm\u00a8aki1\n\n{cagatay.yildiz, markus.o.heinonen, harri.lahdesmaki}@aalto.fi\n\nDepartment of Computer Science\nAalto University, Finland, FI-00076\n\nAbstract\n\nWe present Ordinary Differential Equation Variational Auto-Encoder (ODE2VAE),\na latent second order ODE model for high-dimensional sequential data. Lever-\naging the advances in deep generative models, ODE2VAE can simultaneously\nlearn the embedding of high dimensional trajectories and infer arbitrarily complex\ncontinuous-time latent dynamics. 
Our model explicitly decomposes the latent space into momentum and position components and solves a second order ODE system, which is in contrast to recurrent neural network (RNN) based time series models and recently proposed black-box ODE techniques. In order to account for uncertainty, we propose probabilistic latent ODE dynamics parameterized by deep Bayesian neural networks. We demonstrate our approach on motion capture, image rotation and bouncing balls datasets. We achieve state-of-the-art performance in long term motion prediction and imputation tasks.\n\n1 Introduction\n\nRepresentation learning has always been one of the most prominent problems in machine learning. Leveraging the advances in deep learning, variational auto-encoders (VAEs) have recently been applied to several challenging datasets to extract meaningful representations. Various extensions to the vanilla VAE have achieved state-of-the-art performance in hierarchical organization of latent spaces, disentanglement and semi-supervised learning (Tschannen et al., 2018).\nVAE based techniques usually assume static data, in which each data item is associated with a single latent code; auto-encoder models for sequential data have therefore been overlooked. More recently, there have been attempts to use recurrent neural network (RNN) encoders and decoders for tasks such as representation learning, classification and forecasting (Srivastava et al., 2015; Lotter et al., 2016; Hsu et al., 2017; Li and Mandt, 2018). Other than neural ordinary differential equations (ODEs) (Chen et al., 2018b) and Gaussian process prior VAEs (GPPVAE) (Casale et al., 2018), the aforementioned methods operate in discrete time, which is in contrast to most real-world datasets, and fail to produce plausible long-term forecasts (Karl et al., 2016).\nIn this paper, we propose ODE2VAEs that extend VAEs for sequential data with a latent space governed by a continuous-time probabilistic ODE. 
We propose a powerful second order ODE that allows modelling the latent ODE state decomposed into position and momentum components. To handle uncertainty in the dynamics and avoid overfitting, we parameterise our latent continuous-time dynamics with deep Bayesian neural networks and optimize the model using variational inference. We show state-of-the-art performance in learning, reproducing and forecasting high-dimensional sequential systems, such as image sequences. An implementation of our experiments and generated video sequences are provided at https://github.com/cagatayyildiz/ODE2VAE.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\f2 Probabilistic second-order ODEs\n\nWe tackle the problem of learning low-rank latent representations of possibly high-dimensional sequential data trajectories. We assume data sequences x0:N := (x0, x1, . . . , xN ) with individual frames xk \u2208 RD observed at time points t0, . . . , tN . We will present the methodology for a single data sequence x0:N for notational simplicity, but it is straightforward to extend our method to multiple sequences. The observations are often at discrete spacings, such as individual images in a video sequence, but our model also generalizes to irregular sampling.\nWe assume that there exists an underlying generative low-dimensional continuous-time dynamical system, which we aim to uncover. Our goal is to learn latent representations zt \u2208 Rd of the sequence dynamics with d \u226a D, and reconstruct observations xt \u2208 RD for missing frame imputation and for forecasting the system beyond the observed time tN .\n\n2.1 Ordinary differential equations\n\nIn discrete-time sequential systems the state sequence z0, z1, . . . 
is indexed by a discrete variable k \u2208 Z, and the state progression is governed by a transition function on the change \u2206zk = zk \u2212 zk\u22121. Examples of such models are auto-regressive models, Markov chains, recurrent models and neural network layers.\nIn contrast, continuous-time sequential systems model the state function zt : T \u2192 Rd of a continuous, real-valued time variable t \u2208 T = R. The state evolution is governed by a first-order time derivative\n\n\u02d9zt := dzt/dt = h(zt), (1)\n\nthat drives the system state forward in infinitesimal steps over time. The differential h : Rd \u2192 Rd induces a differential field that covers the input space. Given an initial location vector z0 \u2208 Rd, the system then follows an ordinary differential equation (ODE) model with state solutions\n\nzT = z0 + \u222b_0^T h(zt) dt. (2)\n\nThe state solutions are in practice computed by solving this initial value problem with efficient numerical solvers, such as Runge-Kutta (Schober et al., 2019). Recently several works have proposed learning ODE systems h parametrised as neural networks (Chen et al., 2018b) or as Gaussian processes (Heinonen et al., 2018).\n\n2.2 Bayesian second-order ODEs\n\nFirst-order ODEs are incapable of modelling high-order dynamics1, such as acceleration or the motion of a pendulum. Furthermore, ODEs are deterministic systems unable to account for uncertainties in the dynamics. 
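The state solution in Eq. (2) is computed in practice by a numerical solver; below is a minimal numpy sketch of a fixed-step fourth-order Runge-Kutta integrator, with a toy linear field h(z) = \u2212z of our own choosing (not from the paper):

```python
import numpy as np

def rk4_solve(h, z0, T, n_steps=100):
    """Fixed-step fourth-order Runge-Kutta for dz/dt = h(z), z(0) = z0 (Eq. 2)."""
    z = np.asarray(z0, dtype=float)
    dt = T / n_steps
    for _ in range(n_steps):
        k1 = h(z)
        k2 = h(z + 0.5 * dt * k1)
        k3 = h(z + 0.5 * dt * k2)
        k4 = h(z + dt * k3)
        z = z + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return z

# Toy differential field h(z) = -z (our own example); the exact
# state solution of this initial value problem is z0 * exp(-T).
zT = rk4_solve(lambda z: -z, [1.0], T=2.0)
```

With 100 steps the numerical solution agrees closely with the exact exp(\u2212T) decay; the learned models simply replace h with a neural network (or a Gaussian process).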
We tackle both issues by introducing Bayesian neural second-order ODEs\n\n\u00a8zt := d\u00b2zt/dt\u00b2 = fW (zt, \u02d9zt), (3)\n\nwhich can be reduced to an equivalent system of two coupled first-order ODEs, \u02d9st = vt and \u02d9vt = fW (st, vt), with state solutions\n\n[sT ; vT ] = [s0; v0] + \u222b_0^T [vt; fW (st, vt)] dt =: [s0; v0] + \u222b_0^T \u02dcfW (zt) dt, (4)\n\nwhere (with a slight abuse of notation) the state tuple zt = (st, vt) decomposes into the state position st, which follows the state velocity (momentum) vt. The evolution of the velocity is governed by a neural network fW (st, vt) with a collection of weight parameters W = {W\u2113}\u2113=1,...,L over its L layers and the bias terms. We assume a prior p(W) on the weights resulting in a Bayesian neural network (BNN). Each weight sample, in turn, results in a deterministic ODE trajectory (see Fig. 1). The BNN acceleration field fW : Rd \u00d7 Rd \u2192 Rd depends on both state and velocity. For instance, in a pendulum system the acceleration \u00a8z depends on both its current location and velocity. The system is now driven forward from the starting position s0 and velocity v0, with the BNN determining only how the velocity vt evolves.\n\n1Time-dependent differential functions f (z, t) can indirectly approximate higher-order dynamics.\n\n2\n\n\fFigure 1: Illustration of dynamical systems. A continuous-time system underlying a discrete-time model (a) can be extended to a 2nd-order ODE with velocity component (b). A Bayesian ODE characterises uncertain differential dynamics (c), with the corresponding position-velocity phase diagram (d). 
The gray arrows in (d) indicate the BNN fW (st, vt) mean field wrt p(W).\n\n2.3 Second order ODE flow\n\nThe ODE systems are denoted as continuous normalizing flows when they are applied on random variables zt (Rezende et al., 2014; Chen et al., 2018a; Grathwohl et al., 2018). This allows following the progression of the state density through the ODE. Using the instantaneous change of variable theorem (Chen et al., 2018a), we obtain the instantaneous change of variable for our second order ODEs as\n\n\u2202 log q(zt|W)/\u2202t = \u2212Tr( d\u02dcfW (zt)/dzt ) = \u2212Tr( [\u2202vt/\u2202st, \u2202vt/\u2202vt; \u2202fW (st, vt)/\u2202st, \u2202fW (st, vt)/\u2202vt] ) = \u2212Tr( \u2202fW (st, vt)/\u2202vt ), (5)\n\nsince only the diagonal blocks contribute to the trace and \u2202vt/\u2202st = 0. This results in the log densities over time,\n\nlog q(zT |W) = log q(z0|W) \u2212 \u222b_0^T Tr( \u2202fW (st, vt)/\u2202vt ) dt. (6)\n\n3 ODE2VAE model\n\nIn this section we propose a novel dynamic VAE formalism for sequential data by introducing a second order Bayesian neural ODE model in the latent space to model the data dynamics. We start by reviewing the standard VAE models and then extend them to our ODE2VAE model.\nWith auto-encoders, we aim to learn latent representations z \u2208 Rd for complex observations x \u2208 RD parameterised by \u03b8, where often d \u226a D. The posterior p\u03b8(z|x) \u221d p\u03b8(x|z)p(z) is proportional to the prior p(z) of the latent variable and the decoding likelihood p\u03b8(x|z). 
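In one dimension the posterior proportionality p(z|x) \u221d p(x|z)p(z) can be normalized numerically on a grid, which makes it concrete why higher-dimensional cases need an approximation; a toy numpy sketch (the Gaussian prior/decoder and all values below are our own illustrative choices):

```python
import numpy as np

def gauss_logpdf(x, mean, var):
    """Log density of a univariate Gaussian N(mean, var) evaluated at x."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# 1-D toy: prior z ~ N(0, 1), decoder x|z ~ N(z, 0.1). On a grid the
# unnormalized posterior p(x|z)p(z) can be normalized by brute force;
# in high dimensions this integral is intractable, motivating q(z|x).
grid = np.linspace(-5.0, 5.0, 2001)
x_obs = 1.0
log_post = gauss_logpdf(x_obs, grid, 0.1) + gauss_logpdf(grid, 0.0, 1.0)
post = np.exp(log_post - log_post.max())
dz = grid[1] - grid[0]
post /= post.sum() * dz          # numerical normalization
post_mean = (grid * post).sum() * dz
```

For this conjugate Gaussian pair the exact posterior mean is x\u00b7(1/0.1)/(1/0.1 + 1) = 10/11, which the grid estimate recovers; no such brute-force normalization is available once z is high-dimensional.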
Parameters \u03b8 could be optimized by maximizing the marginal log likelihood but that generally involves intractable integrals. In variational auto-encoders (VAE) an amortized variational approximation q\u03c6(z|x) \u2248 p\u03b8(z|x) with parameters \u03c6 is used instead (Jordan et al., 1999; Kingma and Welling, 2013; Rezende et al., 2014). Variational inference that minimizes the Kullback-Leibler divergence, or equivalently maximizes the evidence lower bound (ELBO), results in efficient inference.\n\n3.1 Dynamic model\n\nBuilding upon the ideas from black-box ODEs and variational auto-encoders, we propose to infer continuous-time latent position and velocity trajectories that live in a much lower dimensional space but still match the data well (see Fig. 2 for illustration). For this, consider a generative model that consists of three components: (i) a distribution for the initial position p(s0) and velocity p(v0) in the latent space, (ii) true (unknown) dynamics defined by an acceleration field, and (iii) a decoding likelihood p(xi|si). The generative model is given in Eqs. 7\u201311:\n\ns0 \u223c p(s0) (7)\nv0 \u223c p(v0) (8)\nst = s0 + \u222b_0^t v\u03c4 d\u03c4 (9)\nvt = v0 + \u222b_0^t ftrue(s\u03c4 , v\u03c4 ) d\u03c4 (10)\nxi \u223c p(xi|si), i \u2208 [0, N ] (11)\n\nNote that the decoding likelihood is defined only from the position variable. Velocity thus serves as an auxiliary variable, driving the position forward.\n\n3\n\n\fFigure 2: A schematic illustration of ODE2VAE model. Position encoder (\u00b5s, \u03c3s) maps the first item x0 of a high-dimensional data sequence into a distribution of the initial position s0 in a latent space. Velocity encoder (\u00b5v, \u03c3v) maps the first m high-dimensional data items x0:m into a distribution of the initial velocity v0 in a latent space. 
Probabilistic latent dynamics are implemented by a second order ODE model \u02dcfW parameterised by a Bayesian deep neural network (W). Data points in the original data domain are reconstructed by a decoder.\n\n3.2 Variational inference\n\nAs with standard auto-encoders, optimization of the ODE2VAE model parameters with respect to the marginal likelihood would result in intractability and thus we resort to variational inference (see Fig. 2). We first combine the latent position and velocity components into a single vector zt := (st, vt) for notational clarity, and assume the following factorized variational approximation for the unobserved quantities: q(W, z0:N |x0:N ) = q(W)qenc(z0|x0:N )qode(z1:N |x0:N , z0, W). As described in subsection 2.2, the true dynamics are approximated by a BNN parameterized by W with the following variational approximation: q(W) = N (W|m, sI). We use an amortized variational approximation for the latent initial position and velocity,\n\nqenc(z0|x0:N ) = qenc( (s0; v0) | x0:N ) = N( (\u00b5s(x0); \u00b5v(x0:m)), [diag(\u03c3s(x0)), 0; 0, diag(\u03c3v(x0:m))] ), (12)\n\nwhere \u00b5s, \u00b5v, \u03c3s, \u03c3v are encoding neural networks. The encoder for the initial position depends solely on the first item in the data sequence x0, whereas the encoder for the initial velocity depends on multiple data points x0:m, where m \u2264 N is the amortized inference length. We use neural network encoders and decoders whose architectures depend on the application (see the supplementary document for details). The variational approximation for the latent dynamics qode(z1:N |x0:N , z0, W) is defined implicitly via the instantaneous change of variable for the second order ODEs shown in Eq. 5. 
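The density evolution in Eq. (5) needs only the trace of the Jacobian of the acceleration with respect to the velocity. Below is a sketch that integrates the coupled position-velocity system together with the log-density change, estimating the trace by finite differences; the linear acceleration field is our own toy example, not the paper's BNN:

```python
import numpy as np

def second_order_flow_logdet(f, s0, v0, T, n_steps=200):
    """Jointly integrate (s, v, log-density change) for the system
    s' = v, v' = f(s, v); by Eq. (5) the log-density of z = (s, v)
    changes at rate -Tr(df/dv), estimated here by central differences."""
    s, v = s0.astype(float), v0.astype(float)
    delta_logq = 0.0
    dt = T / n_steps
    eps = 1e-5
    d = v.size
    for _ in range(n_steps):
        # trace of the Jacobian of f with respect to the velocity component
        tr = 0.0
        for j in range(d):
            e = np.zeros(d); e[j] = eps
            tr += (f(s, v + e)[j] - f(s, v - e)[j]) / (2 * eps)
        delta_logq -= tr * dt
        # Euler step of the coupled first-order system (Eq. 4)
        s, v = s + dt * v, v + dt * f(s, v)
    return s, v, delta_logq

# Toy acceleration field (our own choice): f(s, v) = -s - 0.5*v,
# for which Tr(df/dv) = -0.5 and the density change is +0.5*T exactly.
s_out, v_out, dlq = second_order_flow_logdet(lambda s, v: -s - 0.5 * v,
                                             np.array([1.0]), np.array([0.0]), T=2.0)
```

For this damped linear field the trace is constantly \u22120.5, so the log-density increases by exactly 0.5\u00b7T = 1.0 over the horizon, which the sketch reproduces.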
The initial density is given by the encoder qenc(z0|x0:N ), and the density at later time points can be solved by numerical integration using Eq. 6. Note that we treat the entire latent trajectory evaluated at the observed time points, Z \u2261 z0:N , as a latent variable, and the latent trajectory samples z1:N are solved conditioned on the ODE initial values z0 and the BNN parameter values W. Finally, the evidence lower bound (ELBO) becomes as follows (for brevity we define X \u2261 x0:N ):\n\nlog p(X) \u2265 \u2212KL[q(W, Z|X)||p(W, Z)] + Eq(W,Z|X)[log p(X|W, Z)] =: ELBO (13)\n= Eq(W,Z|X)[ \u2212log( q(W)q(Z|W, X) / (p(W)p(Z)) ) + log p(X|W, Z) ] (14)\n= \u2212KL[q(W)||p(W)] + Eq(W,Z|X)[ \u2212log( q(Z|W, X) / p(Z) ) + log p(X|W, Z) ] (15)\n= \u2212KL[q(W)||p(W)] + Eqenc(z0|X)[ \u2212log( qenc(z0|X) / p(z0) ) + log p(x0|z0) ] + \u2211i=1..N Eqode(W,zi|X,z0)[ \u2212log( qode(zi|W, X) / p(zi) ) + log p(xi|zi) ], (16)\n\nwhere the three terms act as an ODE regularization, a VAE loss and a dynamic loss, respectively, and the prior distribution p(W, z0) is a standard Gaussian. The prior density follows Eq. 6 with fW replaced by the unknown ftrue, which causes p(zt), t > 1, to be intractable.2 Thus, we resort to a simplifying assumption and place a standard regularizing Gaussian prior over z1:N .\nWe now examine each term in Eq. 16. The first term is the BNN weight penalty, which helps avoid overfitting. The second term is the standard VAE bound, meaning that the VAE is retrieved for sequences of length 1. 
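The first term of Eq. (16) is available in closed form for a mean-field Gaussian q(W) = N (W|m, sI) against the standard Gaussian prior p(W); a numpy sketch (interpreting s as a per-weight standard deviation, which is our own reading of the parameterization):

```python
import numpy as np

def kl_gauss_std_normal(m, s):
    """KL[ N(m, diag(s^2)) || N(0, I) ], summed over all weight dimensions."""
    m = np.asarray(m, dtype=float)
    s = np.asarray(s, dtype=float)
    return 0.5 * np.sum(m ** 2 + s ** 2 - np.log(s ** 2) - 1.0)

# The divergence vanishes exactly when the posterior equals the prior,
# and grows as the posterior mean or scale moves away from it.
kl_zero = kl_gauss_std_normal(np.zeros(4), np.ones(4))
kl_shifted = kl_gauss_std_normal(np.array([1.0, 0.0]), np.array([1.0, 1.0]))
```

Because this term is analytic, only the reconstruction and dynamic terms of the bound need Monte Carlo estimation.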
The only (but major) difference between the second and the third terms is that the expectation is computed with respect to the variational distribution induced by the second order ODE. Finally, we optimize the Monte Carlo estimate of Eq. 16 with respect to the variational posterior parameters {m, s} and the encoder and decoder parameters, and also make use of the reparameterization trick to tackle uncertainties in both the initial latent states and in the acceleration dynamics (Kingma and Welling, 2013).\n\n3.3 Penalized variational loss function\n\nA well-known pitfall of VAE models is that optimizing the ELBO objective does not necessarily result in accurate inference (Alemi et al., 2017). Several recipes have already been proposed to counteract the imbalance between the KL term and the reconstruction likelihood (Zhao et al., 2017; Higgins et al., 2017). In this work, we borrow the ideas from Higgins et al. (2017) and weight the KL[q(W)||p(W)] term resulting from the BNN with a constant factor \u03b2. We choose to fix \u03b2 to the ratio between the latent space dimensionality and the number of weight parameters, \u03b2 = |q|/|W|, in order to counter-balance the penalties on the latent variables W and zi.\nOur variational model utilizes encoders only for obtaining the initial latent distribution. In the case of long input sequences, the dynamic loss term can easily dominate the VAE loss, which may cause the encoders to underfit. The underfitting may also occur in small data regimes or when the distribution of initial data points differs from the data distribution. In order to tackle this, we propose to minimize the distance between the encoder distribution and the distribution induced by the ODE flow (Eqs. 12 and 6). 
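The weighting \u03b2 = |q|/|W| above only requires counting latent dimensions and BNN parameters; a small sketch with hypothetical sizes (a 6-dimensional position plus a 6-dimensional velocity, and a single-hidden-layer acceleration network, are our own choices for illustration):

```python
def bnn_weight_count(layer_sizes):
    """Number of weights plus biases in an MLP with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# Hypothetical sizes: latent state z = (s, v) with d = 6 gives |q| = 12
# latent dimensions; the acceleration network maps 12 -> 50 -> 6.
latent_dim = 2 * 6
n_weights = bnn_weight_count([12, 50, 6])
beta = latent_dim / n_weights
```

With |W| in the hundreds or thousands and |q| in the tens, \u03b2 is small, so the weight penalty no longer overwhelms the latent-variable terms.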
At the end, we obtain an alternative, penalized target function, which we call ODE2VAE-KL:\n\nLODE2VAE = \u2212\u03b2 KL[q(W)||p(W)] + Eq(W,Z|X)[ \u2212log( q(Z|W, X) / p(Z) ) + log p(X|W, Z) ] \u2212 \u03b3 Eq(W)[ KL[qode(Z|W, X)||qenc(Z|X)] ]. (17)\n\nWe choose the constant \u03b3 by cross-validation. In practice, we found that an annealing scheme in which \u03b3 is gradually increased helps optimization, which is also used in (Karl et al., 2016; Rezende and Mohamed, 2015).\n\n3.4 Related work\n\nDespite the recent VAE and GAN breakthroughs, little attention has been paid to deep generative architectures for sequential data. Existing VAE-based sequential models rely heavily on RNN encoders and decoders (Chung et al., 2015; Serban et al., 2017), with little interest in stochastic models (Fraccaro et al., 2016). Some research has been carried out to approximate latent dynamics by LSTMs (Lotter et al., 2016; Hsu et al., 2017; Li and Mandt, 2018), which results in observations being included in the latent transition process. Consequently, the inferred latent space and dynamics do not fully reflect the observed phenomena and usually fail to produce decent long term predictions (Karl et al., 2016). In addition, RNNs are shown to be incapable of accurately modeling nonuniformly sampled sequences (Chen et al., 2018b), despite the recent efforts that incorporate time information in RNN architectures (Li et al., 2017; Xiao et al., 2018).\nRecently, neural ODEs introduced learning ODE systems with neural network architectures, and applied them in the VAE latent space as well for simple cases (Chen et al., 2018b). In Gaussian process prior VAE, a GP prior is placed in the latent space over a sequential index (Casale et al., 2018). 
To the best of our knowledge, there is no work connecting second order ODEs and Bayesian neural networks with VAE models.\n\n2Although our variational approximation model assumes deterministic second-order dynamics, the underlying true model may also have more complex or stochastic dynamics.\n\n5\n\n\fTable 1: Comparison of VAE-based models\n\nMethod | Reference | Stochastic state | Stochastic dynamics | Continuous-time | Higher order dynamics\nVAE | Kingma and Welling (2013) | \u2713 | \u2717 | \u2717 | \u2717\nVRNN | Chung et al. (2015) | \u2713 | \u2717 | \u2717 | \u2717\nSRNN | Fraccaro et al. (2016) | \u2713 | \u2713 | \u2717 | \u2717\nGPPVAE | Casale et al. (2018) | \u2713 | \u2717 | \u2713\u2217 | \u2717\nDSAE | Li and Mandt (2018) | \u2713 | \u2713 | \u2717 | \u2717\nNeural ODE | Chen et al. (2018b) | \u2713 | \u2717 | \u2713 | \u2717\nODE2VAE | current work | \u2713 | \u2713 | \u2713 | \u2713\n\u2217 GPPVAE uses a latent GP prior but only a discrete case was demonstrated in Casale et al. (2018).\n\n4 Experiments\n\nWe illustrate the performance of our model on three different datasets: human motion capture (see the acknowledgements), rotating MNIST (Casale et al., 2018) and bouncing balls (Sutskever et al., 2009). Our goal is twofold: First, given a walking or bouncing balls sequence, we aim to predict the future sensor readings and frames. Second, we would like to interpolate an unseen rotation angle from a sequence of rotating digits. The competing techniques are specified in each section. For all methods, we have directly applied the public implementations provided by the authors. Also, we have tried several values for the hyper-parameters with the same rigor and we report the best results. To numerically compare the models, we sample 50 predictions per test sequence and report the mean and standard deviation of the mean squared error (MSE) over future frames. 
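The evaluation protocol described above can be sketched as follows (the array shapes and the synthetic trajectories are our own assumptions for illustration):

```python
import numpy as np

def sampled_mse(pred_samples, target):
    """pred_samples: (n_samples, n_frames, dim) sampled future trajectories,
    target: (n_frames, dim) ground truth. Returns the mean and standard
    deviation, over samples, of the MSE averaged across future frames."""
    per_sample = ((pred_samples - target[None]) ** 2).mean(axis=(1, 2))
    return per_sample.mean(), per_sample.std()

# Synthetic check: 50 noisy "predictions" around a zero target trajectory.
rng = np.random.default_rng(0)
target = np.zeros((10, 3))
samples = rng.normal(0.0, 0.1, size=(50, 10, 3))
mean_mse, std_mse = sampled_mse(samples, target)
```

Each of the 50 sampled trajectories yields one MSE averaged over future frames; the reported numbers are the mean and standard deviation of these 50 values.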
We include the mean MSE of mean predictions (instead of trajectory samples) in the supplementary.\nWe implement our model in TensorFlow (Abadi et al., 2016). The encoder, differential function and decoder parameters are jointly optimized with the Adam optimizer (Kingma and Ba, 2014) with learning rate 0.001. We use TensorFlow's own odeint_fixed function, which implements the fourth order Runge-Kutta method, for solving the ODE systems on a time grid that is five times denser than the observed time points. Neural network hyperparameters, chosen by cross-validation, are detailed in the supplementary material. We also include ablation studies with deterministic NNs and first order dynamics in the appendix.\n\n4.1 CMU walking data\n\nTo demonstrate that our model can capture arbitrary dynamics from noisy observations, we experiment on two datasets extracted from the CMU motion capture library. First, we use the dataset in Heinonen et al. (2018), which consists of 43 walking sequences of several subjects, each of which is fitted separately. The first two-thirds of each sequence is reserved for training and validation, and the rest is used for testing. The second dataset consists of 23 walking sequences of subject 35 (Gan et al., 2015), which is partitioned into 16 training, 3 validation and 4 test sequences. We followed the preprocessing described in Wang et al. (2008), after which we were left with 50 dimensional joint angle measurements.\nWe compare our ODE2VAE against a GP-based state space model GPDM (Wang et al., 2008), a dynamic model with latent GP interpolation VGPLVM (Damianou et al., 2011), two black-box ODE solvers npODE (Heinonen et al., 2018) and neural ODEs (Chen et al., 2018b), as well as an RNN-based deep generative model DTSBN-S (Gan et al., 2015). In test mode, we input the first three frames and the models predict future observations. 
GPDM and VGPLVM are not applied to the second dataset since GPDM optimizes its latent space for input trajectories and hence does not allow simulating dynamics from any random point, and the VGPLVM implementation does not support multiple input sequences.\nThe results are presented in Table 2. First, we reproduce the results in Heinonen et al. (2018) by obtaining the same ranking among GPDM, VGPLVM and npODE. Next, we see that DTSBN-S is not able to predict the distant future accurately, which is a well-known problem with RNNs. As expected, all models attain smaller test errors on the second, bigger dataset. We observe that neural ODE usually fits the training data perfectly but fails to extrapolate on the first dataset. This overfitting problem is not surprising considering the fact that only the ODE initial value distribution is penalized. On the contrary, our ODE2VAE regularizes its entire latent trajectory and also samples from the acceleration field, both of which help tackle the overfitting problem. We demonstrate latent state trajectory samples and reconstructions from our model in the supplementary.\n\n6\n\n\fTable 2: Average MSE on future frames\n\nModel | Mocap-1 | Mocap-2 | Reference\nGPDM | 126.46 \u00b1 34 | N/A | Wang et al. (2008)\nVGPLVM | 142.18 \u00b1 1.92 | N/A | Damianou et al. (2011)\nDTSBN-S | 80.21 \u00b1 0.04 | 22.96 | Gan et al. (2015)\nNPODE | 45.74 | 34.86 \u00b1 0.02 | Heinonen et al. (2018)\nNEURALODE | 87.23 \u00b1 0.02 | 22.49 \u00b1 0.88 | Chen et al. (2018b)\nODE2VAE | 93.07 \u00b1 0.72 | 10.06 \u00b1 1.4 | current work\nODE2VAE-KL | 15.99 \u00b1 4.16 | 8.09 \u00b1 1.95 | current work\n\n4.2 Rotating MNIST\n\nNext, we contrast our ODE2VAE against the recently proposed Gaussian process prior VAE (GPPVAE) (Casale et al., 2018), which replaces the commonly iid Gaussian prior with a GP and thus performs latent regression. We repeat the experiment in Casale et al. 
(2018) by constructing a dataset by rotating the images of handwritten \u201c3\u201d digits. We consider the same number of rotation angles (16), the same numbers of training and validation sequences (360 and 40), and leave the same rotation angle out for testing (see the first row of Figure 4b for the test angle). In addition, four rotation angles are randomly removed from each rotation sequence to introduce non-uniform sequences and missing data (an example training sequence is visualized in the first row of Figure 4a).\nTest errors on the unseen rotation angle are given in Table 3. During test time, GPPVAE encodes and decodes the images from the test angle, and the reconstruction error is reported. On the other hand, ODE2VAE only encodes the first image in a given sequence, performs latent ODE integration starting from the encoded point, and decodes at the given time points, without seeing the test image even in test mode. In that sense, our model is capable of generating images with arbitrary rotation angles. Also note that both models make use of the angle/time information in training and test mode. An example input sequence with missing values and the corresponding reconstructions are illustrated in Figure 4a, where we see that ODE2VAE nicely fills in the gaps. Also, Figure 4b demonstrates that our model is capable of accurately learning and rotating different handwriting styles.\n\nTable 3: Average prediction errors on test angle\n\nModel | Test error\nGPPVAE-DIS | 0.0309 \u00b1 0.00002\nGPPVAE-JOINT | 0.0288 \u00b1 0.00005\nODE2VAE | 0.0194 \u00b1 0.00006\nODE2VAE-KL | 0.0188 \u00b1 0.0003\n\n4.3 Bouncing balls\n\nFigure 3: Bouncing balls errors.\n\nAs a third showcase, we test our model on the bouncing balls dataset, a standard benchmark used in the generative temporal modeling literature (Gan et al., 2015; Hsieh et al., 2018; Lotter et al., 2015). 
The dataset consists of video frames of three balls bouncing within a rectangular box and also colliding with each other. The exact locations of the balls as well as the physical interaction rules are to be inferred from the observed sequences. We make no prior assumption on visual aspects such as ball count, mass, shape or on the underlying physical dynamics.\n\n7\n\n\fFigure 4: Panel (a) shows a training sequence with missing values (first row) and its reconstruction (second row). The first row in panel (b) demonstrates test angles from different sequences, i.e., handwriting styles, and below are model predictions.\n\nWe have generated a training set of 10000 sequences of length 20 frames and a test set of 500 sequences using the implementation provided with Sutskever et al. (2009). Each frame is 32x32x1 and pixel values vary between 0 and 1. We compare our method against DTSBN-S (Gan et al., 2015) and the decompositional disentangled predictive auto-encoder (DDPAE) (Hsieh et al., 2018), both of which conduct experiments on the same dataset. In test mode, the first three frames of an input sequence are given as input and the per pixel MSE on the following 10 frames is computed. We believe that measuring longer forecast errors is more informative about the inference of physical phenomena than reporting the one-step-ahead prediction error, which is predominantly used in the current literature (Gan et al., 2015; Lotter et al., 2015).\nPredictive errors and example reconstructions are visualized in Figures 3 and 5. The RNN-based DTSBN-S nicely extrapolates a few frames but quickly loses track of ball locations and the error escalates. DDPAE achieves a much smaller error over time; however, we empirically observed that the reconstructed images are usually imperfect (here, generated balls are bigger than the originals), and also that the model sometimes fails to simulate ball collisions as in Figure 5. 
Our ODE2VAE generates long and accurate forecasts and significantly improves the current state-of-the-art by almost halving the error. We empirically found that a CNN encoder that takes channel-stacked frames as input yields a smaller prediction error than an RNN encoder. We leave the investigation of better encoder architectures as interesting future work.\n\n5 Discussion\n\nWe have presented an extension to VAEs for continuous-time dynamic modelling. We decompose the latent space into position and velocity components, and introduce a powerful neural second order differential equation system. As shown empirically, our variational inference framework results in a Bayesian neural network that helps tackle the overfitting problem. We achieve state-of-the-art performance in long-term forecasting and imputation of high-dimensional image sequences.\nThere are several directions in which our work can be extended. Considering divergences other than the KL would lead to Wasserstein auto-encoder formulations (Tolstikhin et al., 2017). The latent ODE flow can be replaced by a stochastic flow, which would result in an even more robust model. The proposed second order flow can also be combined with generative adversarial networks to produce real-looking videos.\n\nFigure 5: An example test sequence from the bouncing balls experiment. The top row is the original sequence. Each model takes the first three frames as input and predicts the subsequent frames.\n\n8\n\n\fAcknowledgements.\n\nThe data used in this project was obtained from mocap.cs.cmu.edu. The database was created with funding from NSF EIA-0196217. The calculations presented above were performed using computer resources within the Aalto University School of Science \u201cScience-IT\u201d project. This work has been supported by the Academy of Finland grants no. 
311584 and 313271.\n\nReferences\nMart\u00b4\u0131n Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin,\nSanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensor\ufb02ow: A system for large-scale\nmachine learning. In OSDI, volume 16, pages 265\u2013283, 2016.\n\nAlexander A Alemi, Ben Poole, Ian Fischer, Joshua V Dillon, Rif A Saurous, and Kevin Murphy.\n\nFixing a broken elbo. arXiv preprint arXiv:1711.00464, 2017.\n\nFrancesco Paolo Casale, Adrian Dalca, Luca Saglietti, Jennifer Listgarten, and Nicolo Fusi. Gaussian\nprocess prior variational autoencoders. In Advances in Neural Information Processing Systems,\npages 10369\u201310380, 2018.\n\nC. Chen, C. Li, L. Chen, W. Wang, Y. Pu, and L Carin. Continuous-time \ufb02ows for ef\ufb01cient inference\n\nand density estimation. ICML, 2018a.\n\nTian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary dif-\nferential equations. In Advances in Neural Information Processing Systems, pages 6571\u20136583,\n2018b.\n\nJunyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio.\nA recurrent latent variable model for sequential data. In Advances in neural information processing\nsystems, pages 2980\u20132988, 2015.\n\nAndreas Damianou, Michalis K Titsias, and Neil D Lawrence. Variational gaussian process dynamical\n\nsystems. In Advances in Neural Information Processing Systems, pages 2510\u20132518, 2011.\n\nMarco Fraccaro, Soren Kaae Sonderby, Ulrich Paquet, and Ole Winther. Sequential neural models\nwith stochastic layers. In Advances in neural information processing systems, pages 2199\u20132207,\n2016.\n\nZhe Gan, Chunyuan Li, Ricardo Henao, David E Carlson, and Lawrence Carin. Deep temporal\nsigmoid belief networks for sequence modeling. In Advances in Neural Information Processing\nSystems, pages 2467\u20132475, 2015.\n\nWill Grathwohl, Ricky TQ Chen, Jesse Betterncourt, Ilya Sutskever, and David Duvenaud. 
FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.

Markus Heinonen, Cagatay Yildiz, Henrik Mannerström, Jukka Intosalmi, and Harri Lähdesmäki. Learning unknown ODE models with Gaussian processes. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1959–1968, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/heinonen18a.html.

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, volume 3, 2017.

Jun-Ting Hsieh, Bingbin Liu, De-An Huang, Li F Fei-Fei, and Juan Carlos Niebles. Learning to decompose and disentangle representations for video prediction. In Advances in Neural Information Processing Systems, pages 517–526, 2018.

Wei-Ning Hsu, Yu Zhang, and James Glass. Unsupervised learning of disentangled and interpretable representations from sequential data. In Advances in Neural Information Processing Systems, pages 1878–1889, 2017.

Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick van der Smagt. Deep variational Bayes filters: Unsupervised learning of state space models from raw data. arXiv preprint arXiv:1605.06432, 2016.

Diederik P Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In Proc. 3rd Int. Conf. Learn. Representations, 2014.

Diederik P Kingma and Max Welling.
Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Yang Li, Nan Du, and Samy Bengio. Time-dependent representation for neural event sequence prediction. arXiv preprint arXiv:1708.00065, 2017.

Yingzhen Li and Stephan Mandt. Disentangled sequential autoencoder. arXiv preprint arXiv:1803.02991, 2018.

William Lotter, Gabriel Kreiman, and David Cox. Unsupervised learning of visual structure using predictive generative networks. arXiv preprint arXiv:1511.06380, 2015.

William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104, 2016.

Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

Michael Schober, Simo Särkkä, and Philipp Hennig. A probabilistic model for the numerical solution of initial value problems. Statistics and Computing, 29(1):99–122, 2019.

Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. arXiv preprint arXiv:1502.04681, 2015.

Ilya Sutskever, Geoffrey E Hinton, and Graham W Taylor. The recurrent temporal restricted Boltzmann machine. In Advances in Neural Information Processing Systems, pages 1601–1608, 2009.

Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf.
Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.

Michael Tschannen, Olivier Bachem, and Mario Lucic. Recent advances in autoencoder-based representation learning. arXiv preprint arXiv:1812.05069, 2018.

Jack M Wang, David J Fleet, and Aaron Hertzmann. Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):283–298, 2008.

Shuai Xiao, Hongteng Xu, Junchi Yan, Mehrdad Farajtabar, Xiaokang Yang, Le Song, and Hongyuan Zha. Learning conditional generative models for temporal point processes. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Shengjia Zhao, Jiaming Song, and Stefano Ermon. InfoVAE: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262, 2017.