{"title": "Latent Ordinary Differential Equations for Irregularly-Sampled Time Series", "book": "Advances in Neural Information Processing Systems", "page_first": 5320, "page_last": 5330, "abstract": "Time series with non-uniform intervals occur in many applications, and are difficult to model using standard recurrent neural networks (RNNs). We generalize RNNs to have continuous-time hidden dynamics defined by ordinary differential equations (ODEs), a model we call ODE-RNNs. Furthermore, we use ODE-RNNs to replace the recognition network of the recently-proposed Latent ODE model. Both ODE-RNNs and Latent ODEs can naturally handle arbitrary time gaps between observations, and can explicitly model the probability of observation times using Poisson processes. We show experimentally that these ODE-based models outperform their RNN-based counterparts on irregularly-sampled data.", "full_text": "Latent ODEs for Irregularly-Sampled Time Series\n\nYulia Rubanova, Ricky T. Q. Chen, David Duvenaud\n\nUniversity of Toronto and the Vector Institute\n\n{rubanova, rtqichen, duvenaud}@cs.toronto.edu\n\nAbstract\n\nTime series with non-uniform intervals occur in many applications, and are dif-\n\ufb01cult to model using standard recurrent neural networks (RNNs). We generalize\nRNNs to have continuous-time hidden dynamics de\ufb01ned by ordinary differential\nequations (ODEs), a model we call ODE-RNNs. Furthermore, we use ODE-RNNs\nto replace the recognition network of the recently-proposed Latent ODE model.\nBoth ODE-RNNs and Latent ODEs can naturally handle arbitrary time gaps be-\ntween observations, and can explicitly model the probability of observation times\nusing Poisson processes. We show experimentally that these ODE-based models\noutperform their RNN-based counterparts on irregularly-sampled data.\n\n1\n\nIntroduction\n\nRecurrent neural networks (RNNs) are the dominant\nmodel class for high-dimensional, regularly-sampled time\nseries data, such as text or speech. 
However, they are an awkward fit for irregularly-sampled time series data, common in medical or business settings. A standard trick for applying RNNs to irregular time series is to divide the timeline into equally-sized intervals, and impute or aggregate observations using averages. Such preprocessing destroys information, particularly about the timing of measurements, which can be informative about latent variables [Lipton et al., 2016, Che et al., 2018].

An approach which better matches reality is to construct a continuous-time model with a latent state defined at all times. Recently, steps have been taken in this direction, defining RNNs with continuous dynamics given by a simple exponential decay between observations [Che et al., 2018, Cao et al., 2018, Rajkomar et al., 2018, Mei and Eisner, 2017].

We generalize state transitions in RNNs to continuous-time dynamics specified by a neural network, as in Neural ODEs [Chen et al., 2018]. We call this model the ODE-RNN, and use it to construct two distinct continuous-time models. First, we use it as a standalone autoregressive model. Second, we refine the Latent ODE model of Chen et al. [2018] by using the ODE-RNN as a recognition network. Latent ODEs define a generative process over time series based on the deterministic evolution of an initial latent state, and can be trained as a variational autoencoder [Kingma and Welling, 2013]. Both models

Figure 1: Hidden state trajectories for the Standard RNN, RNN-Decay, Neural ODE, and ODE-RNN models. Vertical lines show observation times. Lines show different dimensions of the hidden state. Standard RNNs have constant or undefined hidden states between observations. The RNN-Decay model has states which exponentially decay towards zero, and are updated at observations. 
States of the Neural ODE follow a complex trajectory but are determined by the initial state. The ODE-RNN model has states which obey an ODE between observations, and are also updated at observations.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

naturally handle time gaps between observations, and remove the need to group observations into equally-timed bins. We compare ODE models to several RNN variants and find that ODE-RNNs can perform better when the data is sparse. Since the absence of observations itself can be informative, we further augment Latent ODEs to jointly model the times of observations using a Poisson process.

2 Background

Recurrent neural networks  A simple way to handle irregularly-timed samples is to include the time gap between observations ∆t = ti − ti−1 in the update function of the RNN:

hi = RNNCell(hi−1, ∆t, xi)    (1)

However, this approach raises the question of how to define the hidden state h between observations. A simple alternative introduces an exponential decay of the hidden state towards zero when no observations are made [Che et al., 2018, Cao et al., 2018, Rajkomar et al., 2018, Mozer et al., 2017]:

hi = RNNCell(hi−1 · exp{−τ∆t}, xi)    (2)

where τ is a decay rate parameter. However, Mozer et al. [2017] found that empirically, exponential-decay dynamics did not improve predictive performance over standard RNN approaches.

Neural Ordinary Differential Equations  Neural ODEs [Chen et al., 2018] are a family of continuous-time models which define a hidden state h(t) as the solution to an ODE initial-value problem:

dh(t)/dt = fθ(h(t), t)  where  h(t0) = h0    (3)

in which the function fθ specifies the dynamics of the hidden state, using a neural network with parameters θ. 
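As a concrete illustration, here is a minimal sketch of such an ODE-defined hidden state, using SciPy's solve_ivp as a black-box solver and a small randomly-initialized tanh network standing in for a trained fθ (all names, sizes, and tolerances here are illustrative, not the authors' code):

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)
D = 4                                          # hidden-state dimension
W1, b1 = rng.normal(0, 0.5, (16, D)), np.zeros(16)
W2, b2 = rng.normal(0, 0.5, (D, 16)), np.zeros(D)

def f_theta(t, h):
    """Time-invariant dynamics dh/dt = f_theta(h): a small tanh MLP."""
    return W2 @ np.tanh(W1 @ h + b1) + b2

# h(t) is defined at all times; query it at arbitrary irregular times
# with a single solver call.
h0 = rng.normal(size=D)
ts = np.array([0.0, 0.13, 0.5, 1.7, 3.0])      # irregular query times
sol = solve_ivp(f_theta, (ts[0], ts[-1]), h0, t_eval=ts, rtol=1e-6, atol=1e-8)
H = sol.y.T                                    # shape (len(ts), D): h(t0)..h(tN)
```

Because the dynamics are deterministic given h0, refining or changing the query times `ts` does not change the underlying trajectory, only where it is read out.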
The hidden state h(t) is defined at all times, and can be evaluated at any desired times using a numerical ODE solver:

h0, . . . , hN = ODESolve(fθ, h0, (t0, . . . , tN))    (4)

Chen et al. [2018] used the adjoint sensitivity method [Pontryagin et al., 1962] to compute memory-efficient gradients w.r.t. θ for training ODE-based deep learning models using black-box ODE solvers. They also conducted toy experiments in a time-series model in which the latent state follows a Neural ODE. Chen et al. [2018] used time-invariant dynamics in their time-series model: dh(t)/dt = fθ(h(t)), and we follow the same approach, but adding time-dependence would be straightforward if necessary.

3 Method

In this section, we use Neural ODEs to define two distinct families of continuous-time models: the autoregressive ODE-RNN, and the variational-autoencoder-based Latent ODE.

3.1 Constructing an ODE-RNN Hybrid

Following Mozer et al. [2017], we note that an RNN with exponentially-decayed hidden state implicitly obeys the ODE dh(t)/dt = −τh with h(t0) = h0, where τ is a parameter of the model. The solution to this ODE is the pre-update term h0 · exp{−τ∆t} in (2). This differential equation is time-invariant, and assumes that the stationary point (i.e. the zero-valued state) is special. We can generalize this approach and model the hidden state using a Neural ODE. The resulting algorithm is given in Algorithm 1. We define the state between observations to be the solution to an ODE: h′i = ODESolve(fθ, hi−1, (ti−1, ti)), and then at each observation, update the hidden state using a standard RNN update hi = RNNCell(h′i, xi). 
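A minimal sketch of this hybrid update, with fixed-step Euler integration standing in for ODESolve and a plain tanh cell standing in for RNNCell (both are illustrative simplifications; the paper's experiments use adaptive solvers and GRU cells):

```python
import numpy as np

rng = np.random.default_rng(1)
D, X = 6, 3                                    # hidden and input dimensions
Wf = rng.normal(0, 0.3, (D, D))                # stand-in dynamics parameters
Wh, Wx = rng.normal(0, 0.3, (D, D)), rng.normal(0, 0.3, (D, X))

def ode_solve(h, t0, t1, n_steps=20):
    """Euler integration of dh/dt = tanh(Wf h) from t0 to t1."""
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        h = h + dt * np.tanh(Wf @ h)
    return h

def rnn_cell(h, x):
    return np.tanh(Wh @ h + Wx @ x)

def ode_rnn(xs, ts):
    """Evolve h by an ODE between observations, then update it with the
    RNN cell at each observation (the loop of Algorithm 1)."""
    h, t_prev, hs = np.zeros(D), ts[0], []
    for x, t in zip(xs, ts):
        h = ode_solve(h, t_prev, t)            # state just before observation
        h = rnn_cell(h, x)                     # condition on observation x
        hs.append(h)
        t_prev = t
    return np.stack(hs)

ts = np.array([0.0, 0.2, 1.5, 1.6, 4.0])       # irregular observation times
xs = rng.normal(size=(len(ts), X))
hs = ode_rnn(xs, ts)                           # one hidden state per observation
```

Note that the time gaps enter only through the integration interval passed to the solver, matching the description above.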
Our model does not explicitly depend on t or ∆t when updating the hidden state, but does depend on time implicitly through the resulting dynamical system. Compared to RNNs with exponential decay, our approach allows a more flexible parameterization of the dynamics. A comparison between the state dynamics of these models is given in Table 2.

Autoregressive Modeling with the ODE-RNN  The ODE-RNN can straightforwardly be used to probabilistically model sequences. Consider a series of observations {xi}i=0..N at times {ti}i=0..N. Autoregressive models make a one-step-ahead prediction conditioned on the history of observations, i.e. they factor the joint density p(x) = ∏i pθ(xi|xi−1, . . . , x0). As in standard RNNs, we can use an ODE-RNN to specify the conditional distributions pθ(xi|xi−1 . . . x0) (Algorithm 1).

Algorithm 1 The ODE-RNN. The only difference from standard RNNs, highlighted in blue, is that the pre-activations h′ evolve according to an ODE between observations, instead of being fixed.

Input: Data points and their timestamps {(xi, ti)}i=1..N
h0 = 0
for i in 1, 2, . . . , N do
    h′i = ODESolve(fθ, hi−1, (ti−1, ti))    ▷ Solve ODE to get state at ti
    hi = RNNCell(h′i, xi)    ▷ Update hidden state given current observation xi
end for
oi = OutputNN(hi) for all i = 1..N
Return: {oi}i=1..N; hN

3.2 Latent ODEs: a Latent-variable Construction

Autoregressive models such as RNNs and the ODE-RNN presented above are easy to train and allow fast online predictions. However, autoregressive models can be hard to interpret, since their update function combines both their model of the system dynamics and their conditioning on new observations. Furthermore, their hidden state does not explicitly encode uncertainty about the state of the true system. 
In terms of predictive accuracy, autoregressive models are often sufficient for densely sampled data, but perform worse when observations are sparse.

An alternative to autoregressive models are latent-variable models. For example, Chen et al. [2018] proposed a latent-variable time series model, where the generative model is defined by an ODE whose initial latent state z0 determines the entire trajectory:

z0 ∼ p(z0)    (5)

z0, z1, . . . , zN = ODESolve(fθ, z0, (t0, t1, . . . , tN))    (6)

each xi indep. ∼ p(xi|zi),  i = 0, 1, . . . , N    (7)

We follow Chen et al. [2018] in using a variational autoencoder framework for both training and prediction. This requires estimating the approximate posterior q(z0|{xi, ti}i=0..N). Inference and prediction in this model is effectively an encoder-decoder or sequence-to-sequence architecture, in which a variable-length sequence is encoded into a fixed-dimensional embedding, which is then decoded into another variable-length sequence, as in Sutskever et al. [2014].

Table 1: Different encoder-decoder architectures.

Encoder-decoder models     Encoder    Decoder
Latent ODE (ODE enc.)      ODE-RNN    ODE
Latent ODE (RNN enc.)      RNN        ODE
RNN-VAE                    RNN        RNN

Chen et al. [2018] used an RNN as a recognition network to compute this approximate posterior. We conjecture that using an ODE-RNN as defined above for the recognition network would be a more effective parameterization when the datapoints are irregularly sampled. Thus, we propose using an ODE-RNN as the encoder for a Latent ODE model, resulting in a fully ODE-based sequence-to-sequence model. 
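The generative process in (5)-(7) can be sketched as follows, with a random tanh dynamics function and a linear-Gaussian decoder as illustrative stand-ins for the learned networks:

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(2)
L, D = 4, 2                                    # latent and observation dims
A = rng.normal(0, 0.5, (L, L))                 # stand-in dynamics parameters
C = rng.normal(0, 0.5, (D, L))                 # stand-in decoder for p(x_i | z_i)

def sample_trajectory(ts, noise_std=0.1):
    z0 = rng.normal(size=L)                    # z0 ~ p(z0) = N(0, I)
    # z_i = ODESolve(f, z0, (t_0..t_N)): the whole latent path is a
    # deterministic function of z0.
    zs = solve_ivp(lambda t, z: np.tanh(A @ z), (ts[0], ts[-1]),
                   z0, t_eval=ts).y.T
    # each x_i ~ p(x_i | z_i), here a Gaussian around a linear readout
    xs = zs @ C.T + noise_std * rng.normal(size=(len(ts), D))
    return zs, xs

ts = np.array([0.0, 0.3, 0.35, 1.2, 2.0])      # arbitrary, irregular times
zs, xs = sample_trajectory(ts)
```

All randomness in the latent path enters through z0 alone; the observation times can be chosen freely after the fact, which is what makes this construction natural for irregular sampling.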
In our approach, the mean and standard deviation of the approximate posterior q(z0|{xi, ti}i=0..N) are a function of the final hidden state of an ODE-RNN:

q(z0|{xi, ti}i=0..N) = N(µz0, σz0)  where  µz0, σz0 = g(ODE-RNNφ({xi, ti}i=0..N))    (8)

where g is a neural network translating the final hidden state of the ODE-RNN encoder into the mean and variance of z0. To get the approximate posterior at time point t0, we run the ODE-RNN encoder backwards-in-time from tN to t0. We jointly train both the encoder and decoder by maximizing the evidence lower bound (ELBO):

ELBO(θ, φ) = E_{z0∼qφ(z0|{xi,ti}i=0..N)} [log pθ(x0, . . . , xN)] − KL[qφ(z0|{xi, ti}i=0..N) || p(z0)]    (9)

Table 2: Definition of the hidden state h(t) between observation times ti−1 and ti in autoregressive models. In standard RNNs, the hidden state does not change between updates. In ODE-RNNs, the hidden state is defined by an ODE, and is additionally updated by another network at each observation.

Model           State h(t) between observations
Standard RNN    h_{ti−1}
RNN-Decay       h_{ti−1} · e^{−τ∆t}
GRU-D           h_{ti−1} · e^{−τ∆t}
ODE-RNN         ODESolve(fθ, hi−1, (ti−1, t))

Figure 2: The Latent ODE model with an ODE-RNN encoder. To make predictions in this model, the ODE-RNN encoder is run backwards in time to produce an approximate posterior over the initial state: q(z0|{xi, ti}i=0..N). Given a sample of z0, we can find the latent state at any point of interest by solving an ODE initial-value problem. Figure adapted from Chen et al. [2018].

This latent variable framework comes with several benefits: First, it explicitly decouples the dynamics of the system (ODE), the likelihood of observations, and the recognition model, allowing each to be examined or specified on its own. Second, the posterior distribution over latent states provides an explicit measure of uncertainty, which is not available in standard RNNs and ODE-RNNs. Finally, it becomes easier to answer non-standard queries, such as making predictions backwards in time, or conditioning on a subset of observations.

3.3 Poisson process likelihoods

The fact that a measurement was made at a particular time is often informative about the state of the system [Che et al., 2018]. In the ODE framework, we can use the continuous latent state to parameterize the intensity of events using an inhomogeneous Poisson point process [Palm, 1943], in which the event rate λ(t) changes over time. Poisson point processes have the following log-likelihood:

log p(t1, . . . , tN | tstart, tend, λ(·)) = Σ_{i=1..N} log λ(ti) − ∫_{tstart}^{tend} λ(t) dt

where tstart and tend are the times at which observations started and stopped being recorded.

We augment the Latent ODE framework with a Poisson process over the observation times, where we parameterize λ(t) as a function of z(t). This means that instead of specifying and maximizing the conditional marginal likelihood p(x1, . . . , xN | t1, . . . , tN, θ), we can instead specify and maximize the joint marginal likelihood p(x1, . . . , xN, t1, . . . , tN | θ). 
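The Poisson log-likelihood above can also be estimated numerically; below is a sketch using a trapezoidal approximation of the intensity integral (the model instead accumulates this integral inside the ODE solver; the helper name and constant-rate example here are illustrative):

```python
import numpy as np

def poisson_log_likelihood(obs_times, t_start, t_end, intensity, n_grid=1000):
    """log p(t_1..t_N) = sum_i log lambda(t_i) - integral of lambda over
    [t_start, t_end], with the integral approximated on a dense grid."""
    grid = np.linspace(t_start, t_end, n_grid)
    vals = intensity(grid)
    integral = np.sum((vals[1:] + vals[:-1]) / 2.0 * np.diff(grid))
    return np.sum(np.log(intensity(obs_times))) - integral

# Example: a homogeneous process with rate lambda(t) = 2 on [0, 3],
# observed at 4 event times. Closed form: N log(lambda) - lambda * T.
obs = np.array([0.5, 1.0, 2.2, 2.9])
ll = poisson_log_likelihood(obs, 0.0, 3.0, lambda t: 2.0 * np.ones_like(t))
```

For a constant intensity the trapezoidal rule is exact, so the sketch matches the closed-form value; for a latent-state-dependent λ(z(t)) the grid evaluation is exactly what a solver-side quadrature replaces.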
To compute the joint likelihood, we can evaluate the Poisson intensity λ(t), precisely estimate its integral, and compute the latent states at all required time points, all within a single call to an ODE solver.

Mei and Eisner [2017] used a similar approach, but relied on a fixed time discretization to estimate the Poisson intensity. Chen et al. [2018] showed a toy example of using Latent ODEs with a Poisson process likelihood to fit latent dynamics from observation times alone. In section 4.4, we incorporate a Poisson process likelihood into a Latent ODE to model observation rates in medical data.

Figure 3: Visualization of the inferred Poisson rate λ(t) (green line) for two selected features (diastolic arterial blood pressure, and partial pressure of arterial O2) of different patients from the PhysioNet dataset. Vertical lines mark observation times.

3.4 Batching and computational complexity

One computational difficulty that arises from irregularly-sampled data is that observation times can be different for each time series in a minibatch. In order to solve all ODEs in a minibatch in sync, we must output the solution of the combined ODE at the union of all time points in the batch. Taking the union of time points does not substantially hurt the runtime of the ODE solver, as the adaptive time stepping in ODE solvers is not sensitive to the number of time points (t1 . . . tN) at which

Figure 4: (a) A Latent ODE model conditioned on a small subset of points (10, 30, or 50 observed points). This model, trained on exactly 30 observations per time series, still correctly extrapolates when more observations are provided. 
(b) Trajectories sampled from the prior p(z0) = N(z0; 0, I) of the trained model, then decoded into observation space.

the solver outputs the state. Instead, it depends on the length of the time interval [t1, tN] and the complexity of the dynamics (see suppl. figure 3). Thus, ODE-RNNs and Latent ODEs have a similar asymptotic time complexity to standard RNN models. However, as the ODE must be continuously solved even when no observations occur, the compute cost does not scale with the sparsity of the data, as it does in decay-RNNs. In our experiments, we found that the ODE-RNN takes 60% more time than the standard GRU to evaluate, and the Latent ODE required roughly twice the amount of time to evaluate as the ODE-RNN.

3.5 When should you use an ODE-based model over a standard RNN?

Standard RNNs ignore the time gaps between points. As such, standard RNNs work well on regularly spaced data with few missing values, or when the time intervals between points are short. Models with a continuous-time latent state, such as the ODE-RNN or RNN-Decay, can be evaluated at any desired time point, and are therefore suitable for interpolation tasks. In these models, the future hidden states depend on the time since the last observation, also making them better suited for sparse and/or irregular data than standard RNNs. RNN-Decay enforces that the hidden state converges monotonically to a fixed point over time. In ODE-RNNs, the form of the dynamics between the observations is learned rather than pre-defined. Thus, ODE-RNNs can be used on sparse and/or irregular data without making strong assumptions about the dynamics of the time series.

Latent variable models versus autoregressive models  We refer to models which iteratively compute the joint distribution p(x) = ∏i pθ(xi|xi−1, . . . , x0) as autoregressive models (e.g. 
RNNs and ODE-RNNs). We call models of the form p(x) = ∫ ∏i p(xi|z0) p(z0) dz0 latent-variable models (e.g. Latent ODEs and RNN-VAEs).

In autoregressive models, both the dynamics and the conditioning on data are encoded implicitly through the hidden state updates, which makes them hard to interpret. In contrast, encoder-decoder models (Latent ODE and RNN-VAE) represent state explicitly through a vector zt, and represent dynamics explicitly through a generative model. Latent states in these models can be used to compare different time series, e.g. for clustering or classification tasks, and their dynamics functions can be examined to identify the types of dynamics present in the dataset.

4 Experiments

4.1 Toy dataset

We tested our model on a toy dataset of 1,000 periodic trajectories with variable frequency and the same amplitude. We sampled the initial point from a standard Gaussian, and added Gaussian noise to the observations. Each trajectory has 100 irregularly-sampled time points. During training, we subsample a fixed number of points at random, and attempt to reconstruct the full set of 100 points.

Conditioning on sparse data  Latent ODEs can often reconstruct trajectories reasonably well given a small subset of points, and provide an estimate of uncertainty over both the latent trajectories and predicted observations. To demonstrate this, we trained a Latent ODE model to reconstruct the full trajectory (100 points) from a subset of 30 points. At test time, we conditioned this model on a subset of 10, 30 or 50 points. Conditioning on more points results in a better fit as well as smaller variance across the generated trajectories (fig. 4). 
Figure 4(b) demonstrates that the trajectories sampled from the prior of the trained model are also periodic.

Extrapolation  Next, we show that a time-invariant ODE can recover stationary periodic dynamics from data automatically. Figure 5 shows a Latent ODE trained to condition on 20 points in the [0; 2.5] interval (red area) and predict points on the [2.5; 5] interval (blue area). A Latent ODE with an ODE-RNN encoder was able to extrapolate the time series far beyond the training interval and maintain periodic dynamics. In contrast, a Latent ODE trained with an RNN encoder as in Chen et al. [2018] did not extrapolate the periodic dynamics well.

Figure 5: (a) Approximate posterior samples from a Latent ODE trained with an RNN recognition network, as in Chen et al. [2018]. (b) Approximate posterior samples from a Latent ODE trained with an ODE-RNN recognition network (ours). At training time, the Latent ODE conditions on points in the red area, and reconstructs points in the blue area. At test time, we condition the model on 20 points in the red area, and solve the generative ODE on a larger time interval.

4.2 Quantitative Evaluation

We evaluate the models quantitatively on two tasks: interpolation and extrapolation. On each dataset, we used 80% for training and 20% for testing. See the supplement for a detailed description.

Baselines  In the class of autoregressive models, we compare ODE-RNNs to standard RNNs. We compared the following autoregressive models: (1) ODE-RNN (proposed); (2) a classic RNN where ∆t is concatenated to the input (RNN-∆t); (3) an RNN with exponential decay on the hidden states, h · e^{−τ∆t} (RNN-Decay); (4) an RNN with missing values imputed by a weighted average of the previous value and the empirical mean (RNN-Impute); and (5) GRU-D [Che et al., 2018], which combines exponential decay and the above imputation strategy. 
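For concreteness, the exponential-decay update of eq. (2), which underlies the RNN-Decay baseline, can be sketched as follows (a plain tanh cell and a fixed scalar decay rate are illustrative simplifications; in the real model τ is learned):

```python
import numpy as np

rng = np.random.default_rng(3)
D, X = 6, 3
tau = 0.8                                      # decay rate (learned in practice)
Wh, Wx = rng.normal(0, 0.3, (D, D)), rng.normal(0, 0.3, (D, X))

def rnn_decay_step(h_prev, x, dt):
    """h_i = RNNCell(h_{i-1} * exp(-tau * dt), x_i): the state decays toward
    zero during the gap dt, then is updated at the observation."""
    h_decayed = h_prev * np.exp(-tau * dt)
    return np.tanh(Wh @ h_decayed + Wx @ x)

h = rng.normal(size=D)
x = rng.normal(size=X)
h_short = rnn_decay_step(h, x, dt=0.1)         # short gap: history retained
h_long = rnn_decay_step(h, x, dt=100.0)        # long gap: pre-update state ~ 0
```

After a long enough gap the decayed state vanishes, so the update depends on the current input alone, which is exactly the "zero state is special" assumption the ODE-RNN relaxes.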
Among encoder-decoder models, we compare the Latent ODE to a variational autoencoder in which both the encoder and decoder are recurrent neural nets (RNN-VAE). The ODE-RNN can use any hidden state update formula for the RNNCell function in Algorithm 1. Throughout our experiments, we use the Gated Recurrent Unit (GRU) [Cho et al., 2014]. See the supplement for the architecture details.

Interpolation  The standard RNN and the ODE-RNN are straightforward to apply to the interpolation task. To perform interpolation with a Latent ODE, we encode the time series backwards in time, compute the approximate posterior q(z0|{xi, ti}i=0..N) at the first time point t0, sample the initial state of the ODE z0, and generate mean observations at each observation time.

Extrapolation  In the extrapolation setting, we use the standard RNN or ODE-RNN trained on the interpolation task, and then extrapolate the sequence by re-feeding previous predictions. To encourage extrapolation, we used scheduled sampling [Bengio et al., 2015], feeding previous predictions instead of observed data with probability 0.5 during training. One might expect that directly optimizing for extrapolation would perform best at extrapolation. Such a model would resemble an encoder-decoder model, which we consider separately below (the RNN-VAE). For extrapolation in encoder-decoder models, including the Latent ODE, we split the timeline in half. We encode the observations in the first half forward in time and reconstruct the second half.

4.3 MuJoCo Physics Simulation

Next, we demonstrated that ODE-based models can learn an approximation to simple Newtonian physics. To show this, we created a physical simulation using the "Hopper" model from the DeepMind Control Suite [Tassa et al., 2018]. 
We randomly sampled the initial position of the hopper and its initial velocities such that the hopper rotates in the air and falls on the ground (figure 6). These trajectories are deterministic functions of their initial states, which matches the assumptions made by the Latent ODE. The dataset is 14-dimensional, and we model it with a 15-dimensional latent state. We generated 10,000 sequences of 100 regularly-sampled time points each.

We perform both interpolation and extrapolation tasks on the MuJoCo dataset. During training, we subsampled a small percentage of time points to simulate sparse observation times. For evaluation, we measured the mean squared error (MSE) on the full time series.

Table 3: Test Mean Squared Error (MSE) (×10⁻²) on the MuJoCo dataset.

                               Interpolation (% Observed Pts.)      Extrapolation (% Observed Pts.)
Model                          10%     20%     30%     50%          10%      20%      30%      50%
Autoregressive:
RNN ∆t                         2.454   1.714   1.250   0.785        7.259    6.792    6.594    30.571
RNN GRU-D                      1.968   1.421   1.134   0.748        38.130   20.041   13.049   5.833
ODE-RNN (Ours)                 1.647   1.209   0.986   0.665        13.508   31.950   15.465   26.463
Encoder-decoder:
RNN-VAE                        6.514   6.408   6.305   6.100        2.378    2.135    2.021    1.782
Latent ODE (RNN enc.)          2.477   0.578   2.768   0.447        1.663    1.653    1.485    1.377
Latent ODE (ODE enc., ours)    0.360   0.295   0.300   0.285        1.441    1.400    1.175    1.258

Table 3 shows the mean squared error for models trained on different percentages of observed points. Latent ODEs outperformed standard RNN-VAEs on both interpolation and extrapolation. Our ODE-RNN model also outperforms standard RNNs on the interpolation task. The gap in performance between the RNN and the ODE-RNN increases with sparser data. Notably, the Latent ODE (an encoder-decoder model) shows better performance than the ODE-RNN (an autoregressive model).

All autoregressive models performed poorly at extrapolation. 
This is expected, as they were only trained for one-step-ahead prediction, although standard RNNs performed better than ODE-RNNs. Latent ODEs outperformed RNN-VAEs on the extrapolation task.

Interpretability of the latent state  Figure 6 shows how the norm of the latent state time-derivative fθ(z) changes with time for two reconstructed MuJoCo trajectories. When the hopper hits the ground, there is a spike in the norm of the ODE function. In contrast, when the hopper is lying on the ground, the norm of the dynamics is small.

Figure 7 shows the entropy of the approximate posterior q(z0|{xi, ti}i=0..N) of a trained model conditioned on different numbers of observations. The average entropy (uncertainty) monotonically decreases as more points are observed. Figure 8 shows the latent state z0 projected to 2D using UMAP [McInnes et al., 2018]. The latent state corresponds closely to the physical parameters of the true simulation that most strongly determine the future trajectory of the hopper: distance from the ground, initial velocity on the z-axis, and relative position of the leg of the hopper.

Figure 6: Top row: true trajectories from the MuJoCo dataset. Second row: trajectories reconstructed by a Latent ODE model. Third row: norm of the dynamics function, ||f(z)||, in the latent space of the Latent ODE model. Fourth row: norm of the hidden state change, ||∆h||, of an RNN trained on the same dataset.

Figure 7: Entropy of the approximate posterior over z0 versus the number of observed time points. The line shows the mean; the shaded area shows the 10% and 90% percentiles estimated over 1000 trajectories.

Table 4: Test MSE (mean ± std) on PhysioNet. 
Autoregressive models.

Model             Interp (×10⁻³)
RNN ∆t            3.520 ± 0.276
RNN-Impute        3.243 ± 0.275
RNN-Decay         3.215 ± 0.276
RNN GRU-D         3.384 ± 0.274
ODE-RNN (Ours)    2.361 ± 0.086

4.4 PhysioNet

Figure 8: Nonlinear projection of the latent space of z0 from a Latent ODE model trained on the MuJoCo dataset. Each point is the encoding of one time series. The points are colored by (a) the initial height (distance from the ground), (b) the initial velocity in the z-axis, and (c) the relative initial position of the hip of the hopper. The latent state corresponds closely to the physical parameters of the true simulation.

Table 5: Test MSE (mean ± std) on PhysioNet. Encoder-decoder models.

Model                    Interp (×10⁻³)    Extrap (×10⁻³)
RNN-VAE                  5.930 ± 0.249     3.055 ± 0.145
Latent ODE (RNN enc.)    3.907 ± 0.252     3.162 ± 0.052
Latent ODE (ODE enc)     2.118 ± 0.271     2.231 ± 0.029
Latent ODE + Poisson     2.789 ± 0.771     2.208 ± 0.050

We evaluated our model on the PhysioNet Challenge 2012 dataset [Silva et al., 2012], which contains 8000 time series, each containing measurements from the first 48 hours of a different patient's admission to the ICU. Measurements were made at irregular times, and of varying sparse subsets of the 37 possible features.

Most existing approaches to modeling this data use a coarse discretization of the aggregated measurements per hour [Che et al., 2018], which forces the model to train on only one-twentieth of the measurements. In contrast, our approach, in principle, does not require any discretization or aggregation of measurements. To speed up training, we rounded the observation times to the nearest minute, reducing the number of measurements only 2-fold. 
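This rounding step can be sketched as follows (times in hours; the helper name is an illustrative choice, not taken from the paper's code):

```python
import numpy as np

def round_to_minute(times_hours):
    """Snap observation timestamps (in hours) onto a one-minute grid."""
    return np.round(np.asarray(times_hours) * 60.0) / 60.0

# Irregular raw timestamps within a 48-hour ICU stay.
t = round_to_minute([0.0123, 1.5, 47.992])
```

Two raw measurements landing in the same one-minute slot would be merged, which is what produces the roughly 2-fold reduction mentioned above.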
Hence, there are still 2880 (60 × 48) possible measurement times per time series under our model's preprocessing, while the previous standard was to use only 48 possible measurement times. We used 20 latent dimensions in the Latent ODE generative model. See the supplement for more details on hyperparameters. Tables 4 and 5 report the mean squared error averaged over runs with different random seeds, along with its standard deviation. We ran one-sided t-tests to establish statistical significance. The best models are marked in bold. ODE-based models have smaller mean squared error than the RNN baselines on this dataset.

Finally, we constructed binary classifiers based on each model type to predict in-hospital mortality. We passed the hidden state at the last measured time point into a two-layer binary classifier. Due to class imbalance (13.75% of samples with a positive label), we report the test area under the curve (AUC) instead of accuracy. Table 6 shows that the ODE-RNN, Latent ODE and GRU-D achieved similar classification AUC. A possible explanation is that modelling the dynamics between time points does not make a difference for binary classification of the full time series.

We also included a Poisson process likelihood on observation times, jointly trained with the Latent ODE model. Figure 3 shows the inferred measurement rate on a patient from the dataset. Although the Poisson process was able to model the observation times reasonably well, including this likelihood term did not improve classification accuracy.

4.5 Human Activity dataset

We trained the same classifier models as above on the Human Activity dataset, which contains time series from five individuals performing various activities: walking, sitting, lying, etc. 
Table 6: Per-sequence classification. AUC on PhysioNet.

Method                   AUC
RNN ∆t                   0.787 ± 0.014
RNN-Impute               0.764 ± 0.016
RNN-Decay                0.807 ± 0.003
RNN GRU-D                0.818 ± 0.008
RNN-VAE                  0.515 ± 0.040
Latent ODE (RNN enc.)    0.781 ± 0.018
ODE-RNN                  0.833 ± 0.009
Latent ODE (ODE enc)     0.829 ± 0.004
Latent ODE + Poisson     0.826 ± 0.007

Table 7: Per-time-point classification. Accuracy on the Human Activity dataset.

Method                   Accuracy
RNN ∆t                   0.797 ± 0.003
RNN-Impute               0.795 ± 0.008
RNN-Decay                0.800 ± 0.010
RNN GRU-D                0.806 ± 0.007
RNN-VAE                  0.343 ± 0.040
Latent ODE (RNN enc.)    0.835 ± 0.010
ODE-RNN                  0.829 ± 0.016
Latent ODE (ODE enc)     0.846 ± 0.013

The data consists of 3d positions of tags attached to their belt, chest and ankles (12 features in total). After preprocessing, the dataset has 6554 sequences of 211 time points (details in the supplement). The task is to classify each time point into one of seven types of activities (walking, sitting, etc.). We used a 15-dimensional latent state (more details in the supplement). Table 7 shows that the Latent ODE-based classifier had higher accuracy than the ODE-RNN classifier on this task.

5 Related work

Standard RNNs treat observations as a sequence of tokens, not accounting for the variable gaps between observations. One way to accommodate this is to discretize the timeline into equal intervals, impute missing data, and then run an RNN on the imputed inputs. To perform imputation, Che et al. [2018] used a weighted average between the empirical mean and the previous observation. 
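A sketch of such decay-weighted imputation follows; the exponential weight used here is an illustrative simplification of the learned, feature-wise decay in GRU-D:

```python
import numpy as np

def impute(x_last, x_mean, dt, gamma=1.0):
    """Weighted average of the last observed value and the empirical mean.
    The weight on the last value decays with the elapsed gap dt, so stale
    observations are gradually replaced by the population mean."""
    w = np.exp(-gamma * dt)
    return w * x_last + (1.0 - w) * x_mean

x_last = np.array([2.0, -1.0])    # most recent observed values per feature
x_mean = np.array([0.5, 0.0])     # empirical means per feature
x_now = impute(x_last, x_mean, dt=0.0)     # no gap: keep the last value
x_late = impute(x_last, x_mean, dt=50.0)   # long gap: fall back to the mean
```

The imputed values are then fed to a discrete-time RNN in place of the missing measurements.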
Others have used a separate interpolation network [Shukla and Marlin, 2019], Gaussian processes [Futoma et al., 2017], or generative adversarial networks [Luo et al., 2018] to perform interpolation and imputation prior to running an RNN on time-discretized inputs. In contrast, Lipton et al. [2016] used a binary mask to indicate the missing measurements, and reported that RNNs perform better with zero-filling than with imputed values. They note that such methods can be sensitive to the discretization granularity.

Another approach is to directly incorporate the time gaps between observations into the RNN. The simplest version appends the time gap ∆t to the RNN input. However, Mozer et al. [2017] suggested that appending ∆t makes the model prone to overfitting, and found empirically that it did not improve predictive performance. Another solution is to introduce hidden states that decay exponentially over time [Che et al., 2018, Cao et al., 2018, Rajkomar et al., 2018].

Mei and Eisner [2017] used hidden states with exponential decay to parametrize neural Hawkes processes, and explicitly modeled observation intensities. Hawkes processes are self-exciting processes whose latent state changes at each observation event. This architecture is similar to our ODE-RNN. In contrast, the Latent ODE model assumes that observations do not affect the latent state, but only affect the model's posterior over latent states; it is more appropriate when observations (such as taking a patient's temperature) do not substantially alter the patient's state. Ayed et al. [2019] used a Neural-ODE-based framework to learn the initial state and ODE parameters from a physical simulation. Concurrent work by De Brouwer et al.
[2019] proposed an autoregressive model with ODE-based transitions between observation times and Bayesian updates of the hidden states.

6 Discussion and conclusion

We introduced a family of time series models, ODE-RNNs, whose hidden state dynamics are specified by neural ordinary differential equations (Neural ODEs). We first investigated this model as a standalone refinement of RNNs. We also used this model to improve the recognition networks of a variational autoencoder model known as the Latent ODE. Latent ODEs provide relatively interpretable latent states, as well as explicit uncertainty estimates about latent states. Neither model requires discretizing observation times or imputing data as a preprocessing step, making them suitable for the irregularly-sampled time series data common in many applications. Finally, we demonstrated that continuous-time latent states can be combined with Poisson process likelihoods to model the rates at which observations are made.

Acknowledgments

We thank Chun-Hao Chang, Chris Cremer, Quaid Morris, and Ladislav Rampasek for helpful discussions and feedback. We thank the Vector Institute for providing computational resources.

References

Ibrahim Ayed, Emmanuel de Bézenac, Arthur Pajot, Julien Brajard, and Patrick Gallinari. Learning Dynamical Systems from Partial Observations. arXiv e-prints, art. arXiv:1902.11136, Feb 2019.

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 1171–1179, Cambridge, MA, USA, 2015. MIT Press.

Wei Cao, Dong Wang, Jian Li, Hao Zhou, Lei Li, et al. BRITS: Bidirectional recurrent imputation for time series, May 2018.
URL https://arxiv.org/abs/1805.10572.

Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent Neural Networks for Multivariate Time Series with Missing Values. Scientific Reports, 8(1):6085, 2018. ISSN 2045-2322. doi: 10.1038/s41598-018-24271-9.

Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems 31, pages 6571–6583. Curran Associates, Inc., 2018.

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches, Oct 2014. URL https://arxiv.org/abs/1409.1259.

Edward De Brouwer, Jaak Simm, Adam Arany, and Yves Moreau. GRU-ODE-Bayes: Continuous modeling of sporadically-observed time series. arXiv e-prints, art. arXiv:1905.12374, May 2019.

Joseph Futoma, Sanjay Hariharan, and Katherine Heller. Learning to detect sepsis with a multitask Gaussian process RNN classifier. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1174–1182, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Zachary C. Lipton, David Kale, and Randall Wetzel. Directly modeling missing data in sequences with RNNs: Improved classification of clinical time series. In Finale Doshi-Velez, Jim Fackler, David Kale, Byron Wallace, and Jenna Wiens, editors, Proceedings of the 1st Machine Learning for Healthcare Conference, volume 56 of Proceedings of Machine Learning Research, pages 253–270, Children's Hospital LA, Los Angeles, CA, USA, 18–19 Aug 2016. PMLR.

Yonghong Luo, Xiangrui Cai, Ying Zhang, Jun Xu, and Xiaojie Yuan.
Multivariate time series imputation with generative adversarial networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 1596–1607. Curran Associates, Inc., 2018.

Leland McInnes, John Healy, and James Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv e-prints, art. arXiv:1802.03426, Feb 2018.

Hongyuan Mei and Jason M. Eisner. The neural Hawkes process: A neurally self-modulating multivariate point process. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6754–6764. Curran Associates, Inc., 2017.

Michael C. Mozer, Denis Kazakov, and Robert V. Lindsey. Discrete event, continuous time RNNs, Oct 2017. URL https://arxiv.org/abs/1710.04110.

Conny Palm. Intensitätsschwankungen im Fernsprechverkehr. Ericsson Technics, 1943.

Lev Semenovich Pontryagin, E. F. Mishchenko, V. G. Boltyanskii, and R. V. Gamkrelidze. The mathematical theory of optimal processes. 1962.

Alvin Rajkomar, Eyal Oren, Kai Chen, Andrew M. Dai, Nissan Hajaj, Peter J. Liu, Xiaobing Liu, Mimi Sun, Patrik Sundberg, Hector Yee, Kun Zhang, Gavin Duggan, Gerardo Flores, Michaela Hardt, Jamie Irvine, Quoc Le, Kurt Litsch, Jake Marcus, Alexander Mossin, and Jeff Dean. Scalable and accurate deep learning for electronic health records. npj Digital Medicine, 1, 01 2018. doi: 10.1038/s41746-018-0029-1.

Satya Narayan Shukla and Benjamin Marlin. Interpolation-prediction networks for irregularly sampled time series. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=r1efr3C9Ym.

Ikaro Silva, George Moody, Daniel J. Scott, Leo A. Celi, and Roger G. Mark.
Predicting In-Hospital Mortality of ICU Patients: The PhysioNet/Computing in Cardiology Challenge 2012. Computing in Cardiology, 39:245–248, 2012. ISSN 2325-8861. URL https://www.ncbi.nlm.nih.gov/pubmed/24678516, https://www.ncbi.nlm.nih.gov/pmc/PMC3965265/.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.

Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy Lillicrap, and Martin Riedmiller. DeepMind Control Suite. arXiv e-prints, art. arXiv:1801.00690, Jan 2018.