{"title": "The Coloured Noise Expansion and Parameter Estimation of Diffusion Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1952, "page_last": 1960, "abstract": "Stochastic differential equations (SDE) are a natural tool for modelling systems that are inherently noisy or contain uncertainties that can be modelled as stochastic processes. Crucial to the process of using SDE to build mathematical models is the ability to estimate parameters of those models from observed data. Over the past few decades, significant progress has been made on this problem, but we are still far from having a definitive solution. We describe a novel method of approximating a diffusion process that we show to be useful in Markov chain Monte-Carlo (MCMC) inference algorithms. We take the \u2018white\u2019 noise that drives a diffusion process and decompose it into two terms. The first is a \u2018coloured noise\u2019 term that can be deterministically controlled by a set of auxiliary variables. The second term is small and enables us to form a linear Gaussian \u2018small noise\u2019 approximation. The decomposition allows us to take a diffusion process of interest and cast it in a form that is amenable to sampling by MCMC methods. We explain why many state-of-the-art inference methods fail on highly nonlinear inference problems. We demonstrate experimentally that our method performs well in such situations. Our results show that this method is a promising new tool for use in inference and parameter estimation problems.", "full_text": "The Coloured Noise Expansion and Parameter\n\nEstimation of Diffusion Processes\n\nSimon M.J. Lyons\nSchool of Informatics\nUniversity of Edinburgh\n\n10 Crichton Street, Edinburgh, EH8 9AB\n\nS.Lyons-4@sms.ed.ac.uk\n\nSimo S\u00e4rkk\u00e4\nAalto University\n\nDepartment of Biomedical Engineering\n\nand Computational Science\n\nRakentajanaukio 2, 02150 Espoo\n\nsimo.sarkka@aalto.fi\n\nAmos J. 
Storkey\n\nSchool of Informatics\nUniversity of Edinburgh\n\n10 Crichton Street, Edinburgh, EH8 9AB\n\na.storkey@ed.ac.uk\n\nAbstract\n\nStochastic differential equations (SDE) are a natural tool for modelling systems\nthat are inherently noisy or contain uncertainties that can be modelled as stochastic\nprocesses. Crucial to the process of using SDE to build mathematical models\nis the ability to estimate parameters of those models from observed data. Over\nthe past few decades, signi\ufb01cant progress has been made on this problem, but\nwe are still far from having a de\ufb01nitive solution. We describe a novel method\nof approximating a diffusion process that we show to be useful in Markov chain\nMonte-Carlo (MCMC) inference algorithms. We take the \u2018white\u2019 noise that drives\na diffusion process and decompose it into two terms. The \ufb01rst is a \u2018coloured\nnoise\u2019 term that can be deterministically controlled by a set of auxiliary variables.\nThe second term is small and enables us to form a linear Gaussian \u2018small noise\u2019\napproximation. The decomposition allows us to take a diffusion process of interest\nand cast it in a form that is amenable to sampling by MCMC methods. We explain\nwhy many state-of-the-art inference methods fail on highly nonlinear inference\nproblems, and we demonstrate experimentally that our method performs well in\nsuch situations. Our results show that this method is a promising new tool for use\nin inference and parameter estimation problems.\n\n1\n\nIntroduction\n\nDiffusion processes are a \ufb02exible and useful tool in stochastic modelling. Many important real world\nsystems are currently modelled and best understood in terms of stochastic differential equations in\ngeneral and diffusions in particular. 
Diffusions have been used to model prices of \ufb01nancial instru-\nments [1], chemical reactions [2], \ufb01ring patterns of individual neurons [3], weather patterns [4] and\nfMRI data [5, 6, 7] among many other phenomena.\nThe analysis of diffusions dates back to Feller and Kolmogorov, who studied them as the scaling\nlimits of certain Markov processes (see [8]). The theory of diffusion processes was revolutionised\nby It\u00f4, who interpreted a diffusion process as the solution to a stochastic differential equation [9,\n10]. This viewpoint allows one to see a diffusion process as the randomised counterpart of an\nordinary differential equation. One can argue that stochastic differential equations are the natural\ntool for modelling continuously evolving systems of real valued quantities that are subject to noise\nor stochastic in\ufb02uences.\nThe classical approach to mathematical modelling starts with a set of equations that describe the\nevolution of a system of interest. These equations are governed by a set of input parameters (for\nexample particle masses, reaction rates, or more general constants of proportionality) that determine\nthe behaviour of the system. For practical purposes, it is of considerable interest to solve the inverse\nproblem. Given the output of some system, what can be said about the parameters that govern it?\nIn the present setting, we observe data which we hypothesize are generated by a diffusion. We would\nlike to know what the nature of this diffusion is. For example, we may begin with a parametric model\nof a physical system, with a prior distribution over the parameters. In principle, one can apply Bayes\u2019\ntheorem to deduce the posterior distribution. In practice, this is computationally prohibitive: it is\nnecessary to solve a partial differential equation known as the Fokker-Planck equation (see [11]) in\norder to \ufb01nd the transition density of the diffusion of interest. 
This solution is rarely available in\nclosed form, and must be computed numerically.\nIn this paper, we propose a novel approximation for a nonlinear diffusion process X. One heuristic\nway of thinking about a diffusion is as an ordinary differential equation that is perturbed by white\nnoise. We demonstrate that one can replace the white noise by a \u2018coloured\u2019 approximation without\ninducing much error. The nature of the coloured noise expansion method enables us to control the\nbehaviour of the diffusion over various length-scales. This allows us to produce samples from the\ndiffusion process that are consistent with observed data. We use these samples in a Markov chain\nMonte-Carlo (MCMC) inference algorithm.\nThe main contributions of this paper are:\n\n\u2022 Novel development of a method for sampling from the time-t marginal distribution of a diffusion process based on a \u2018coloured\u2019 approximation of white noise.\n\n\u2022 Demonstration that this approximation is a powerful and scalable tool for making parameter estimation feasible for general diffusions at minimal cost.\n\nThe paper is structured as follows: in Section 2, we describe the structure of our problem. In\nSection 3 we conduct a brief survey of existing approaches to the problem. In Section 4, we discuss\nthe coloured noise expansion and its use in controlling the behaviour of a diffusion process. Our\ninference algorithm is described in Section 5. We describe some numerical experiments in Section 6,\nand future work is discussed in Section 7.\n\n2 Parametric Diffusion Processes\n\nIn this section we develop the basic notation and formalism for the diffusion processes used in this\nwork. First, we assume our data are generated by observing a k-dimensional diffusion process\nwith dynamics\n\ndXt = a\u03b8(Xt)dt + B\u03b8dWt,\n\n(1)\nwhere the initial condition is drawn from some known distribution. Observations are assumed to\noccur at times t1, . . . 
, tn, with ti \u2212 ti\u22121 := Ti. We require that a\u03b8 : IR^k \u2192 IR^k is suf\ufb01ciently regular\nto guarantee the existence of a unique strong solution to (1), and we assume B\u03b8 \u2208 IR^{k\u00d7d}. Both terms\ndepend on a set of potentially unknown parameters \u03b8 \u2208 IR^{d_\u03b8}. We impose a prior distribution p(\u03b8)\non the parameters. The driving noise W is a d-dimensional Brownian motion, and the equation is\ninterpreted in the It\u00f4 sense. Observations are subject to independent Gaussian perturbations centered\nat the true value of X. That is,\n\nX_0 \u223c p(x_0), Y_{t_i} = X_{t_i} + \u03b5_{t_i}, \u03b5_{t_i} \u223c N(0, \u03a3_i).\n\n(2)\n\nWe use the notation X to refer to the entire sample path of the diffusion, and Xt to denote the value\nof the process at time t. We will also employ the shorthand Y1:n = {Yt1, . . . , Ytn}.\nMany systems can be modelled using the form (1). Such systems are particularly relevant in physics\nand natural sciences. In situations where this is not explicitly the case, one can often hope to reduce a\ndiffusion to this form via the Lamperti transform. One can almost always accomplish this in the uni-\nvariate case, but the multivariate setting is somewhat more involved. A\u00eft-Sahalia [12] characterises\nthe set of multivariate diffusions to which this transform can be applied.\n\n\f3 Background Work\n\nMost approaches to parameter estimation of diffusion processes rely on the Monte-Carlo approxi-\nmation. Beskos et al. [13, 14] employ a method based on rejection sampling to estimate parameters\nwithout introducing any discretisation error. Golightly and Wilkinson [15] extend the work of Chib\net al. 
[16] and Durham and Gallant [17] to construct a Gibbs sampler that can be applied to the\nparameter estimation problem.\nRoughly speaking, Gibbs samplers that exist in the literature alternate between drawing samples\nfrom some representation of the diffusion process X conditional on parameters \u03b8, and samples from\n\u03b8 conditional on the current sample path of X. Note that draws from X must be consistent with the\nobservations Y1:n.\nThe usual approach to the consistency issue is to make a proposal by conditioning a linear diffusion\nto hit some neighbourhood of the observation Yk, then to make a correction via a rejection sam-\npling [18] or a Metropolis-Hastings [16] step. However, as the inter-observation time grows, the\nqualitative difference between linear and nonlinear diffusions gets progressively more pronounced,\nand the rate of rejection grows accordingly. Figure 1 shows the disparity between a sample from a\nnonlinear process and a sample from the linear proposal. One can see that the target sample path is\nconstrained to stay near the mode \u03b3 = 2.5, whereas the proposal can move more freely. One should\nexpect to make many proposals before \ufb01nding one that \u2018behaves\u2019 like a typical draw from the true\nprocess.\n\n(a)\n\n(b)\n\nFigure 1: (a) Sample path of a double well process (see equation (18)) with \u03b1 = 2, \u03b3 = 2.5, B = 2\n(blue line). Current Gibbs samplers use linear proposals (dashed red line) with a rejection step to\ndraw conditioned nonlinear paths. In this case, the behaviour of the proposal is very different to that\nof the target, and the rate of rejection is high.\n(b) Sample path of a double well process (solid blue line) with noisy observations (red dots). We\nuse this as an initial dataset on which to test our algorithm. 
Parameters are \u03b1 = 2, \u03b3 = 1, B = 1.\nObservation errors have variance \u03a3 = .25.\n\nFor low-dimensional inference problems, algorithms that employ sequential Monte-Carlo (SMC)\nmethods [19, 20] typically yield good results. However, unlike the Gibbs samplers mentioned\nabove, SMC-based methods often do not scale well with dimension. The number of particles that\none needs to maintain a given accuracy is known to scale exponentially with the dimension of the\nproblem [21].\nA\u00eft-Sahalia [12, 22] uses a deterministic technique based on Edgeworth expansions to approximate\nthe transition density. Other approaches include variational methods [23, 24] that can compute\ncontinuous time Gaussian process approximations to more general stochastic differential systems,\nas well as various non-linear Kalman \ufb01ltering and smoothing based approximations [25, 26, 27].\n\n4 Coloured Noise Expansions and Brownian Motion\n\nWe now introduce a method of approximating a nonlinear diffusion that allows us to gain a con-\nsiderable amount of control over the behaviour of the process. Similar methods have been used\nfor strati\ufb01ed sampling of diffusion processes [28] and the solution of stochastic partial differential\nequations [29]. One of the major challenges of using MCMC methods for parameter estimation\nin the present context is that it is typically very dif\ufb01cult to draw samples from a diffusion process\nconditional on observed data. If one only knows the initial condition of a diffusion, then it is straight-\nforward to simulate a sample path of the process. However, simulating a sample path conditional on\nboth initial and \ufb01nal conditions is a challenging problem.\nOur approximation separates the diffusion process X into the sum of a linear and nonlinear compo-\nnent. 
The linear component of the sum allows us to condition the approximation to \ufb01t observed data\nmore easily than in conventional methods. On the other hand, the nonlinear component captures\nthe \u2018gross\u2019 variation of a typical sample path. In this section, we \ufb01x a generic time interval [0, T ],\nthough one can apply the same derivation for any given interval Ti = ti \u2212 ti\u22121.\nHeuristically, one can think of the random process that drives the process de\ufb01ned in equation (1) as\nwhite noise. In our approximation, we project this white noise into an N-dimensional subspace of\nL2[0, T ], the Hilbert space of square-integrable functions de\ufb01ned on the interval [0, T ]. This gives\na \u2018coloured noise\u2019 process that approaches white noise asymptotically as N \u2192 \u221e. The coloured\nnoise process is then used to drive an approximation of (1). We can choose the space into which\nto project the white noise in such a way that we will gain some control over its behaviour. This is\nanalogous to the way that Fourier analysis allows us to manipulate properties of signals.\nRecall that a standard Brownian motion on the interval [0, T ] is a one-dimensional Gaussian process\nwith zero mean and covariance function k(s, t) = min{s, t}. By de\ufb01nition of the It\u00f4 integral, we\ncan write\n\nW_t = \u222b_0^t dW_s = \u222b_0^T I_{[0,t]}(s) dW_s.\n\n(3)\n\nSuppose {\u03c6_i}_{i\u22651} is an orthonormal basis of L2[0, T ]. We can interpret the indicator function in (3)\nas an element of L2[0, T ] and expand it in terms of the basis functions as follows:\n\nI_{[0,t]}(s) = \u2211_{i=1}^\u221e \u27e8I_{[0,t]}(\u00b7), \u03c6_i(\u00b7)\u27e9 \u03c6_i(s) = \u2211_{i=1}^\u221e (\u222b_0^t \u03c6_i(u) du) \u03c6_i(s).\n\n(4)\n\nSubstituting (4) into (3), we see that\n\nW_t = \u2211_{i=1}^\u221e (\u222b_0^T \u03c6_i(s) dW_s) \u222b_0^t \u03c6_i(u) du.\n\n(5)\n\nWe will employ the shorthand Z_i = \u222b_0^T \u03c6_i(s) dW_s. Since the functions {\u03c6_i} are deterministic and\northonormal, we know from standard results of It\u00f4 calculus that the random variables {Z_i} are i.i.d.\nstandard normal.\nThe in\ufb01nite series in equation (5) can be truncated after N terms to derive an approximation \u0174_t of\nBrownian motion. Taking the derivative with respect to time, the result is a \u2018coloured\u2019 approximation\nof white noise, taking the form\n\nd\u0174_t/dt = \u2211_{i=1}^N Z_i \u03c6_i(t).\n\n(6)\n\nThe multivariate approximation is similar. We separate a d-dimensional Brownian motion into one-\ndimensional components and decompose the individual components as in (6). In principle, one can\nchoose a different value of N for each component of the Brownian motion, but for ease of exposition\nwe do not do so here. We can substitute this approximation into equation (1), which gives\n\ndX^NL_t/dt = a_\u03b8(X^NL_t) + B_\u03b8 \u2211_{i=1}^N \u03a6_i(t) Z_i, X^NL_0 \u223c p(x_0),\n\n(7)\n\nwhere \u03a6_i is the diagonal d \u00d7 d matrix with entries (\u03c6_{i1}, . . . , \u03c6_{id}), and Z_i = (Z_{i1}, . . . , Z_{id})^\u22a4.\nThis derivation is useful because equation (7) gives us an alternative to the Euler-Maruyama discreti-\nsation for sampling approximately from the time-t marginal distribution of a diffusion process. We\ndraw coef\ufb01cients Z_{ij} from a standard normal distribution, and solve the appropriate vector-valued\nordinary differential equation. While the Euler discretisation is the de facto standard method for\nnumerical approximation of SDE, other methods do exist. Kloeden and Platen [30] discuss higher\norder methods such as the stochastic Runge-Kutta scheme [31].\nIn the Euler-Maruyama approximation, one discretises the driving Brownian motion into increments\nW_{t_i} \u2212 W_{t_{i\u22121}} = \u221aT_i Z_i. One must typically employ a \ufb01ne discretisation to get a good approximation\nto the true diffusion process. Empirically, we \ufb01nd that one needs far fewer Gaussian inputs Z_i for\nan accurate representation of X_T using the coloured noise approximation. This more parsimonious\nrepresentation has advantages. For example, Corlay and Pag\u00e8s [28] employ related ideas to conduct\nstrati\ufb01ed sampling of a diffusion process.\nThe coef\ufb01cients Z_i are also more amenable to interpretation than the Gaussian increments in the\nEuler-Maruyama expansion. Suppose we have a one-dimensional process in which we use the\nFourier cosine basis\n\n\u03c6_k(t) = \u221a(2/T) cos((2k \u2212 1)\u03c0t/2T).\n\n(8)\n\nIf we change Z_1 while holding the other coef\ufb01cients \ufb01xed, we will typically see a change in the\nlarge-scale behaviour of the path. On the other hand, a change in Z_N will typically result in a\nchange to the small-scale oscillations in the path. The separation of behaviours across coef\ufb01cients\ngives us a means to obtain \ufb01ne-grained control over the behaviour of a diffusion process within a\nMetropolis-Hastings algorithm.\nWe can improve our approximation by attempting to correct for the fact that we truncated the sum\nin equation (6). Instead of simply discarding the terms Z_i\u03a6_i for i > N, we attempt to account\nfor their effect as follows. 
We assume the existence of some \u2018correction\u2019 process X^C such that\nX = X^NL + X^C. We know that the dynamics of X satisfy\n\ndX_t = a_\u03b8(X^NL_t + X^C_t) dt + B_\u03b8 dW_t.\n\n(9)\n\nTaylor expanding the drift term around X^NL, we see that to \ufb01rst order,\n\ndX_t \u2248 (a_\u03b8(X^NL_t) + J_a(X^NL_t) X^C_t) dt + B_\u03b8 dW_t = (a_\u03b8(X^NL_t) + J_a(X^NL_t) X^C_t) dt + B_\u03b8 d\u0174_t + B_\u03b8 (dW_t \u2212 d\u0174_t).\n\n(10)\n\nHere, J_a(x) is the Jacobian matrix of the function a evaluated at x. This motivates the use of a linear\ntime-dependent approximation to the correction process. We will refer to this linear approximation\nas X^L. The dynamics of X^L satisfy\n\ndX^L_t = J_a(X^NL_t) X^L_t dt + B_\u03b8 dR_t, X^L_0 = 0,\n\n(11)\n\nwhere the driving noise is the \u2018residual\u2019 term R = W \u2212 \u0174. Conditional on X^NL, X^L is a lin-\near Gaussian process, and equation (11) can be solved in semi-closed form. First, we compute a\nnumerical approximation to the solution of the homogeneous matrix-valued equation\n\n(d/dt) \u03a8(t) = J_a(X^NL_t) \u03a8(t), \u03a8(0) = I_n.\n\n(12)\n\nOne can compute \u03a8^{\u22121}(t) in a similar fashion via the relationship d\u03a8^{\u22121}/dt = \u2212\u03a8^{\u22121}(d\u03a8/dt)\u03a8^{\u22121}.\nWe then have\n\nX^L_t = \u03a8(t) \u222b_0^t \u03a8(u)^{\u22121} B dR_u = \u03a8(t) \u222b_0^t \u03a8(u)^{\u22121} B dW_u \u2212 \u2211_{i=1}^N \u03a8(t) (\u222b_0^t \u03a8(u)^{\u22121} B \u03a6_i(u) du) Z_i.\n\n(13)\n\nIt follows that X^L has mean 0 and covariance\n\nk(s, t) = \u03a8(s) (\u222b_0^{s\u2227t} \u03a8(u)^{\u22121} B B^\u22a4 \u03a8^\u22a4(u)^{\u22121} du) \u03a8^\u22a4(t) \u2212 \u2211_{i=1}^N \u03a8(s) (\u222b_0^s \u03a8(u)^{\u22121} B \u03a6_i(u) du) (\u222b_0^t \u03a8(u)^{\u22121} B \u03a6_i(u) du)^\u22a4 \u03a8^\u22a4(t).\n\n(14)\n\nThe process X^NL is designed to capture the most signi\ufb01cant nonlinear features of the original diffu-\nsion X, while the linear process X^L corrects for the truncation of the sum (6), and can be understood\nusing tools from the theory of Gaussian processes. One can think of the linear term as the result of\na \u2018small-noise\u2019 expansion about the nonlinear trajectory. Small-noise techniques have been applied\nto diffusions in the past [11], but the method described above has the advantage of being inherently\nnonlinear. In the supplement to this paper, we show that X\u0302 = X^NL + X^L converges to X in L2[0, T ]\nas N \u2192 \u221e under the assumption that a is Lipschitz continuous. 
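As a concrete illustration of the sampling scheme above, the following sketch (our own hypothetical example, not code from the paper) draws approximate samples of the time-T marginal of the double well diffusion used later in Section 6 by solving the nonlinear ODE (7), with the white noise replaced by the truncated Fourier cosine expansion (8). For simplicity it omits the linear Gaussian correction X^L; the parameter values are those quoted for the paper's first experiment.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Sketch only: sample the time-T marginal of the double well diffusion
#   dX = alpha * X * (gamma^2 - X^2) dt + B dW
# by drawing N i.i.d. standard normal coefficients Z_i and solving the
# coloured-noise ODE dX/dt = a(X) + B * sum_i Z_i * phi_i(t), equation (7),
# with phi_i the Fourier cosine basis of equation (8).

alpha, gamma, B = 2.0, 1.0, 1.0   # parameters of the paper's first experiment
T, N = 1.0, 3                     # inter-observation time and truncation level

def phi(k, t):
    # phi_k(t) = sqrt(2/T) * cos((2k - 1) * pi * t / (2T))
    return np.sqrt(2.0 / T) * np.cos((2 * k - 1) * np.pi * t / (2.0 * T))

def drift(x):
    return alpha * x * (gamma ** 2 - x ** 2)

def sample_marginal(x0, rng):
    Z = rng.standard_normal(N)  # Gaussian inputs Z_1, ..., Z_N
    def rhs(t, x):
        coloured = sum(Z[k - 1] * phi(k, t) for k in range(1, N + 1))
        return drift(x) + B * coloured
    sol = solve_ivp(rhs, (0.0, T), [x0], rtol=1e-8, atol=1e-8)
    return sol.y[0, -1]  # approximate draw of X_T given X_0 = x0

rng = np.random.default_rng(0)
draws = np.array([sample_marginal(0.0, rng) for _ in range(200)])
```

Each draw costs one ODE solve driven by only N = 3 Gaussian inputs, in contrast to the many increments a fine Euler-Maruyama discretisation would need over the same interval.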
If the drift function is linear, then\nX\u0302 = X regardless of the choice of N.\n\n5 Parameter Estimation\n\nIn this section, we describe a novel modi\ufb01cation of the Gibbs sampler that does not suffer the draw-\nbacks of the linear proposal strategy. In Section 6, we demonstrate that for highly nonlinear problems\nit will perform signi\ufb01cantly better than standard methods because of the nonlinear component of our\napproximation.\nSuppose for now that we make a single noiseless observation at time t1 = T (for ease of notation,\nwe will assume that observations are uniformly spaced through time with ti+1 \u2212 ti = T , though this\nis not necessary). Our aim is to sample from the posterior distribution\n\np(\u03b8, Z_{1:N} | X^NL_1 + X^L_1 = Y_1) \u221d N(Y_1 | X^NL_1, k_1(T, T)) N(Z_{1:N}) p(\u03b8).\n\n(15)\n\nWe adopt the convention that N(\u00b7 | \u00b5, \u03a3) represents the normal distribution with mean \u00b5 and covari-\nance \u03a3, whereas N(\u00b7) represents the standard normal distribution. Note that we have left dependence\nof k_1 on Z and \u03b8 implicit. The right-hand side of this expression allows us to evaluate the posterior\nup to proportionality; hence it can be targeted with a Metropolis-Hastings sampler.\nWith multiple observations, the situation is similar. However, we now have a set of Gaussian inputs\nZ^{(i)} for each transition X\u0302_i | X\u0302_{i\u22121}. If we attempt to update \u03b8 and {Z^{(i)}}_{i\u2264n} all at once, the rate of\nrejection will be unacceptably high. For this reason, we update each Z^{(i)} in turn, holding \u03b8 and\nthe other Gaussian inputs \ufb01xed. We draw Z^{(i)*} from the proposal distribution, and compute X^{NL*}_i\nwith initial condition Y_{i\u22121}. We also compute the covariance k^*_i(T, T) of the linear correction. 
The\nacceptance probability for this update is\n\n\u03b1 = 1 \u2227 [N(Y_i | X^{NL*}_i, k^*_i(T, T)) N(Z^{(i)*}_{1:N}) p(Z^{(i)*}_{1:N} \u2192 Z^{(i)}_{1:N})] / [N(Y_i | X^NL_i, k_i(T, T)) N(Z^{(i)}_{1:N}) p(Z^{(i)}_{1:N} \u2192 Z^{(i)*}_{1:N})].\n\n(16)\n\nAfter updating the Gaussian inputs, we make a global update for the \u03b8 parameter. The acceptance\nprobability for this move is\n\n\u03b1 = 1 \u2227 \u220f_{i=1}^n [N(Y_i | X^{NL*}_i, k^*_i(T, T)) p(\u03b8*) p(\u03b8* \u2192 \u03b8)] / [N(Y_i | X^NL_i, k_i(T, T)) p(\u03b8) p(\u03b8 \u2192 \u03b8*)],\n\n(17)\n\nwhere X^{NL*}_i and k^*_i(T, T) are computed using the proposed value of \u03b8*.\nWe noted earlier that when j is large, Z_j governs the small-time oscillations of the diffusion process.\nOne should not expect to gain much information about the value of Z_j when we have large inter-\nobservation times. We \ufb01nd this to be the case in our experiments: the posterior distribution of\nZ_{j:N} approaches a spherical Gaussian distribution when j > 3. For this reason, we employ a\nGaussian random walk proposal in Z_1 with stepsize \u03c3_RW = .45, and proposals for Z_{2:N} are drawn\nindependently from the standard normal distribution.\nIn the presence of observation noise, we proceed roughly as before. Recall that we make obser-\nvations Y_i = X_i + \u03b5_i. We draw proposals Z^{(i)*}_{1:N} and \u03b5^*_i. The initial condition for X^NL_i is now\nY_{i\u22121} \u2212 \u03b5_{i\u22121}. However, one must make an important modi\ufb01cation to the algorithm. Suppose we\npropose an update of X\u0302_i and it is accepted. If we subsequently propose an update for X\u0302_{i+1} and\nit is rejected, then the initial condition for X\u0302_{i+1} will be inconsistent with the current state of the\nchain (it will be Y_i \u2212 \u03b5_i instead of Y_i \u2212 \u03b5^*_i). For this reason, we must propose joint updates for\n(X\u0302_i, \u03b5_i, X\u0302_{i+1}). If the variance of the observation noise is high, it may be more ef\ufb01cient to target the\njoint posterior distribution p(\u03b8, {Z^i_{1:N}, X^L_i} | Y_{1:n}).\n\n6 Numerical Experiments\n\nThe double-well diffusion is a widely-used benchmark for nonlinear inference problems [24, 32,\n33, 34]. It has been used to model systems that exhibit switching behaviour or bistability [11, 35].\nIt possesses nonlinear features that are suf\ufb01cient to demonstrate the shortcomings of some existing\ninference methods, and how our approach overcomes these issues. The dynamics of the process are\ngiven by\n\ndX_t = \u03b1X_t(\u03b3^2 \u2212 X^2_t) dt + B dW_t.\n\n(18)\n\nThe process X has a bimodal stationary distribution, with modes at x = \u00b1\u03b3. The parameter \u03b1\ngoverns the rate at which sample trajectories are \u2018pushed\u2019 toward either mode. If B is small in\ncomparison to \u03b1, mode-switching occurs relatively rarely.\nFigure 1(b) shows a trajectory of a double-well diffusion over 20 units of time, with observations\nat times {1, 2, . . . , 20}. We used the parameters \u03b1 = 2, \u03b3 = 1, B = 1. The variance of the\nobservation noise was set to \u03a3 = .25.\nAs we mentioned earlier, particle MCMC performs well in low-dimensional inference problems.\nFor this reason, the results of a particle MCMC inference algorithm (with N = 1,000 particles) are\nused as \u2018ground truth\u2019. Our algorithm used N = 3 Gaussian inputs with a linear correction. We\nused the Fourier cosine series (8) as an orthonormal basis. We compare our Gibbs sampler to that\nof Golightly and Wilkinson [15], for which we use an Euler discretisation with stepsize \u2206t = .05.\nEach algorithm drew 70,000 samples from the posterior distribution, moving through the parameter\nspace in a Gaussian random walk. 
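For concreteness, a dataset of the kind described above can be generated as follows. This is our own minimal sketch under stated assumptions, not the authors' code: it simulates the double well SDE (18) with an Euler-Maruyama scheme and records noisy observations at unit time intervals, using the parameter values \u03b1 = 2, \u03b3 = 1, B = 1 and observation variance \u03a3 = .25 quoted in the text.

```python
import numpy as np

# Sketch only: generate data like that of Figure 1(b) by Euler-Maruyama
# simulation of the double well SDE (18),
#   dX = alpha * X * (gamma^2 - X^2) dt + B dW,
# with noisy observations Y_t = X_t + eps, eps ~ N(0, 0.25), at t = 1, ..., 20.

rng = np.random.default_rng(1)
alpha, gamma, B = 2.0, 1.0, 1.0
dt, t_max, obs_every = 0.05, 20.0, 1.0   # Euler stepsize as in Section 6
sigma2 = 0.25                            # observation noise variance

steps_per_obs = int(round(obs_every / dt))
n_steps = int(round(t_max / dt))
x = 0.0
path, obs = [x], []
for step in range(1, n_steps + 1):
    # one Euler-Maruyama step: the Brownian increment has variance dt
    x += alpha * x * (gamma ** 2 - x ** 2) * dt + B * np.sqrt(dt) * rng.standard_normal()
    path.append(x)
    if step % steps_per_obs == 0:
        # noisy observation at an integer time
        obs.append(x + np.sqrt(sigma2) * rng.standard_normal())
```

The resulting obs list plays the role of Y_{1:20}; a finer stepsize reduces discretisation bias at the cost of more simulation work.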
We placed an exponential(4) prior on \u03b3 and an exponential(1)\nprior on \u03b1 and B.\nFor this particular choice of parameters, both Gibbs samplers give a good approximation to the true\nposterior. Figure 2 shows histograms of the marginal posterior distributions of (\u03b1, \u03b3, B) for each\nalgorithm.\n\n(a) p(\u03b1|Y1:20)\n\n(b) p(\u03b3|Y1:20)\n\n(c) p(B|Y1:20)\n\nFigure 2: Marginal posterior distributions for (\u03b1, \u03b3, B) conditional on observed data. The solid\nblack line is the output of a particle MCMC method, taken as ground truth. The broken red line is\nthe output of the linear proposal method, and the broken and dotted blue line is the density estimate\nfrom the coloured noise expansion method. We see that both methods give a good approximation to\nthe ground truth.\n\nGibbs samplers that have been used in the past rely on making proposals by conditioning a lin-\near diffusion to hit a target, and subsequently accepting or rejecting those proposals. Over short\ntimescales, or for problems that are not highly nonlinear, this can be an effective strategy. However,\nas the timescale increases, the proposal and target become quite dissimilar (see Figure 1(a)).\n\nFor our second experiment, we simulate a double well process with (\u03b1, \u03b3, B) = (2, 2.5, 2). We make\nnoisy observations with ti \u2212 ti\u22121 = 3 and \u03a3 = .1. The algorithms target the posterior distribution\nover \u03b3, with \u03b1 and B \ufb01xed at their true values. From our previous discussion, one might expect the\nlinear proposal strategy to perform poorly in this more nonlinear setting. This is indeed the case. As\nin the previous experiment, we used a linear proposal Gibbs sampler with Euler stepsize dt = 0.05.\nIn the \u2018path update\u2019 stage, fewer than .01% of proposals were accepted. 
On the other hand, the\ncoloured noise expansion method used N = 7 Gaussian inputs with a linear correction and was able\nto approximate the posterior accurately. Figure 3 shows histograms of the results. Note the different\nscaling of the rightmost plot.\n\n(a) Particle MCMC\n\n(b) Coloured noise expansion\nmethod\n\n(c) Linear proposal method\n\nFigure 3: p(\u03b3|Y1:10, B, \u03b1) after ten observations with a relatively large inter-observation time. We\ndrew data from a double well process with (\u03b1, \u03b3, B) = (2, 2.5, 2). The coloured noise expansion\nmethod matches the ground truth, whereas the linear proposal method is inconsistent with the data.\n\n7 Discussion and Future Work\n\nWe have seen that the standard linear proposal/correction strategy can fail for highly nonlinear prob-\nlems. Our inference method avoids the linear correction step, instead targeting the posterior over\ninput variables directly. With regard to computational ef\ufb01ciency, it is dif\ufb01cult to give an authori-\ntative analysis because both our method and the linear proposal method are complex, with several\nparameters to tune. In our experiments, the algorithms terminated in a roughly similar length of time\n(though no serious attempt was made to optimise the runtime of either method).\nWith regard to our method, several questions remain open. The accuracy of our algorithm depends\non the choice of basis functions {\u03c6i}. At present, it is not clear how to make this choice optimally\nin the general setting. In the linear case, it is possible to show that one can achieve the accuracy\nof the Karhunen-Loeve decomposition, which is theoretically optimal. One can also set the error at\na single time t to zero with a judicious choice of a single basis function. We aim to present these\nresults in a paper that is currently under preparation.\nWe used a Taylor expansion to compute the covariance of the correction term. 
However, it may\nbe fruitful to use more sophisticated ideas, collectively known as statistical linearisation methods.\nIn this paper, we restricted our attention to processes with a state-independent diffusion coef\ufb01cient\nso that the covariance of the correction term could be computed. We may be able to extend this\nmethodology to processes with state-dependent noise; certainly one could achieve this by taking a\n0-th order Taylor expansion about X^NL. Whether it is possible to improve upon this idea is a matter\nfor further investigation.\n\nAcknowledgments\n\nSimon Lyons was supported by Microsoft Research, Cambridge.\n\nReferences\n[1] R.C. Merton. Theory of rational option pricing. The Bell Journal of Economics and Management Science,\n4:141\u2013183, 1973.\n\n[2] D.T. Gillespie. The chemical Langevin equation. Journal of Chemical Physics, 113(1):297\u2013306, 2000.\n\n[3] G. Kallianpur. Weak convergence of stochastic neuronal models. Stochastic Methods in Biology, 70:116\u2013145, 1987.\n\n[4] H.A. Dijkstra, L.M. Frankcombe, and A.S. von der Heydt. A stochastic dynamical systems view of the\nAtlantic Multidecadal Oscillation. Philosophical Transactions of the Royal Society A, 366:2543\u20132558,\n2008.\n\n[5] L. Murray and A. Storkey. Continuous time particle \ufb01ltering for fMRI. Advances in Neural Information\nProcessing Systems, 20:1049\u20131056, 2008.\n\n[6] J. Daunizeau, K.J. Friston, and S.J. Kiebel. Variational Bayesian identi\ufb01cation and prediction of stochastic\nnonlinear dynamic causal models. Physica D, pages 2089\u20132118, 2009.\n\n[7] L.M. Murray and A.J. Storkey. Particle smoothing in continuous time: A fast approach via density\nestimation. IEEE Transactions on Signal Processing, 59:1017\u20131026, 2011.\n\n[8] W. Feller. An Introduction to Probability Theory and its Applications, Volume II. 
Wiley, 1971.

[9] I. Karatzas and S.E. Shreve. Brownian Motion and Stochastic Calculus. Springer, 1991.

[10] B. Øksendal. Stochastic Differential Equations. Springer, 2007.

[11] C.W. Gardiner. Handbook of Stochastic Methods for Physics, Chemistry and the Natural Sciences. Springer-Verlag, 1983.

[12] Y. Aït-Sahalia. Closed-form likelihood expansions for multivariate diffusions. The Annals of Statistics, 36(2):906–937, 2008.

[13] A. Beskos, O. Papaspiliopoulos, and G.O. Roberts. Monte-Carlo maximum likelihood estimation for discretely observed diffusion processes. Annals of Statistics, 37:223–245, 2009.

[14] A. Beskos, O. Papaspiliopoulos, G.O. Roberts, and P. Fearnhead. Exact and computationally efficient likelihood-based estimation for discretely observed diffusion processes (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68:333–382, 2006.

[15] A. Golightly and D.J. Wilkinson. Bayesian inference for nonlinear multivariate diffusion models observed with error. Computational Statistics & Data Analysis, 52:1674–1693, 2008.

[16] S. Chib, M.K. Pitt, and N. Shephard. Likelihood-based inference for diffusion models. Working Paper, 2004. http://www.nuff.ox.ac.uk/economics/papers/2004/w20/chibpittshephard.pdf.

[17] G.B. Durham and A.R. Gallant. Numerical techniques for maximum likelihood estimation of continuous-time diffusion processes (with comments). Journal of Business and Economic Statistics, 20:297–338, 2002.

[18] A. Beskos, O. Papaspiliopoulos, and G.O. Roberts. Retrospective exact simulation of diffusion sample paths with applications. Bernoulli, 12(6):1077, 2006.

[19] D. Rimmer, A. Doucet, and W.J. Fitzgerald. Particle filters for stochastic differential equations of nonlinear diffusions. Technical report, Cambridge University Engineering Department, 2005.

[20] C. Andrieu, A. Doucet, and R. Holenstein.
Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72:1–33, 2010.

[21] C. Snyder, T. Bengtsson, P. Bickel, and J. Anderson. Obstacles to high-dimensional particle filtering. Monthly Weather Review, 136(12):4629–4640, 2008.

[22] Y. Aït-Sahalia. Maximum likelihood estimation of discretely sampled diffusions: a closed-form approximation approach. Econometrica, 70:223–262, 2002.

[23] C. Archambeau, D. Cornford, M. Opper, and J. Shawe-Taylor. Gaussian process approximations of stochastic differential equations. JMLR: Workshop and Conference Proceedings, 1:1–16, 2007.

[24] C. Archambeau, M. Opper, Y. Shen, D. Cornford, and J. Shawe-Taylor. Variational inference for diffusion processes. In Advances in Neural Information Processing Systems 20 (NIPS 2007), 2008.

[25] S. Särkkä. On unscented Kalman filtering for state estimation of continuous-time nonlinear systems. IEEE Transactions on Automatic Control, 52:1631–1641, 2007.

[26] A.H. Jazwinski. Stochastic Processes and Filtering Theory, volume 63. Academic Press, 1970.

[27] H. Singer. Nonlinear continuous time modeling approaches in panel research. Statistica Neerlandica, 62(1):29–57, 2008.

[28] S. Corlay and P. Gilles. Functional quantization based stratified sampling methods. arXiv preprint arXiv:1008.4441, 2010.

[29] W. Luo. Wiener chaos expansion and numerical solutions of stochastic partial differential equations. PhD thesis, California Institute of Technology, 2006.

[30] P.E. Kloeden and E. Platen. Numerical Solution of Stochastic Differential Equations. Springer, 1999.

[31] A.F. Bastani and S.M. Hosseini. A new adaptive Runge-Kutta method for stochastic differential equations. Journal of Computational and Applied Mathematics, 206:631–644, 2007.

[32] Y. Shen, C. Archambeau, D. Cornford, M. Opper, J. Shawe-Taylor, and R. Barillec.
A comparison of variational and Markov chain Monte Carlo methods for inference in partially observed stochastic dynamic systems. Journal of Signal Processing Systems, 61(1):51–59, 2010.

[33] H. Singer. Parameter estimation of nonlinear stochastic differential equations: simulated maximum likelihood versus extended Kalman filter and Itô-Taylor expansion. Journal of Computational and Graphical Statistics, 11(4):972–995, 2002.

[34] M. Opper, A. Ruttor, and G. Sanguinetti. Approximate inference in continuous time Gaussian-jump processes. Advances in Neural Information Processing Systems, 23:1831–1839, 2010.

[35] N.G. van Kampen. Stochastic Processes in Physics and Chemistry. North-Holland, 2007.