{"title": "Analytical Results for the Error in Filtering of Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 2303, "page_last": 2311, "abstract": "Bayesian filtering of stochastic stimuli has received a great deal of attention re- cently. It has been applied to describe the way in which biological systems dy- namically represent and make decisions about the environment. There have been no exact results for the error in the biologically plausible setting of inference on point process, however. We present an exact analysis of the evolution of the mean- squared error in a state estimation task using Gaussian-tuned point processes as sensors. This allows us to study the dynamics of the error of an optimal Bayesian decoder, providing insights into the limits obtainable in this task. This is done for Markovian and a class of non-Markovian Gaussian processes. We find that there is an optimal tuning width for which the error is minimized. This leads to a char- acterization of the optimal encoding for the setting as a function of the statistics of the stimulus, providing a mathematically sound primer for an ecological theory of sensory processing.", "full_text": "Analytical Results for the Error in Filtering of\n\nGaussian Processes\n\nBernstein Center for Computational Neuroscience Berlin,Technische Universit\u00a8at Berlin\n\nalex.susemihl@bccn-berlin.de\n\nAlex Susemihl\n\nRon Meir\n\nDepartment of Eletrical Engineering, Technion, Haifa\n\nrmeir@ee.technion.ac.il\n\nBernstein Center for Computational Neuroscience Berlin, Technische Universit\u00a8at Berlin\n\nopperm@cs.tu-berlin.de\n\nManfred Opper\n\nAbstract\n\nBayesian \ufb01ltering of stochastic stimuli has received a great deal of attention re-\ncently. It has been applied to describe the way in which biological systems dy-\nnamically represent and make decisions about the environment. 
There have been no exact results for the error in the biologically plausible setting of inference on point processes, however. We present an exact analysis of the evolution of the mean-squared error in a state estimation task using Gaussian-tuned point processes as sensors. This allows us to study the dynamics of the error of an optimal Bayesian decoder, providing insights into the limits obtainable in this task. This is done for Markovian and a class of non-Markovian Gaussian processes. We find that there is an optimal tuning width for which the error is minimized. This leads to a characterization of the optimal encoding for the setting as a function of the statistics of the stimulus, providing a mathematically sound primer for an ecological theory of sensory processing.

1 Introduction

Biological systems are constantly interacting with a dynamic, noisy environment, which they can only assess through noisy sensors. Models of Bayesian decision-making have been suggested to account for the functioning of biological systems in many areas [1, 2]. Here, we concentrate on the problem of Bayesian filtering of stochastic processes. There have been many studies on filtering of stimuli by biological systems [1, 2, 3]; however, there are very few analytical results regarding the error of Bayesian filtering. We provide exact expressions for the evolution of the Mean Squared Error (MSE) of Bayesian filtering for a class of Gaussian processes. Results for expected errors of Gaussian processes have so far been obtained only for the problem of smoothing, where predictions are not made online but using both past and future observations [4, 5].

The present work seeks to give an account of the error properties in Bayesian filtering of stochastic processes. We start by analysing the case of Markovian processes in section 2. 
We find a set of filtering equations from which we can derive a differential equation for the expected mean squared error. This provides a way to optimize the system parameters (the 'encoder') in order to minimize the error. We present an implicit equation to optimize the encoding scheme in the case of Poisson spike observations. We also provide a full stochastic model of the evolution of the error, which can be solved analytically in a given interval. Useful approximations for the distribution of the error are also provided. In section 3 we show an application to optimal population coding in sensory neurons. In section 4 we extend the same framework to higher-order processes, where we can control the smoothness by the order of the process. We finalize with a brief discussion. Our theoretical results contribute to the ongoing research on ecological theories in biological signal processing (e.g., [6]), which argue that the performance of sensory systems can be enhanced by allowing sensors to adapt to the statistics of the environment. While an increasing amount of biological evidence has been accumulating for such theories (e.g., [7, 8, 9, 10, 11]), there has been little work providing an exact analytic demonstration of their utility so far.

2 Bayesian Filtering for the Ornstein-Uhlenbeck Process

Consider the problem of estimating a dynamically evolving state in continuous time based on partial noisy observations. In classic approaches one assumes that the state is observed either continuously or at discrete times, leading to the celebrated Kalman filter and its extensions. We are concerned here with a setup of much interest in neuroscience (as well as in queueing theory) where the observations take the form of a set of point processes. 
More concretely, let $X(t)$ be a stochastic process, and let $M$ 'sensory' processes be defined, each of which generates a Poisson point process with a time-dependent rate function $\lambda_m(X(t), t)$, $m = 1, 2, \ldots, M$. Such a stochastic process is often referred to as a doubly stochastic point process. In a neuroscience context $\lambda_m(\cdot)$ represents the tuning function of the $m$-th sensory cell. In order to maintain analytic tractability we focus in this work on a Gaussian form for $\lambda_m$, given by $\lambda_m(X(t), t) = \phi \exp\left[-(X(t) - \theta_m)^2/2\alpha(t)^2\right]$, where $\theta_m$ are the tuning function centers. We will assume for simplicity that the tuning function centers are equally spaced with spacing $\Delta\theta$, although this is not essential to our arguments.

Though the rate of observations for the individual processes depends on the instantaneous value of the process, it can be shown that under certain assumptions the total rate of observations (the rate at which observations by all processes are generated) is independent of the process. If we assume that the processes are independent and that the probability of the stimulus falling outside the range spanned by the tuning function centers is negligible, we obtain the total rate of observations

$$\lambda(t) \approx \sum_m \lambda_m(X(t), t) = \phi \sum_m \exp\left[-\frac{(X(t) - \theta_m)^2}{2\alpha^2(t)}\right] \approx \frac{\sqrt{2\pi}\,\phi\,\alpha(t)}{\Delta\theta}.$$

This approximation is discussed extensively in [12] and is seen to be very precise as long as $\alpha$ is of the same or of a larger order of magnitude as $\Delta\theta$. 
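The stimulus-independence of the total rate is easy to check numerically. The following minimal sketch (all parameter values are arbitrary test choices, not taken from the experiments in this paper) sums the Gaussian tuning functions over a dense grid of centers and compares the result with $\sqrt{2\pi}\,\phi\,\alpha/\Delta\theta$ for several stimulus values:

```python
import numpy as np

# Hypothetical parameters: phi (peak rate), alpha (tuning width),
# centers theta_m on a grid with spacing d_theta.
phi, alpha, d_theta = 10.0, 1.0, 0.2
theta = np.arange(-50, 50, d_theta)  # tuning-curve centers

def total_rate(x):
    """Sum of the Gaussian tuning functions lambda_m(x) over the population."""
    return phi * np.exp(-(x - theta) ** 2 / (2 * alpha ** 2)).sum()

# Closed-form approximation of the total rate, independent of x.
approx = np.sqrt(2 * np.pi) * phi * alpha / d_theta

for x in [0.0, 0.37, 1.5]:  # the summed rate barely depends on the stimulus
    assert abs(total_rate(x) - approx) / approx < 1e-6
```

Because $\alpha / \Delta\theta = 5$ here, the Riemann-sum error is exponentially small, which is why the agreement is far tighter than the stated requirement that $\alpha$ merely be of the same order as $\Delta\theta$.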
Denoting the set of observations generated by the sensory processes by $\xi = \{(t_i, m_i, \Theta_i)\}$, where $t_i$ is the time of the $i$-th observation, $m_i$ the identity of the sensor making the observation and $\Theta_i = \theta_{m_i}$ the center of its Gaussian rate function, we have the probability of a given set of observations $\xi$ given a stimulus history $X_{[t_0,t]}$:

$$P(\xi|X_{[t_0,t]}) = e^{-\sum_m \int_{t_0}^{t_f} \lambda_m(X(t),t)\,dt} \prod_i \lambda_{m_i}(X(t_i), t_i) = e^{-\int_{t_0}^{t_f} \lambda(t)\,dt} \prod_i \lambda_{m_i}(X(t_i), t_i).$$

This defines the likelihood of the observations. Note that without the independence of the total rate from the stimulus, the likelihood would not be Gaussian due to the first term in the product. We need to evaluate the posterior probability $P(X(t)|\xi)$. We have

$$P(X(t)|\xi) \propto P(X(t)) P(\xi|X(t)) = P(X(t)) \int d\mu(X_{[t_0,t)})\, P(\xi|X_{[t_0,t)})\, P(X_{[t_0,t)}|X(t)).$$

The equations involved are Gaussian and evaluating them we obtain the usual Gaussian process regression equations (see [13] and [14, p. 17])

$$\mu(t, \xi) = \sum_{i,j} K(t - t_i)\, C^{-1}_{ij}\, \Theta_j, \qquad s(t, \xi) = K(0) - \sum_{i,j} K(t - t_i)\, C^{-1}_{ij}\, K(t_j - t), \qquad (1)$$

where $K(t - t')$ is the auto-correlation function or kernel of the Gaussian process $X(t)$ and $C_{ij}(\xi) = K(t_i - t_j) + \delta_{ij}\,\alpha(t_i)^2$. This specifies the posterior distribution $P(X(t)|\xi) = \mathcal{N}(\mu(t, \xi), s(t, \xi))$.
Our object of interest is the average mean squared error of the Bayesian estimator at a time $t$ based on past observations. This is the minimal mean-squared error of the optimal Bayesian estimator $\hat{X}(t; \xi) = \langle X(t) \rangle_{X(t)|\xi}$ with respect to a mean-squared error loss function. It is given by

$$MMSE(t) = \left\langle \left\langle (X(t) - \hat{X}(t; \xi))^2 \right\rangle_{X(t)|\xi} \right\rangle_\xi = \left\langle \left\langle (X(t) - \mu(t; \xi))^2 \right\rangle_{X(t)|\xi} \right\rangle_\xi = \left\langle s(t; \xi) \right\rangle_\xi.$$

Here we have written the averaging in the reverse of the usual order and have used $\hat{X}(t, \xi) = \mu(t, \xi)$ in the second step. Note that the posterior variance is independent of the values of the observations, depending solely on the observation times. However, the exact result is still intractable, due both to the complex dependence of $s(t, \xi)$ on the observation times and to the averaging over these. Note that so far the results hold for all kinds of Gaussian processes.

If we make a Markov assumption about the structure of the kernel $K(t - t')$, we are able to make statements about the evolution of the posterior variance between observations. This allows us to derive the differential Chapman-Kolmogorov equation [15] for the evolution of the posterior variance and then obtain the evolution of the MMSE. For the Ornstein-Uhlenbeck process $dX(t) = -\gamma X(t)\,dt + \eta\,dW(t)$ we have the kernel $k(\tau) = \frac{\eta^2}{2\gamma} e^{-\gamma|\tau|}$ and the differential equation for the evolution of the posterior variance between observations (see [16, p. 40] for example)

$$\frac{ds(t)}{dt} = -2\gamma s(t) + \eta^2. \qquad (2)$$

When a new observation arrives, the distribution is updated through Bayes' rule. 
Using that $P(X(t)) = \mathcal{N}(\mu(t), s(t))$ and $P(\theta_i|X) \propto \mathcal{N}(\theta_i; X, \alpha^2(t))$, one can see that

$$P(X(t)|(t, \theta_i)) = \mathcal{N}\left( \frac{\alpha^2(t)\mu(t) + s(t)\theta_i}{\alpha^2(t) + s(t)},\ \frac{\alpha^2(t)\,s(t)}{\alpha^2(t) + s(t)} \right). \qquad (3)$$

Here, as before, the posterior variance is independent of the specific observation $\theta_i$; therefore we need only concentrate on the times of observations for purposes of modeling the posterior variance. The evolution of the posterior variance is a Markov process which is driven by a deterministic drift, given in Eq. 2, and is also subject to discontinuous jumps at random times, which account for the observations, described by Eq. 3. This continuous-time stochastic process is defined by a transition probability which in the limit of infinitesimal time $dt \to 0$ is given by

$$P(s', t + dt|s, t) = (1 - \lambda(t)dt)\,\delta\!\left(s' - s + dt(2\gamma s - \eta^2)\right) + \lambda(t)dt\,\delta\!\left(s' - \frac{\alpha(t)^2 s}{\alpha(t)^2 + s}\right). \qquad (4)$$

In the equation above, the first term accounts for the drift given in Eq. 2 and the second term accounts for the jumps given by Eq. 3. Using (4), and following a standard approach described in Gardiner [15, p. 47], we obtain a partial differential equation, the so-called differential Chapman-Kolmogorov equation, for the exact time evolution of the marginal probability density $P(s, t)$:

$$\frac{\partial P(s,t)}{\partial t} = \frac{\partial}{\partial s}\left[(2\gamma s - \eta^2) P(s,t)\right] + \lambda \left(\frac{\alpha^2}{\alpha^2 - s}\right)^2 P\!\left(\frac{\alpha^2 s}{\alpha^2 - s}, t\right) - \lambda P(s,t). \qquad (5)$$

This equation is, however, too complicated to be solved exactly in the general case. We can use it to derive the evolution of statistical averages by noting that $\frac{d\langle f(s) \rangle}{dt} = \int ds\, f(s)\, \frac{\partial P(s,t)}{\partial t}$. For $f(s) = s$ we obtain an exact equation for the evolution of the average error. 
Writing $\epsilon = \langle s \rangle$, we have

$$\frac{d\epsilon}{dt} = -2\gamma\epsilon + \eta^2 - \lambda(t) \left\langle \frac{s^2}{\alpha^2(t) + s} \right\rangle_{P(s,t)}. \qquad (6)$$

2.1 Mean field approximation

We will now derive a good closed-form approximate equation for the expected posterior variance $\epsilon = \langle s \rangle$ from (6). Note that the expectation of the nonlinear function on the right-hand side is again intractable, but can be approximated using a mean-field approximation of the type $\langle f(s) \rangle \approx f(\langle s \rangle)$. We obtain

$$\frac{d\epsilon_{mf}}{dt} = -2\gamma\epsilon_{mf} + \eta^2 - \lambda(t)\,\frac{\epsilon_{mf}^2}{\alpha(t)^2 + \epsilon_{mf}}. \qquad (7)$$

This approximation works remarkably well, giving an excellent account of the equilibrium regime and of the relaxation of the error, as can be seen in Fig. 2 for the case of population coding. We can also minimize the change in $\epsilon$ at each time step with respect to the sensor parameters $\alpha, \phi$ to find optimal values for them. The maximal observation rate $\phi$ is quite trivial, as an increase in $\phi$ increases the effect of observations linearly. Therefore, without a cost associated to observations, there is no optimal value for $\phi$, since increasing it will always lead to lower values of $\epsilon$. 
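The quality of the mean-field closure can be probed with a short Monte Carlo sketch: simulate the posterior variance as a jump-diffusion (drift of Eq. 2, jumps of Eq. 3 arriving at the total rate $\lambda$) and integrate Eq. 7 alongside it. All parameter values below are arbitrary test choices, not the settings used in the figures:

```python
import numpy as np

# Jump-diffusion model of the posterior variance s(t) vs. mean-field ODE (7).
rng = np.random.default_rng(0)
gamma, eta, alpha = 1.0, 1.0, 1.0
lam = 125.0                      # total observation rate, e.g. sqrt(2*pi)*phi*alpha/dtheta
dt, n_steps, n_runs = 1e-3, 10_000, 100

def simulate():
    """One realization of the posterior variance."""
    s = eta**2 / (2 * gamma)     # start at the prior (stationary) variance
    for _ in range(n_steps):
        s += (-2 * gamma * s + eta**2) * dt       # Eq. (2): drift between observations
        if rng.random() < lam * dt:               # an observation arrives
            s = alpha**2 * s / (alpha**2 + s)     # Eq. (3): jump update
    return s

mc_error = np.mean([simulate() for _ in range(n_runs)])

e = eta**2 / (2 * gamma)         # mean-field error, Eq. (7), same initial condition
for _ in range(n_steps):
    e += (-2 * gamma * e + eta**2 - lam * e**2 / (alpha**2 + e)) * dt

rel_dev = abs(mc_error - e) / e  # relative deviation of the mean field
```

With these settings both estimates relax to the fixed point of Eq. 7 well within the simulated window, and `rel_dev` should come out at the level of a few percent, in line with the mean-field accuracy reported in section 3.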
Minimizing the derivative of $\epsilon$ with respect to $\alpha$, however, yields an implicit equation for the optimal value of $\alpha(t)$:

$$\alpha_{opt}(t)^2 = \left\langle \frac{s^3}{(\alpha_{opt}(t)^2 + s)^2} \right\rangle_{P(s,t)} \bigg/ \left\langle \frac{s^2}{(\alpha_{opt}(t)^2 + s)^2} \right\rangle_{P(s,t)}. \qquad (8)$$

Using again a mean-field approach, we obtain the simple result for the time-dependent tuning width $\alpha^2_{opt}(t) = \epsilon(t)$: the square of the optimal tuning width is the average error of the current estimate of the process. This is interesting, as it accounts for a sharpening of the Gaussian rates when the error is small and a broadening when the error is large.

2.2 Exact results for the stationary distribution

We will now assume that both $\lambda$ and $\alpha$ are time-independent, so that the stochastic process converges to a stationary state described by $\frac{\partial P(s,t)}{\partial t} = 0$. To obtain information about this stationary solution it is useful to introduce the new variable $z = \eta^2/(\gamma s)$. The linear ODE (2) transforms into a nonlinear one, $\dot{z}(t) = \gamma z(2 - z)$. This slight complication comes with a great simplification for the jump conditions: in the new variable a jump is simply $z' = z + \delta$, where $\delta = \eta^2/(\gamma\alpha^2)$ does not depend on $z$. Hence the differential Chapman-Kolmogorov equation (specialised to the stationary state) is simply

$$-\frac{d}{dz}\left[\gamma z(2 - z)\, P(z)\right] + \lambda P(z - \delta) - \lambda P(z) = 0. \qquad (9)$$

Viewing $z$ as a temporal variable, we can treat Eq. 9 as a delay differential equation which depends on $P$ at previous values of $z$. If we knew $P(z)$ in an interval $z_0 - \delta \le z < z_0$, Eq. 9 would become a simple ordinary linear differential equation with a known inhomogeneity $P(z - \delta)$ in the interval $z_0 \le z \le z_0 + \delta$, which could be solved explicitly by numerical quadrature. 
Repeating this procedure would allow us to obtain $P(z)$ iteratively for all $z > 0$. A simple argument shows that $P(z) = 0$ for $z < 2$: since jumps can only increase $z$, and since also $\dot{z}(t) > 0$ for $z < 2$, the interval $0 \le z < 2$ becomes depopulated in the stationary state. Hence, for $2 \le z \le 2 + \delta$ we have

$$-\frac{d}{dz}\left[\gamma z(2 - z)\, P(z)\right] = \lambda P(z),$$

which is solved by $P(z) \propto z^{-2}(1 - 2/z)^{-1 + \lambda/2\gamma}$. Transforming back to the original error variable $s$ yields

$$P_{eq}(s) \propto (\eta^2 - 2\gamma s)^{\frac{\lambda}{2\gamma} - 1}, \qquad (10)$$

valid for $s \in \left( \frac{\alpha^2\eta^2}{2\gamma\alpha^2 + \eta^2}, \frac{\eta^2}{2\gamma} \right)$. This is a very interesting result, as it shows a diverging behaviour in the equilibrium for values of $\lambda < 2\gamma$. This singularity can also be verified in the simulations. This solution gives us a good intuition about the coding properties of the system. When the average time between observations $\tau_{obs} = 1/\lambda$ is larger than the relaxation time of the process' variance $\tau_{var} = 1/2\gamma$, the most probable value for the error will be the equilibrium variance of the observed process, $\eta^2/2\gamma$. Note, however, that the expected error is always smaller than $\eta^2/2\gamma$. When $\lambda = 2\gamma$ we observe a transition and the most likely error becomes smaller. It was not possible to give closed-form analytical expressions for $P(z)$ in the subsequent intervals because the integrals are not analytically tractable. We can, however, continue the iteration of (9) numerically, obtaining great agreement with the simulated histograms. For very small values of $\delta$, the numerical integration becomes less reliable, as the valid intervals become increasingly small, requiring a very small integration step. This can be seen in Fig. 
1.

We can get asymptotic expressions for $P(z)$ when the parameters are such that the relative fluctuations of $z$ are small. This is expected to hold for small jumps $\delta$ (when the system is trivially almost deterministic) and/or for large jump rates $\lambda$, when the density of jumps is so large that relative fluctuations are small. Using again a simple mean-field argument as before shows that in such situations, in equilibrium, $z$ should be close to $z^* = 1 + \sqrt{1 + \lambda\delta/\gamma}$. For small $\delta$ and/or large $\lambda$, for $z$ close to $z^*$ we have $\delta \ll z^*$ and we can expand $P(z - \delta)$ in a Taylor series to second order in $\delta$. Linearising also the drift $\gamma z(2 - z)$ around $z^*$ yields a Fokker-Planck equation which is equivalent to a simple diffusion process (of the Ornstein-Uhlenbeck type), solved by the Gaussian density

$$P(z) = \mathcal{N}\left(1 + \sqrt{1 + \lambda\delta/\gamma},\ \frac{\lambda\delta^2}{4\gamma\sqrt{1 + \lambda\delta/\gamma}}\right).$$

In Fig. 1 we present the different approximations compared to the simulated histograms of the posterior variance.

Figure 1: Comparison of the different regimes for the equilibrium distribution. Top left we see $\alpha = \phi = 1$. Note that neither solution covers the whole range of the distribution, although the exact solution captures the behaviour very well in the low-$z$ region. Top right we see the low-$\alpha$ regime. Note that the exact solution accounts for the distribution over most of its range. In the bottom we see the cases where the Gaussian approximation excels: both large $\alpha$ and large $\phi$ result in an approximately Gaussian distribution, as we have derived above. The blue line (exact solution) is hardly discernible from the red line (histogram) in the small-$\alpha$ case, as is the black line (Gaussian approximation) in the large-$\alpha$ or large-$\phi$ case.

We present results for the specific choice $\eta = \gamma = 1$. 
Note, however, that through a scaling of the parameters $\alpha' = \alpha\eta/\sqrt{\gamma}$ and $\phi' = \phi\gamma$ we can obtain the MMSE for any value of the four parameters from the values for $\eta = \gamma = 1$. In this way, rescaling the parameters, we can obtain the MMSE for any values of $\eta$, $\gamma$, $\alpha$ and $\phi$.

3 Optimal Population Coding

As an application we look into the problem of neural population coding of dynamic stimuli (see [13]). We model the spiking of neurons as doubly stochastic Poisson processes driven by the stimulus $X(t)$; that is, the probability of a given neuron firing a spike in a given interval $[t, t + dt]$ is given by

$$P_t(\mathrm{spike}_m|X(t)) = \phi\, e^{-\frac{(X(t) - \theta_m)^2}{2\alpha(t)^2}}\, dt, \quad \text{and} \quad P_t(\mathrm{spike}|X(t)) \approx \frac{\sqrt{2\pi}\,\phi\,\alpha(t)\,dt}{\Delta\theta} = \lambda(t)\, dt.$$

Under these assumptions, the inference from a spike train is equivalent to that from the observations considered above, and the MMSE follows the differential Eq. 6. Again, the fact that the posterior variance depends solely on the spike times allows us to substitute the spiking processes for each neuron with one spiking process for the whole population, greatly simplifying our calculations. We compare the framework derived here with the dynamic population coding presented in [13] in Fig. 2.

We have calculated the MMSE for a range of values of $\alpha$ and $\phi$ to obtain the dependence of the MMSE on these parameters. In Fig. 3 we show the mean-field treatment of Eq. 6 as well as simulations of the dynamics given by Eq. 4. The mean-field approximation works remarkably well, yielding a relative error smaller than 2% throughout the range of parameters. The approximate and simulated error maps are virtually indistinguishable. As can be seen in Fig. 
2, the mean-field approximation also works very well to reflect the dynamics of the error.

Figure 2: Neural coding of a second-order Markov process as described in the text. The top figure shows the process overlaid with the posterior mean and confidence intervals. The bottom plot shows the posterior variance of one sample run in black, the average over a thousand runs in blue and the mean-field dynamics in red. Code modified from [13].

Figure 3: MMSE for the Ornstein-Uhlenbeck process. On the left we have the average MMSE obtained by the simulation and on the right the value of the MMSE as a function of $\alpha$ for a few values of $\phi$ in the mean-field approximation. The dots are the minima for the mean field and the dotted curves are mean-field values for the same $\phi$. The mean field leads to a very good approximation, and the optimal $\alpha$ for the approximation is a good estimator for the optimal $\alpha$ in the simulation.

4 Filtering Smoother Processes

To study the filtering of smoother processes we will look at higher-order Markov processes. We do so by considering a multidimensional stochastic process which is Markovian if we consider all of its components, but restrict ourselves to one component, which will then exhibit a non-Markovian structure. This is done through an extension of the Ornstein-Uhlenbeck process frequently used in the Gaussian process literature, whose correlation structure is given by the Matérn kernel (see below). We have to work with the covariance matrix of the system, since the dynamics of its elements are coupled. Thus, Eq. 6 will be replaced by a matrix equation, to which we then apply the same treatment. We consider a $p$-th order stochastic process $a_{p+1} X^{(p)}(t) + a_p X^{(p-1)}(t) + \cdots + a_1 X(t) = \eta Z(t)$, where $Z(t)$ is white Gaussian noise with covariance $\delta(t - t')$ and $X^{(n)}(t)$ denotes the $n$-th derivative of $X(t)$. 
Writing the proper Itô stochastic differential equations, we obtain a set of $p - 1$ first-order differential equations and a single first-order stochastic differential equation,

$$\dot{X}_1 = X_2, \quad \dot{X}_2 = X_3, \quad \ldots, \quad \dot{X}_{p-1} = X_p, \quad a_{p+1}\, dX_p = -\sum_{i=1}^{p} a_i X_i\, dt + \eta\, dW_t,$$

where $W_t$ is the Wiener process. Choosing $a_k = \binom{p}{k-1} \gamma^{p+1-k}$ yields processes $X_1(t)$ with an autocorrelation function given by the Matérn kernel

$$k(\tau; \nu, \gamma) = \frac{\eta^2\, 2^{-\nu}}{\sqrt{\pi}\, \Gamma(\nu + 1/2)\, \gamma^{2\nu}}\, (\gamma\tau)^\nu K_\nu(\gamma\tau),$$

where $\nu + \frac{1}{2} = p$, $K_\nu(x)$ is the modified Bessel function of the second kind and $\gamma$ is the parameter determining the characteristic time of the kernel. Note that the one-dimensional Ornstein-Uhlenbeck process is a special case of this with $p = 1$, $\nu = 1/2$. We can control the smoothness of the process $X_1(t)$ with the parameter $\nu$; increasing it yields successively smoother processes (see supplementary information).

We can express this as a multidimensional stochastic process by choosing $\Gamma_{i,j} = -\delta_{i,j-1} + \delta_{i,p}\, a_j$ and $\Sigma^{1/2}_{i,j} = \delta_{i,p}\delta_{j,p}\,\eta$, where $\delta_{i,j}$ is the Kronecker delta. We then have the Itô stochastic differential equation

$$d\vec{X}(t) = -\Gamma \vec{X}(t)\, dt + \Sigma^{1/2}\, d\vec{W} \qquad (11)$$

for $\vec{X}(t)^T = (X_1(t), X_2(t), \ldots, X_p(t))$. The covariance matrix then evolves according to (see [16, p. 40])

$$\frac{d\sigma}{dt} = -\Gamma\sigma - \sigma\Gamma^T + \Sigma. \qquad (12)$$

This can be solved using the solution of the homogeneous equation, $\sigma(t) = e^{-t\Gamma}\, \sigma(0)\, e^{-t\Gamma^T}$, together with the equilibrium solution of the inhomogeneous equation. We assume that only the component $X_1$ is observed, that is, the rate of observations only depends on that component. 
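A quick numerical sanity check ties Eq. (12) to the Matérn kernel: integrating the covariance ODE without observations to its stationary state for $p = 2$ should reproduce the kernel at lag zero, since $\sigma_{1,1}^{eq} = K(0)$. Using the small-argument limit $x^\nu K_\nu(x) \to 2^{\nu-1}\Gamma(\nu)$, the kernel gives $K(0) = \eta^2/(4\gamma^3)$ for $\nu = 3/2$. A minimal sketch ($\gamma$ and $\eta$ are arbitrary test values):

```python
import numpy as np

# p = 2 (nu = 3/2): integrate the covariance ODE, Eq. (12), to equilibrium.
gamma, eta = 1.3, 0.7
a = np.array([gamma**2, 2 * gamma])          # a_k = binom(p, k-1) * gamma**(p+1-k)
G = np.array([[0.0, -1.0],                   # Gamma_{ij} = -delta_{i,j-1} + delta_{i,p} a_j
              [a[0], a[1]]])
S = np.array([[0.0, 0.0],                    # Sigma = Sigma^{1/2} (Sigma^{1/2})^T
              [0.0, eta**2]])

sigma = np.zeros((2, 2))
dt = 1e-2
for _ in range(20_000):                      # Euler steps; fixed point solves Eq. (12) = 0
    sigma += (-G @ sigma - sigma @ G.T + S) * dt

# Stationary variance of the observed component X1 equals K(0) = eta^2 / (4 gamma^3).
assert abs(sigma[0, 0] - eta**2 / (4 * gamma**3)) < 1e-6
```

Note that the fixed point of the Euler iteration is exactly the solution of the Lyapunov equation $-\Gamma\sigma - \sigma\Gamma^T + \Sigma = 0$, so the converged value is insensitive to the step size as long as the iteration is stable.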
We have then $P(X_1, X_{2:p}|\mathrm{obs}) \propto P(\mathrm{obs}|X_1)\, P(X_1, X_{2:p})$. Note that the precision matrix (the inverse of the covariance matrix) will be updated simply by adding the likelihood term $1/\alpha(t)^2$ to the first diagonal element. Using the block matrix inversion theorem we obtain the new covariance matrix

$$\sigma'_{i,j} = \sigma_{i,j} - \frac{\sigma_{1,i}\,\sigma_{1,j}}{\alpha^2 + \sigma_{1,1}}. \qquad (13)$$

Putting equations 12 and 13 together we obtain the differential Chapman-Kolmogorov equation for the evolution of the probability of the covariance matrix. With this we obtain the differential equation for the average posterior covariance matrix

$$\frac{d\langle\sigma_{i,j}\rangle}{dt} = \langle\sigma_{i+1,j}\rangle + \langle\sigma_{i,j+1}\rangle - \sum_l \left(\delta_{i,p}\, a_l \langle\sigma_{l,j}\rangle + \delta_{j,p}\, a_l \langle\sigma_{i,l}\rangle\right) - \lambda(t) \left\langle \frac{\sigma_{1,i}\,\sigma_{1,j}}{\alpha(t)^2 + \sigma_{1,1}} \right\rangle + \eta^2 \delta_{i,p}\delta_{j,p}, \qquad (14)$$

where we abuse the notation by setting $\sigma_{i,j} = 0$ if $i > p$ or $j > p$. These can be solved in the mean-field approximation to obtain an approximation for the covariance matrix. We also note that one can derive a recursion scheme to express all of the elements as functions of the first row of covariances $\sigma_{1,1:p}$. With these expressions we can then use the equilibrium conditions for $d\langle\sigma_{i,i}\rangle/dt$ to solve for the equilibrium value of $\langle\sigma_{i,j}\rangle$.

Figure 4: MMSE for a second-order stochastic process. On the left is the color map of the first diagonal element of the covariance matrix for the $\nu = 3/2$ case, corresponding to the variance of the observed stimulus variable, and on the right, the same element as a function of $\alpha$ for a few values of $\phi$. The overall dependence of the error on $\alpha$ and $\phi$ is strikingly similar to the OU process, with lower values of the MMSE, however; this is due to the smoothness of the process, making it more predictable. In red we show the MMSE for the simulated equilibrium variance for comparison. Though the mean-field approximation is not as good as in the OU case, its relative error still falls below 18% throughout the range of parameters studied.

We provide results for the case $p = 2$, $\nu = 3/2$. The equilibrium MMSE is shown in Fig. 4 on the left, and on the right of Fig. 4 we show the dependence of the MMSE on $\alpha$. The dependence of the error on the parameters strongly resembles that of the Ornstein-Uhlenbeck process, showing a finite optimal value of $\alpha$ which minimizes the error given $\phi$. This becomes less pronounced as we go to very low firing rates. Note that for the second-order process the MMSE relative to the variance of the observed process (MMSE/$K(0)$) drops to lower values than in the Ornstein-Uhlenbeck process, leading to a better state estimation. We expect that the error will become increasingly smaller for higher-order processes.

5 Discussion

We have shown that the dynamics of the Bayesian state estimation error for Markovian processes can be modelled by a simple dynamic system. This provides insight into the generalization properties of Gaussian process inference in an online, causal setting, where previous generalization error calculations for Gaussian processes [4, 5] do not apply. 
Furthermore, we have demonstrated that a simple mean-field approximation successfully captures the dynamics of the average error of the described inference framework. This was shown in detail for the case of Ornstein-Uhlenbeck processes, and for a class of higher-order Markov processes.

One key feature we were able to verify is the existence of an optimal tuning width for Gaussian-tuned Poisson processes which minimizes the MMSE, as has been verified elsewhere for static stimuli ([17, 12, 18]). This result is robust to the inclusion of coloured noise, as we have shown by modelling a second-order process.

Future research could concentrate on generalizing the presented framework towards more realistic spike generation models, such as integrate-and-fire neurons. The generalization to broader classes of stimuli would be of great interest as well. These results provide a promising first step towards a mathematical theory of ecologically grounded sensory processing.

6 Acknowledgements

The work of Alex Susemihl was supported by the DFG Research Training Group GRK1589/1. The work of Ron Meir was partially supported by grant No. 665/08 from the Israel Science Foundation.

References

[1] Tetsuya J. Kobayashi. Implementation of dynamic Bayesian decision making by intracellular kinetics. Phys. Rev. Lett., 104(22):228104, Jun 2010.

[2] Jean-Pascal Pfister, Peter Dayan, and Mate Lengyel. Know thy neighbour: A normative theory of synaptic depression. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1464–1472. 2009.

[3] Omer Bobrowski, Ron Meir, Shy Shoham, and Yonina C. Eldar. A neural network implementing optimal state estimation based on dynamic spike train decoding. In Neural Information Processing Systems, 2007.

[4] Dorthe Malzahn and Manfred Opper. 
A statistical physics approach for the analysis of machine learning algorithms on real data. Journal of Statistical Mechanics: Theory and Experiment, 2005(11):P11001, 2005.

[5] P. Sollich and A. Halees. Learning curves for Gaussian process regression: Approximations and bounds. Neural Computation, 14(6):1393–1428, 2002.

[6] J. Atick and A. N. Redlich. Could information theory provide an ecological theory of sensory processing? Network: Computation in Neural Systems, 5:213–251, 1992.

[7] M. W. Pettet and C. D. Gilbert. Dynamic changes in receptive-field size in cat primary visual cortex. Proceedings of the National Academy of Sciences, 89(17):8366–8370, 1992.

[8] N. Brenner, W. Bialek, and R. de Ruyter van Steveninck. Adaptive rescaling maximizes information transmission. Neuron, 26(3):695–702, 2000.

[9] V. Dragoi, J. Sharma, and M. Sur. Adaptation-induced plasticity of orientation tuning in adult visual cortex. Neuron, 28(1):287–298, 2000.

[10] I. Dean, B. L. Robinson, N. S. Harper, and D. McAlpine. Rapid neural adaptation to sound level statistics. Journal of Neuroscience, 28(25):6430–6438, 2008.

[11] T. Hosoya, S. A. Baccus, and M. Meister. Dynamic predictive coding by the retina. Nature, 436(7047):71–77, 2005.

[12] Steve Yaeli and Ron Meir. Error-based analysis of optimal tuning functions explains phenomena observed in sensory neurons. Frontiers in Computational Neuroscience, 5(0):12, 2010.

[13] Quentin J. M. Huys, Richard S. Zemel, Rama Natarajan, and Peter Dayan. Fast population coding. Neural Computation, 19(2):404–441, 2007.

[14] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, MA, 2006.

[15] C. W. Gardiner. Stochastic Methods: A Handbook for the Natural and Social Sciences, volume 13 of Springer Series in Synergetics. 
Springer, Berlin Heidelberg, fourth edition, 2009.

[16] Hannes Risken. The Fokker-Planck Equation: Methods of Solution and Applications, volume 18 of Springer Series in Synergetics. Springer, Berlin Heidelberg, second edition, third printing, 1996.

[17] M. Bethge, D. Rotermund, and K. Pawelzik. Optimal short-term population coding: When Fisher information fails. Neural Computation, 14(10):2317–2351, 2002.

[18] Philipp Berens, Alexander S. Ecker, Sebastian Gerwinn, Andreas S. Tolias, and Matthias Bethge. Reassessing optimal neural population codes with neurometric functions. Proceedings of the National Academy of Sciences, 108(11):4423–4428, 2011.
", "award": [], "sourceid": 1241, "authors": [{"given_name": "Alex", "family_name": "Susemihl", "institution": null}, {"given_name": "Ron", "family_name": "Meir", "institution": null}, {"given_name": "Manfred", "family_name": "Opper", "institution": null}]}