{"title": "Gaussian Process Training with Input Noise", "book": "Advances in Neural Information Processing Systems", "page_first": 1341, "page_last": 1349, "abstract": "In standard Gaussian Process regression input locations are assumed to be noise free. We present a simple yet effective GP model for training on input points corrupted by i.i.d. Gaussian noise. To make computations tractable we use a local linear expansion about each input point. This allows the input noise to be recast as output noise proportional to the squared gradient of the GP posterior mean. The input noise variances are inferred from the data as extra hyperparameters. They are trained alongside other hyperparameters by the usual method of maximisation of the marginal likelihood. Training uses an iterative scheme, which alternates between optimising the hyperparameters and calculating the posterior gradient. Analytic predictive moments can then be found for Gaussian distributed test points. We compare our model to others over a range of different regression problems and show that it improves over current methods.", "full_text": "Gaussian Process Training with Input Noise\n\nAndrew McHutchon\n\nDepartment of Engineering\n\nCambridge University\nCambridge, CB2 1PZ\najm257@cam.ac.uk\n\nCarl Edward Rasmussen\nDepartment of Engineering\n\nCambridge University\nCambridge, CB2 1PZ\ncer54@cam.ac.uk\n\nAbstract\n\nIn standard Gaussian Process regression input locations are assumed to be noise\nfree. We present a simple yet effective GP model for training on input points cor-\nrupted by i.i.d. Gaussian noise. To make computations tractable we use a local\nlinear expansion about each input point. 
This allows the input noise to be recast\nas output noise proportional to the squared gradient of the GP posterior mean.\nThe input noise variances are inferred from the data as extra hyperparameters.\nThey are trained alongside other hyperparameters by the usual method of max-\nimisation of the marginal likelihood. Training uses an iterative scheme, which\nalternates between optimising the hyperparameters and calculating the posterior\ngradient. Analytic predictive moments can then be found for Gaussian distributed\ntest points. We compare our model to others over a range of different regression\nproblems and show that it improves over current methods.\n\n1\n\nIntroduction\n\nOver the last decade the use of Gaussian Processes (GPs) as non-parametric regression models has\ngrown signi\ufb01cantly. They have been successfully used to learn mappings between inputs and outputs\nin a wide variety of tasks. However, many authors have highlighted a limitation in the way GPs\nhandle noisy measurements. Standard GP regression [1] makes two assumptions about the noise\nin datasets: \ufb01rstly that measurements of input points, x, are noise-free, and, secondly, that output\npoints, y, are corrupted by constant-variance Gaussian noise. For some datasets this makes intuitive\nsense: for example, an application in Rasmussen and Williams (2006) [1] is that of modelling CO2\nconcentration in the atmosphere over the last forty years. One can viably assume that the date is\navailable noise-free and the CO2 sensors are affected by signal-independent sensor noise.\nHowever, in many datasets, either or both of these assumptions are not valid and lead to poor mod-\nelling performance. In this paper we look at datasets where the input measurements, as well as the\noutput, are corrupted by noise. Unfortunately, in the GP framework, considering each input location\nto be a distribution is intractable. 
If, as an approximation, we treat the input measurements as if they\nwere deterministic, and in\ufb02ate the corresponding output variance to compensate, this leads to the\noutput noise variance varying across the input space, a feature often called heteroscedasticity. One\nmethod for modelling datasets with input noise is, therefore, to hold the input measurements to be\ndeterministic and then use a heteroscedastic GP model. This approach has been strengthened by the\nbreadth of research published recently on extending GPs to heteroscedastic data.\nHowever, referring the input noise to the output in this way results in heteroscedasticity with a very\nparticular structure. This structure can be exploited to improve upon current heteroscedastic GP\nmodels for datasets with input noise. One can imagine that in regions where a process is changing\nits output value rapidly, corrupted input measurements will have a much greater effect than in regions\n\nPre-conference version\n\n1\n\n\fwhere the output is almost constant. In other words, the effect of the input noise is related to the\ngradient of the function mapping input to output. This is the intuition behind the model we propose\nin this paper.\nWe \ufb01t a local linear model to the GP posterior mean about each training point. The input noise vari-\nance can then be referred to the output, proportional to the square of the posterior mean function\u2019s\ngradient. This approach is particularly powerful in the case of time-series data where the output\nat time t becomes the input at time t + 1. In this situation, input measurements are clearly not\nnoise-free: the noise on a particular measurement is the same whether it is considered an input or\noutput. 
By also assuming the inputs are noisy, our model is better able to fit datasets of this type.\nFurthermore, we can estimate the noise variance on each input dimension, which is often very useful\nfor analysis.\nRelated work lies in the field of heteroscedastic GPs. A common approach to modelling changing\nvariance with a GP, as proposed by Goldberg et al. [2], is to make the noise variance a random\nvariable and attempt to estimate its form at the same time as estimating the posterior mean. Goldberg\net al. suggested using a second GP to model the noise level as a function of the input location.\nKersting et al. [3] improved upon Goldberg et al.\u2019s Monte Carlo training method with a \u201cmost likely\u201d\ntraining scheme and demonstrated its effectiveness; related work includes Yuan and Wahba [4], and\nLe et al. [5] who proposed a scheme to find the variance via a maximum-a-posteriori estimate set\nin the exponential family. Snelson and Ghahramani [6] suggest a different approach whereby the\nimportance of points in a pseudo-training set can be varied, allowing the posterior variance to vary\nas well. Recently Wilson and Ghahramani broadened the scope still further and proposed Copula\nand Wishart Process methods [7, 8].\nAlthough all of these methods could be applied to datasets with input noise, they are designed for a\nmore general class of heteroscedastic problems and so none of them exploits the structure inherent in\ninput noise datasets. Our model also has a further advantage in that training is by marginal likelihood\nmaximisation rather than by an approximate inference method, or one such as maximum likelihood,\nwhich is more susceptible to overfitting. Dallaire et al. [9] train on Gaussian distributed input points\nby calculating the expected covariance matrix. 
However, their method requires prior knowledge\nof the noise variance, rather than inferring it as we do in this paper.\n\n2 The Model\n\nIn this section we formally derive our model, which we refer to as NIGP (noisy input GP).\nLet x and y be a pair of measurements from a process, where x is a D dimensional input to the\nprocess and y is the corresponding scalar output. In standard GP regression we assume that y is a\nnoisy measurement of the actual output of the process $\\tilde{y}$,\n\n$$y = \\tilde{y} + \\epsilon_y \\qquad (1)$$\n\nwhere $\\epsilon_y \\sim \\mathcal{N}(0, \\sigma_y^2)$. In our model, we further assume that the inputs are also noisy measurements\nof the actual input $\\tilde{x}$,\n\n$$x = \\tilde{x} + \\epsilon_x \\qquad (2)$$\n\nwhere $\\epsilon_x \\sim \\mathcal{N}(0, \\Sigma_x)$. We assume that each input dimension is independently corrupted by noise,\nthus $\\Sigma_x$ is diagonal. Under a model f(.), we can write the output as a function of the input in the\nfollowing form,\n\n$$y = f(\\tilde{x} + \\epsilon_x) + \\epsilon_y \\qquad (3)$$\n\nFor a GP model the posterior distribution based on equation 3 is intractable. We therefore consider\na Taylor expansion about the latent state $\\tilde{x}$,\n\n$$f(\\tilde{x} + \\epsilon_x) = f(\\tilde{x}) + \\epsilon_x^T \\frac{\\partial f(\\tilde{x})}{\\partial \\tilde{x}} + \\ldots \\simeq f(x) + \\epsilon_x^T \\frac{\\partial f(x)}{\\partial x} + \\ldots \\qquad (4)$$\n\nWe don\u2019t have access to the latent variable $\\tilde{x}$ so we approximate it with the noisy measurements.\nNow the derivative of a Gaussian Process is another Gaussian Process [10]. Thus, the exact treatment\nwould require the consideration of a distribution over Taylor expansions. Although the resulting\ndistribution is not Gaussian, its first and second moments can be calculated analytically. 
However, these\ncalculations carry a high computational load and previous experiments showed this exact treatment\nprovided no significant improvement over the much quicker approximate method we now describe.\nInstead we take the derivative of the mean of the GP function, which we will denote $\\partial_{\\bar{f}}$, a D-dimensional\nvector, for the derivative of one GP function value w.r.t. the D-dimensional input, and\n$\\Delta_{\\bar{f}}$, an N by D matrix, for the derivative of N function values. Differentiating the mean function\ncorresponds to ignoring the uncertainty about the derivative. If we expand up to the first order terms\nwe get a linear model for the input noise,\n\n$$y = f(x) + \\epsilon_x^T \\partial_{\\bar{f}} + \\epsilon_y \\qquad (5)$$\n\nThe probability of an observation y is therefore,\n\n$$P(y \\mid f) = \\mathcal{N}(f, \\sigma_y^2 + \\partial_{\\bar{f}}^T \\Sigma_x \\partial_{\\bar{f}}) \\qquad (6)$$\n\nWe keep the usual Gaussian Process prior, $P(f \\mid X) = \\mathcal{N}(0, K(X, X))$, where K(X, X) is the N\nby N training data covariance matrix and X is an N by D matrix of input observations. Combining\nthese probabilities gives the predictive posterior mean and variance as,\n\n$$\\mathbb{E}[f_* \\mid X, y, x_*] = k(x_*, X) \\left[ K(X, X) + \\sigma_y^2 I + \\mathrm{diag}\\{\\Delta_{\\bar{f}} \\Sigma_x \\Delta_{\\bar{f}}^T\\} \\right]^{-1} y$$\n$$\\mathbb{V}[f_* \\mid X, y, x_*] = k(x_*, x_*) - k(x_*, X) \\left[ K(X, X) + \\sigma_y^2 I + \\mathrm{diag}\\{\\Delta_{\\bar{f}} \\Sigma_x \\Delta_{\\bar{f}}^T\\} \\right]^{-1} k(X, x_*) \\qquad (7)$$\n\nThis is equivalent to treating the inputs as deterministic and adding a corrective term,\n$\\mathrm{diag}\\{\\Delta_{\\bar{f}} \\Sigma_x \\Delta_{\\bar{f}}^T\\}$, to the output noise. The notation \u201cdiag{.}\u201d results in a diagonal matrix, the\nelements of which are the diagonal elements of its matrix argument. 
Note that if the posterior mean\ngradient is constant across the input space the heteroscedasticity is removed and our model is essen-\ntially identical to a standard GP.\nAn advantage of our approach can be seen in the case of multiple output dimensions. As the input\nnoise levels are the same for each of the output dimensions, our model can use data from all of the\noutputs when learning the input noise variances. Not only does this give more information about the\nnoise variances without needing further input measurements but it also reduces over-\ufb01tting as the\nlearnt noise variances must agree with all E output dimensions.\nFor time-series datasets (where the model has to predict the next state given the current), each\ndimension\u2019s input and output noise variance can be constrained to be the same since the noise level\non a measurement is independent of whether it is an input or output. This further constraint increases\nthe ability of the model to recover the actual noise variances. The model is thus ideally suited to the\ncommon task of multivariate time series modelling.\n\n3 Training\n\nOur model introduces an extra D hyperparameters compared to the standard GP - one noise variance\nhyperparameter per input dimension. A major advantage of our model is that these hyperparameters\ncan be trained alongside any others by maximisation of the marginal likelihood. This approach\nautomatically includes regularisation of the noise parameters and reduces the effect of over-\ufb01tting.\nIn order to calculate the marginal likelihood of the training data we need the posterior distribution,\nand the slope of its mean, at each of the training points. However, evaluating the posterior mean\nfrom equation 7 with x\u2217 \u2208 X, results in an analytically unsolvable differential equation: \u00aff is a\ncomplicated function of \u2206 \u00aff , its own derivative. 
Therefore, we define a two-step approach: first we\nevaluate a standard GP with the training data, using our initial hyperparameter settings and ignoring\nthe input noise. We then find the slope of the posterior mean of this GP at each of the training points\nand use it to add in the corrective variance term, $\\mathrm{diag}\\{\\Delta_{\\bar{f}} \\Sigma_x \\Delta_{\\bar{f}}^T\\}$. This process is summarised in\nfigures 1a and 1b.\nThe marginal likelihood of the GP with the corrected variance is then computed, along with its\nderivatives with respect to the initial hyperparameters, which include the input noise variances. This\nstep involves chaining the derivatives of the marginal likelihood back through the slope calculation.\nGradient descent can then be used to improve the hyperparameters. Figure 1c shows the GP posterior\nfor the trained hyperparameters and shows how NIGP can reduce output noise level estimates by\ntaking input noise into account. Figure 1d shows the NIGP fit for the trained hyperparameters.\n\nFigure 1: Training with NIGP. (a) A standard GP posterior distribution can be computed from an\ninitial set of hyperparameters and a training data set, shown by the blue crosses. The gradients of the\nposterior mean at each training point can then be found analytically. (b) The NIGP method increases\nthe posterior variance by the square of the posterior mean slope multiplied by the current setting of\nthe input noise variance hyperparameter. The marginal likelihood of this fit is then calculated along\nwith its derivatives w.r.t. initial hyperparameter settings. Gradient descent is used to train the\nhyperparameters. (c) This plot shows the standard GP posterior using the newly trained hyperparameters.\nComparing to plot (a) shows that the output noise hyperparameter has been greatly reduced. 
(d) This\nplot shows the NIGP fit - plot (c) with the input noise corrective variance term, $\\mathrm{diag}\\{\\Delta_{\\bar{f}} \\Sigma_x \\Delta_{\\bar{f}}^T\\}$.\nPlot (d) is related to plot (c) in the same way that plot (b) is related to plot (a).\n\nTo improve the fit further we can iterate this procedure: we use the slopes of the current trained\nNIGP, instead of a standard GP, to calculate the effect of the input noise, i.e. replace the fit in figure\n1a with the fit from figure 1d and re-train.\n\n4 Prediction\n\nWe turn now to the task of making predictions at noisy input locations with our model. To be true to\nour model we must use the same process in making predictions as we did in training. We therefore\nuse the trained hyperparameters and the training data to define a GP posterior mean, which we\ndifferentiate at each test point and each training point. The calculated gradients are then used to add\nin the corrective variance terms. The posterior mean slope at the test points is only used to calculate\nthe variance over observations, where we increase the predictive variance by the noise variances.\nThere is an alternative option, however.\nIf a single test point is considered to have a Gaussian\ndistribution and all the training points are certain then, although the GP posterior is unknown, its\nmean and variance can be calculated exactly [11]. 
As our model estimates the input noise variance\n$\\Sigma_x$ during training, we can consider a test point to be Gaussian distributed: $x'_* \\sim \\mathcal{N}(x_*, \\Sigma_x)$.\n[11] then gives the mean and variance of the posterior distribution, for a squared exponential kernel\n(equation 12), to be,\n\n$$\\bar{f}_* = \\left( \\left[ K + \\sigma_y^2 I + \\Sigma_x \\partial_{\\bar{f}}^2 \\right]^{-1} y \\right)^T q \\qquad (8)$$\n\n[Figure 1 panel titles: a) Initial hyperparameters & training data define a GP fit; b) Extra variance added proportional to squared slope; c) Standard GP with NIGP trained hyperparameters; d) The NIGP fit including variance from input noise. Axes: Input (horizontal), Target (vertical).]\n\nwhere,\n\n$$q_i = \\sigma_f^2 \\left| \\Sigma_x \\Lambda^{-1} + I \\right|^{-\\frac{1}{2}} \\exp\\left( -\\frac{1}{2} (x_i - x_*)^T (\\Sigma_x + \\Lambda)^{-1} (x_i - x_*) \\right) \\qquad (9)$$\n\nwhere $\\Lambda$ is a diagonal matrix of the squared lengthscale hyperparameters.\n\n$$\\mathbb{V}[f_*] = \\sigma_f^2 - \\mathrm{tr}\\left( \\left[ K + \\sigma_y^2 I + \\Sigma_x \\partial_{\\bar{f}}^2 \\right]^{-1} Q \\right) + \\alpha^T Q \\alpha - \\bar{f}_*^2 \\qquad (10)$$\n\nwith,\n\n$$Q_{ij} = \\frac{k(x_i, x_*) k(x_j, x_*)}{\\left| 2 \\Sigma_x \\Lambda^{-1} + I \\right|^{\\frac{1}{2}}} \\exp\\left( (z - x_*)^T \\left( \\Lambda + \\frac{1}{2} \\Lambda \\Sigma_x^{-1} \\Lambda \\right)^{-1} (z - x_*) \\right) \\qquad (11)$$\n\nwith $z = \\frac{1}{2}(x_i + x_j)$. This method is computationally slower than using equation 7 and is vulnerable\nto worse results if the learnt input noise variance $\\Sigma_x$ is very different from the true value. However,\nit gives proper consideration to the uncertainty surrounding the test point and exactly computes the\nmoments of the correct posterior distribution. 
This often leads it to outperform predictions based on\nequation 7.\n\n5 Results\n\nWe tested our model on a variety of functions and datasets, comparing its performance to standard\nGP regression as well as Kersting et al.\u2019s \u2018most likely heteroscedastic GP\u2019 (MLHGP) model, a\nstate-of-the-art heteroscedastic GP model. We used the squared exponential kernel with Automatic\nRelevance Determination,\n\n$$k(x_i, x_j) = \\sigma_f^2 \\exp\\left( -\\frac{1}{2} (x_i - x_j)^T \\Lambda^{-1} (x_i - x_j) \\right) \\qquad (12)$$\n\nwhere $\\Lambda$ is a diagonal matrix of the squared lengthscale hyperparameters and $\\sigma_f^2$ is a signal variance\nhyperparameter. Code to run NIGP is available on the author\u2019s website.\n\n[Figure 2 panel titles, left to right: Standard GP; Kersting et al.; This paper]\n\nFigure 2: Posterior distribution for a near-square wave with $\\sigma_y$ = 0.05, $\\sigma_x$ = 0.3, and 60 data points.\nThe solid line represents the predictive mean and the dashed lines are two standard deviations either\nside. Also shown are the training points and the underlying function. The left image is for standard\nGP regression, the middle uses Kersting et al.\u2019s MLHGP algorithm, the right image shows our model.\nWhile the predictive means are similar, both our model and MLHGP pinch in the variance around the\nlow noise areas. Our model correctly expands the variance around all steep areas whereas MLHGP\ncan only do so where high noise is observed (see areas around x = -6 and x = 1).\n\nFigure 2 shows an example comparison between standard GP regression, Kersting et al.\u2019s MLHGP,\nand our model for a simple near-square wave function. This function was chosen as it has areas\nof steep gradient and near flat gradient and thus suffers from the heteroscedastic problems we are\ntrying to solve. 
The posterior means are very similar for the three models, however the variances\nare quite different. The standard GP model has to take into account the large noise seen around the\nsteep sloped areas by assuming large noise everywhere, which leads to the much larger error bars.\nOur model can recover the actual noise levels by taking the input noise into account. Both our model\nand MLHGP pinch the variance in around the \ufb02at regions of the function and expand it around the\nsteep areas. For the example shown in \ufb01gure 2 the standard GP estimated an output noise standard\ndeviation of 0.16 (much too large) compared to our estimate of 0.052, which is very close to the\ncorrect value of 0.050. Our model also learnt an input noise standard deviation of 0.305, very close\nto the real value of 0.300. MLHGP does not produce a single estimate of noise levels.\nPredictions for 1000 noisy measurements were made using each of the models and the log proba-\nbility of the test set was calculated. The standard GP model had a log probability per data point of\n0.419, MLHGP 0.740, and our model 0.885, a signi\ufb01cant improvement. Part of the reason for our\nimprovement over MLHGP can be seen around x = 1: our model has near-symmetric \u2018horns\u2019 in\nthe variance around the corners of the square wave, whereas MLHGP only has one \u2018horn\u2019. This is\nbecause in our model, the amount of noise expected is proportional to the derivative of the mean\nsquared, which is the same for both sides of the square wave. In Kersting et al.\u2019s model the noise\nis estimated from the training points themselves. In this example the training points around x = 1\nhappen to have low noise and so the learnt variance is smaller. The same problem can be seen around\nx = \u22126 where MLHGP has much too small variance. 
This illustrates an important aspect of our\nmodel: the accuracy in plotting the varying effect of noise is only dependent on the accuracy of the\nmean posterior function and not on an extra, learnt noise model. This means that our model typically\nrequires fewer data points to achieve the same accuracy as MLHGP on input noise datasets. To test\nthe models further, we trained them on a suite of six functions. The functions were again chosen\nto have varying gradients across the input space. The training set consisted of twenty \ufb01ve points in\nthe interval [-10, 10] and the test set one thousand points in the same interval. Trials were run for\ndifferent levels of input noise. For each trial, ten different initialisations of the hyperparameters were\ntried. In order to remove initialisation effects the best initialisations for each model were chosen at\neach step. The entire experiment was run on twenty different random seeds. For our model, NIGP,\nwe trained both a single model for all output dimensions, as well as separate models for each of the\noutputs, to see what the effect of using the cross-dimension information was.\nFigure 3 shows the results for this experiment. The \ufb01gure shows that NIGP performs very well on\nall the functions, always outperforming the standard GP when there is input noise and nearly always\nMLHGP; wherever there is a signi\ufb01cant difference our model is favoured. Training on all the outputs\nat once only gives an improvement for some of the functions, which suggests that, for the others,\nthe input noise levels could be estimated from the individual functions alone. The predictions using\nstochastic test-points, equations 8 and 10, generally outperformed the predictions made using deter-\nministic test-points, equation 7. The RMSEs are quite similar to each other for most of the functions\nas the posterior means are very similar, although where they do differ signi\ufb01cantly, again, it is to\nfavour our model. 
These results show our model consistently calculates a more accurate predictive\nposterior variance than either a standard GP or a state-of-the-art heteroscedastic GP model.\nAs previously mentioned, our model can be adapted to work more effectively with time-series data,\nwhere the outputs become subsequent inputs. In this situation the input and output noise variance\nwill be the same. We therefore combine these two parameters into one. We tested NIGP on a\ntime-series dataset and compared the two modes (with separate input and output noise hyperparameters\nand with combined) and also to standard GP regression (MLHGP was not available for multiple\ninput dimensions). The dataset is a simulated pendulum without friction and with added noise.\nThere are two variables: pendulum angle and angular velocity. The choice of time interval between\nobservations is important: for very small time intervals, and hence small changes in the angle, the\ndynamics are approximately linear, as sin \u03b8 \u2248 \u03b8. As discussed before, our model will not bring\nany benefit to linear dynamics, so in order to see the difference in performance a much longer time\ninterval was chosen. The range of initial angular velocities was chosen to allow the pendulum to\nspin multiple times at the extremes, which adds extra non-linearity. Ten different initialisations\nwere tried, with the one achieving the highest training set marginal likelihood chosen, and the whole\nexperiment was repeated fifty times with different random seeds.\n\nFigure 3: Comparison of models for suite of 6 test functions. The solid line is our model with\n\u2018deterministic test-point\u2019 predictions, the solid line with triangles is our model with \u2018stochastic\ntest-point\u2019 predictions. Both these models were trained on all 6 functions at once, the respective dashed\nlines were trained on the functions individually. The dash-dot line is a standard GP regression model\nand the dotted line is MLHGP. RMSE has been normalised by the RMS value of the function. In both\nplots lower values indicate better performance. The plots show our model has lower negative log\nposterior predictive than standard GP on all the functions, particularly the exponentially decaying\nsine wave and the multiplication between tan and sin.\n\nThe plots show the difference in log probability of the test set between four versions of NIGP and a\nstandard GP model trained on the same data. All four versions of our model perform better than the\nstandard GP. Once again the stochastic test point version outperforms the deterministic test points.\nThere was a slight improvement in RMSE using our model but the differences were within two\nstandard deviations of each other. There is also a slight improvement using the combined noise\nlevels although, again, the difference is contained within the error bars.\nA better comparison between the two modes is to look at the input noise variance values recovered.\nThe real noise standard deviations used were 0.2 and 0.4 for the angle and angular velocity\nrespectively. The model which learnt the variances separately found standard deviations of 0.3265 and\n0.8026 averaged over the trials, whereas the combined model found 0.2429 and 0.8948. This is a\nsignificant improvement on the first dimension. Both modes struggle to recover the correct noise\nlevel on the second dimension and this is probably why the angular velocity prediction performance\nshown in figure 4 is worse than the angle prediction performance. 
Training with more data significantly\nimproved the recovered noise value although the difference between the two NIGP modes\nthen shrank as there was sufficient information to correctly deduce the noise levels separately.\n\n[Figure 3 panels: sin(x); Near-square wave; exp(-0.2*x)*sin(x); tan(0.15*(x))*sin(x); 0.2*x^2*tanh(cos(x)); 0.5*log(x^2*(sin(2*x)+2)+1). Vertical axes: Negative log predictive posterior (first set of panels), Normalised test set RMSE (second set of panels). Horizontal axis: Input noise standard deviation. Legend: NIGP DTP all o/p; NIGP DTP indiv. o/p; NIGP STP indiv. o/p; NIGP STP all o/p; Kersting et al.; Standard GP.]\n\nFigure 4: The difference between four versions of NIGP and a standard GP model on a pendulum\nprediction task. DTP stands for deterministic test point and STP is stochastic test point. Comb. and\nsep. indicate whether the model combined the input and output noise parameters or treated them\nseparately. The error bars indicate plus/minus two standard deviations.\n\n6 Conclusion\n\nThe correct way of training on input points corrupted by Gaussian noise is to consider every input\npoint as a Gaussian distribution. This model is intractable, however, and so approximations must\nbe made. In our model, we refer the input noise to the output by passing it through a local linear\nexpansion. This adds a term to the likelihood which is proportional to the squared posterior mean\ngradient. Not only does this lead to tractable computations but it makes intuitive sense - input\nnoise has a larger effect in areas where the function is changing its output rapidly. The model,\nalthough simple in its approach, has been shown to be very effective, outperforming Kersting et\nal.\u2019s model and a standard GP model in a variety of different regression tasks. It can make use of\nmultiple outputs and can recover a noise variance parameter for each input dimension, which is\noften useful for analysis. In our approximate model, exact inference can be performed as the model\nhyperparameters can be trained simultaneously by marginal likelihood maximisation.\nA proper handling of time-series data would constrain the specific noise levels on each training point\nto be the same for when they are considered inputs and outputs. This would be computationally very\nexpensive however. By allowing input noise and fixing the input and output noise variances to be\nidentical, our model is a computationally efficient alternative. 
Our results showed that NIGP gives a\nsubstantial improvement over the often-used standard GP for modelling time-series data.\nIt is important to state that this model has been designed to tackle a particular situation, that of\nconstant-variance input noise, and would not perform so well on a general heteroscedastic prob-\nlem. It could not be expected to improve over a standard GP on problems where noise levels are\nproportional to the function or input value for example. We do not see this limitation as too re-\nstricting however, as we maintain that constant input noise situations (including those where this is\na suf\ufb01cient approximation) are reasonably common. Throughout the paper we have taken particular\ncare to avoid functions or systems which are linear, or approximately linear, as in these cases our\nmodel can be reduced to standard GP regression. However, for the problems for which NIGP has\nbeen designed, such as the various non-linear problems we have presented in this paper, our model\noutperforms current methods.\nThis paper considers a \ufb01rst order Taylor expansion of the posterior mean function. We would expect\nthis to be a good approximation for any function providing the input noise levels are not too large\n(i.e. small perturbations around the point we linearised about). In practice, we could require that\nthe input noise level is not larger than the input characteristic length scale. A more accurate model\ncould use a second order Taylor series, which would still be analytic although computationally\nthe algorithm would then scale with D3 rather than the current D2. Another re\ufb01nement could be\nachieved by doing a Taylor series for the full posterior distribution (not just its mean, as we have\ndone here), again at considerably higher computational cost. These are interesting areas for future\nresearch, which we are actively pursuing.\n\n8\n\n\fReferences\n[1] Carl Edward Rasmussen and Christopher K. I. Williams. 
Gaussian Processes for Machine Learning. MIT Press, 2006.\n\n[2] Paul W. Goldberg, Christopher K. I. Williams, and Christopher M. Bishop. Regression with input-dependent noise: A Gaussian Process treatment. NIPS-98, 1998.\n\n[3] Kristian Kersting, Christian Plagemann, Patrick Pfaff, and Wolfram Burgard. Most likely heteroscedastic Gaussian Process regression. ICML-07, 2007.\n\n[4] Ming Yuan and Grace Wahba. Doubly penalized likelihood estimator in heteroscedastic regression. Statistics and Probability Letters, 69:11\u201320, 2004.\n\n[5] Quoc V. Le, Alex J. Smola, and Stephane Canu. Heteroscedastic Gaussian Process regression. Proceedings of ICML-05, pages 489\u2013496, 2005.\n\n[6] Edward Snelson and Zoubin Ghahramani. Variable noise and dimensionality reduction for sparse Gaussian Processes. Proceedings of UAI-06, 2006.\n\n[7] A.G. Wilson and Z. Ghahramani. Copula processes. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R.S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2460\u20132468. 2010.\n\n[8] Andrew Wilson and Zoubin Ghahramani. Generalised Wishart Processes. In Proceedings of the Twenty-Seventh Annual Conference on Uncertainty in Artificial Intelligence (UAI-11), pages 736\u2013744, Corvallis, Oregon, 2011. AUAI Press.\n\n[9] P. Dallaire, C. Besse, and B. Chaib-draa. Learning Gaussian Process Models from Uncertain Data. 16th International Conference on Neural Information Processing, 2008.\n\n[10] E. Solak, R. Murray-Smith, W.E. Leithead, D.J. Leith, and C.E. Rasmussen. Derivative observations in Gaussian Process models of dynamic systems. NIPS-03, pages 1033\u20131040, 2003.\n\n[11] Agathe Girard, Carl Edward Rasmussen, Joaquin Quinonero Candela, and Roderick Murray-Smith. Gaussian Process priors with uncertain inputs - application to multiple-step ahead time series forecasting. 
Advances in Neural Information Processing Systems 16, 2003.", "award": [], "sourceid": 780, "authors": [{"given_name": "Andrew", "family_name": "Mchutchon", "institution": null}, {"given_name": "Carl", "family_name": "Rasmussen", "institution": null}]}