{"title": "Dependent Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 217, "page_last": 224, "abstract": null, "full_text": "Dependent Gaussian Processes\n\nPhillip Boyle and Marcus Frean\nSchool of Mathematical and Computing Sciences\nVictoria University of Wellington,\nWellington, New Zealand\n{pkboyle,marcus}@mcs.vuw.ac.nz\n\nAbstract\n\nGaussian processes are usually parameterised in terms of their covariance functions. However, this makes it difficult to deal with multiple outputs, because ensuring that the covariance matrix is positive definite is problematic. An alternative formulation is to treat Gaussian processes as white noise sources convolved with smoothing kernels, and to parameterise the kernel instead. Using this, we extend Gaussian processes to handle multiple, coupled outputs.\n\n1    Introduction\n\nGaussian process regression has many desirable properties, such as ease of obtaining and expressing uncertainty in predictions, the ability to capture a wide variety of behaviour through a simple parameterisation, and a natural Bayesian interpretation [15, 4, 9]. Because of this they have been suggested as replacements for supervised neural networks in non-linear regression [8, 18], extended to handle classification tasks [11, 17, 6], and used in a variety of other ways (e.g. [16, 14]). A Gaussian process (GP), as a set of jointly Gaussian random variables, is completely characterised by a covariance matrix with entries determined by a covariance function. Traditionally, such models have been specified by parameterising the covariance function (i.e. 
a function specifying the covariance of output values given any two input vectors). In general this needs to be a positive definite function to ensure positive definiteness of the covariance matrix.\n\nMost GP implementations model only a single output variable. Attempts to handle multiple outputs generally involve using an independent model for each output (a method known as multi-kriging [18]), but such models cannot capture the structure in outputs that covary. As an example, consider the two tightly coupled outputs shown at the top of Figure 2, in which one output is simply a shifted version of the other. Here we have detailed knowledge of output 1, but sampling of output 2 is sparse. A model that treats the outputs as independent cannot exploit their obvious similarity; intuitively, we should make predictions about output 2 using what we learn from both output 1 and output 2.\n\nJoint predictions are possible (e.g. co-kriging [3]) but are problematic in that it is not clear how covariance functions should be defined [5]. Although there are many known positive definite autocovariance functions (e.g. Gaussians and many others [1, 9]), it is difficult to define cross-covariance functions that result in positive definite covariance matrices. Contrast this to neural network modelling, where the handling of multiple outputs is routine.\n\nAn alternative to directly parameterising covariance functions is to treat GPs as the outputs of stable linear filters. For a linear filter, the output in response to an input $x(t)$ is $y(t) = h(t) * x(t) = \\int_{-\\infty}^{\\infty} h(t - \\tau)\\, x(\\tau)\\, d\\tau$, where $h(t)$ defines the impulse response of the filter and $*$ denotes convolution. Provided the linear filter is stable and $x(t)$ is Gaussian white noise, the output process $y(t)$ is necessarily a Gaussian process. 
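\n\nTo make the link between an impulse response and a covariance function explicit, a short worked derivation may help. It is only a sketch, and it assumes zero-mean white noise of unit intensity, $E[x(\\tau)\\, x(\\tau')] = \\delta(\\tau - \\tau')$, a normalisation that the text does not state:\n\n$$\\mathrm{cov}\\big(y(t),\\, y(t + d)\\big) = \\int\\!\\!\\int h(t - \\tau)\\, h(t + d - \\tau')\\, E[x(\\tau)\\, x(\\tau')]\\, d\\tau\\, d\\tau' = \\int_{-\\infty}^{\\infty} h(u)\\, h(u + d)\\, du$$\n\nUnder that assumption the covariance depends only on the separation $d$, so the output is stationary, and because the result is the autocorrelation of $h$ it is positive semi-definite by construction. Choosing the impulse response therefore fixes a valid covariance function.\n\n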
It is also possible to characterise p-dimensional stable linear filters, with M inputs and N outputs, by a set of $M \\times N$ impulse responses. In general, the resulting N outputs are dependent Gaussian processes. Now we can model multiple dependent outputs by parameterising the set of impulse responses for a multiple output linear filter, and inferring the parameter values from data that we observe. Instead of specifying and parameterising positive definite covariance functions, we now specify and parameterise impulse responses. The only restriction is that the filter be linear and stable, and this is achieved by requiring the impulse responses to be absolutely integrable.\n\nConstructing GPs by stimulating linear filters with Gaussian noise is equivalent to constructing GPs through kernel convolutions. A Gaussian process $V(s)$ can be constructed over a region $S$ by convolving a continuous white noise process $X(s)$ with a smoothing kernel $h(s)$, $V(s) = h(s) * X(s)$ for $s \\in S$ [7]. To this can be added a second white noise source, representing measurement uncertainty, and together this gives a model for observations $Y$. This view of GPs is shown in graphical form in Figure 1(a). The convolution approach has been used to formulate flexible nonstationary covariance functions [13, 12]. Furthermore, this idea can be extended to model multiple dependent output processes by assuming a single common latent process [7]. For example, two dependent processes $V_1(s)$ and $V_2(s)$ are constructed from a shared dependence on $X(s)$ for $s \\in S_0$, as follows\n\n$$V_1(s) = \\int_{S_0 \\cup S_1} h_1(s - \\lambda)\\, X(\\lambda)\\, d\\lambda \\qquad \\text{and} \\qquad V_2(s) = \\int_{S_0 \\cup S_2} h_2(s - \\lambda)\\, X(\\lambda)\\, d\\lambda$$\n\nwhere $S = S_0 \\cup S_1 \\cup S_2$ is a union of disjoint subspaces. $V_1(s)$ is dependent on $X(s), s \\in S_1$ but not $X(s), s \\in S_2$. 
Similarly, $V_2(s)$ is dependent on $X(s), s \\in S_2$ but not $X(s), s \\in S_1$. This allows $V_1(s)$ and $V_2(s)$ to possess independent components.\n\nIn this paper, we model multiple outputs somewhat differently to [7]. Instead of assuming a single latent process defined over a union of subspaces, we assume multiple latent processes, each defined over $\\mathbb{R}^p$. Some outputs may be dependent through a shared reliance on common latent processes, and some outputs may possess unique, independent features through a connection to a latent process that affects no other output.\n\n2    Two Dependent Outputs\n\nConsider two outputs $Y_1(s)$ and $Y_2(s)$ over a region $\\mathbb{R}^p$, where $s \\in \\mathbb{R}^p$. We have $N_1$ observations of output 1 and $N_2$ observations of output 2, giving us data $D_1 = \\{s_{1,i}, y_{1,i}\\}_{i=1}^{N_1}$ and $D_2 = \\{s_{2,i}, y_{2,i}\\}_{i=1}^{N_2}$. We wish to learn a model from the combined data $D = \\{D_1, D_2\\}$ in order to predict $Y_1(s')$ or $Y_2(s')$, for $s' \\in \\mathbb{R}^p$. As shown in Figure 1(b), we can model each output as the linear sum of three stationary Gaussian processes. One of these ($V$) arises from a noise source unique to that output, under convolution with a kernel $h$. A second ($U$) is similar, but arises from a separate noise source $X_0$ that influences both outputs (although via different kernels, $k$). The third is additive noise as before.\n\nThus we have $Y_i(s) = U_i(s) + V_i(s) + W_i(s)$, where $W_i(s)$ is a stationary Gaussian white noise process with variance $\\sigma_i^2$; $X_0(s)$, $X_1(s)$ and $X_2(s)$ are independent stationary Gaussian white noise processes; and $U_1(s)$, $U_2(s)$, $V_1(s)$ and $V_2(s)$ are Gaussian processes given by $U_i(s) = k_i(s) * X_0(s)$ and $V_i(s) = h_i(s) * X_i(s)$.\n\nFigure 1: (a) Gaussian process prior for a single output. 
The output $Y$ is the sum of two Gaussian white noise processes, one of which has been convolved ($*$) with a kernel ($h$). (b) The model for two dependent outputs $Y_1$ and $Y_2$. All of $X_0$, $X_1$, $X_2$ and the \"noise\" contributions are independent Gaussian white noise sources. Notice that if $X_0$ is forced to zero, $Y_1$ and $Y_2$ become independent processes as in (a); we use this as a control model.\n\nThe $k_1, k_2, h_1, h_2$ are parameterised Gaussian kernels where $k_1(s) = v_1 \\exp\\left(-\\frac{1}{2} s^T A_1 s\\right)$, $k_2(s) = v_2 \\exp\\left(-\\frac{1}{2} (s - \\mu)^T A_2 (s - \\mu)\\right)$, and $h_i(s) = w_i \\exp\\left(-\\frac{1}{2} s^T B_i s\\right)$. Note that $k_2(s)$ is offset from zero by $\\mu$ to allow modelling of outputs that are coupled and translated relative to one another.\n\nWe wish to derive the set of functions $C^Y_{ij}(d)$ that define the autocovariance ($i = j$) and cross-covariance ($i \\neq j$) between the outputs $i$ and $j$, for a given separation $d$ between arbitrary inputs $s_a$ and $s_b$. By solving a convolution integral, $C^Y_{ij}(d)$ can be expressed in a closed form [2], and is fully determined by the parameters of the Gaussian kernels and the noise variances $\\sigma_1^2$ and $\\sigma_2^2$ as follows:\n\n$$C^Y_{11}(d) = C^U_{11}(d) + C^V_{11}(d) + \\delta_{ab}\\sigma_1^2, \\qquad C^Y_{12}(d) = C^U_{12}(d)$$\n\n$$C^Y_{22}(d) = C^U_{22}(d) + C^V_{22}(d) + \\delta_{ab}\\sigma_2^2, \\qquad C^Y_{21}(d) = C^U_{21}(d)$$\n\nwhere $\\delta_{ab}$ is 1 when $s_a$ and $s_b$ are the same observation and 0 otherwise, and\n\n$$C^U_{ii}(d) = \\frac{\\pi^{p/2} v_i^2}{\\sqrt{|A_i|}} \\exp\\left(-\\frac{1}{4} d^T A_i d\\right)$$\n\n$$C^U_{12}(d) = \\frac{(2\\pi)^{p/2} v_1 v_2}{\\sqrt{|A_1 + A_2|}} \\exp\\left(-\\frac{1}{2} (d - \\mu)^T \\Sigma (d - \\mu)\\right)$$\n\n$$C^U_{21}(d) = \\frac{(2\\pi)^{p/2} v_1 v_2}{\\sqrt{|A_1 + A_2|}} \\exp\\left(-\\frac{1}{2} (d + \\mu)^T \\Sigma (d + \\mu)\\right) = C^U_{12}(-d)$$\n\n$$C^V_{ii}(d) = \\frac{\\pi^{p/2} w_i^2}{\\sqrt{|B_i|}} \\exp\\left(-\\frac{1}{4} d^T B_i d\\right)$$\n\nwith $\\Sigma = A_1 (A_1 + A_2)^{-1} A_2 = A_2 (A_1 + A_2)^{-1} A_1$.\n\nGiven $C^Y_{ij}(d)$ then, we can construct the covariance matrices $C_{11}$, $C_{12}$, $C_{21}$, and $C_{22}$ as follows\n\n$$C_{ij} = \\begin{bmatrix} C^Y_{ij}(s_{i,1} - s_{j,1}) & \\cdots & C^Y_{ij}(s_{i,1} - s_{j,N_j}) \\\\ \\vdots & \\ddots & \\vdots \\\\ C^Y_{ij}(s_{i,N_i} - s_{j,1}) & \\cdots & C^Y_{ij}(s_{i,N_i} - s_{j,N_j}) \\end{bmatrix} \\tag{1}$$\n\nTogether these define the positive definite symmetric covariance matrix $C$ for the combined output data $D$:\n\n$$C = \\begin{bmatrix} C_{11} & C_{12} \\\\ C_{21} & C_{22} \\end{bmatrix} \\tag{2}$$\n\nWe define a set of hyperparameters $\\Theta$ that parameterise $\\{v_1, v_2, w_1, w_2, A_1, A_2, B_1, B_2, \\mu, \\sigma_1, \\sigma_2\\}$. Now, we can calculate the log-likelihood\n\n$$L = -\\frac{1}{2} \\log |C| - \\frac{1}{2} y^T C^{-1} y - \\frac{N_1 + N_2}{2} \\log 2\\pi$$\n\nwhere $y^T = [\\, y_{1,1} \\cdots y_{1,N_1} \\;\\; y_{2,1} \\cdots y_{2,N_2} \\,]$ and $C$ is a function of $\\Theta$ and $D$.\n\nLearning a model now corresponds to either maximising the log-likelihood $L$, or maximising the posterior probability $P(\\Theta \\mid D)$. 
Alternatively, we can simulate the predictive distribution for $y$ by taking samples from the joint $P(y, \\Theta \\mid D)$, using Markov Chain Monte Carlo methods [10].\n\nThe predictive distribution at a point $s'$ on output $i$ given $\\Theta$ and $D$ is Gaussian with mean $\\hat{y}$ and variance $\\sigma_{\\hat{y}}^2$ given by\n\n$$\\hat{y} = k^T C^{-1} y$$\n\nand\n\n$$\\sigma_{\\hat{y}}^2 = \\kappa - k^T C^{-1} k$$\n\nwhere $\\kappa = C^Y_{ii}(0) = v_i^2 + w_i^2 + \\sigma_i^2$ and\n\n$$k = \\left[\\, C^Y_{i1}(s' - s_{1,1}) \\; \\cdots \\; C^Y_{i1}(s' - s_{1,N_1}) \\;\\; C^Y_{i2}(s' - s_{2,1}) \\; \\cdots \\; C^Y_{i2}(s' - s_{2,N_2}) \\,\\right]^T$$\n\n2.1    Example 1 - Strongly dependent outputs over 1d input space\n\nConsider two outputs, observed over a 1d input space. Let $A_i = \\exp(f_i)$, $B_i = \\exp(g_i)$, and $\\sigma_i = \\exp(\\beta_i)$. Our hyperparameters are $\\Theta = \\{v_1, v_2, w_1, w_2, f_1, f_2, g_1, g_2, \\mu, \\beta_1, \\beta_2\\}$ where each element of $\\Theta$ is a scalar. As in [2] we set Gaussian priors over $\\Theta$.\n\nWe generated $N = 48$ data points by taking $N_1 = 32$ samples from output 1 and $N_2 = 16$ samples from output 2. 
The samples from output 1 were linearly spaced in the interval $[-1, 1]$ and those from output 2 were uniformly spaced in the region $[-1, -0.15] \\cup [0.65, 1]$. All samples were taken under additive Gaussian noise, $\\sigma = 0.025$. To build our model, we maximised $P(\\Theta \\mid D) \\propto P(D \\mid \\Theta)\\, P(\\Theta)$ using a multistart conjugate gradient algorithm, with 5 starts, sampling from $P(\\Theta)$ for initial conditions.\n\nThe resulting dependent model is shown in Figure 2 along with an independent (control) model with no coupling (see Figure 1). Observe that the dependent model has learned the coupling and translation between the outputs, and has filled in output 2 where samples are missing. The control model cannot achieve such infilling as it consists of two independent Gaussian processes.\n\n2.2    Example 2 - Strongly dependent outputs over 2d input space\n\nConsider two outputs, observed over a 2d input space. Let\n\n$$A_i = \\frac{1}{\\alpha_i^2} I, \\qquad B_i = \\frac{1}{\\tau_i^2} I,$$\n\nwhere $I$ is the identity matrix.\n\nFurthermore, let $\\sigma_i = \\exp(\\beta_i)$. 
In this toy example, we set $\\mu = 0$, so our hyperparameters become $\\Theta = \\{v_1, v_2, w_1, w_2, \\alpha_1, \\alpha_2, \\tau_1, \\tau_2, \\beta_1, \\beta_2\\}$ where each element of $\\Theta$ is a scalar. Again, we set Gaussian priors over $\\Theta$.\n\n[Figure 2 image. Panels: Output 1 - independent model; Output 2 - independent model; Output 1 - dependent model; Output 2 - dependent model. Horizontal axes run from -1 to 1 and vertical axes from -0.2 to 0.5. Legend: True function, Model mean.]\n\nFigure 2: Strongly dependent outputs where output 2 is simply a translated version of output 1, with independent Gaussian noise, $\\sigma = 0.025$. The solid lines represent the model, the dotted lines are the true function, and the dots are samples. The shaded regions represent $1\\sigma$ error bars for the model prediction. (top) Independent model of the two outputs. (bottom) Dependent model.\n\nWe generated 117 data points by taking 81 samples from output 1 and 36 samples from output 2. Both sets of samples formed uniform lattices over the region $[-0.9, 0.9] \\times [-0.9, 0.9]$ and were taken with additive Gaussian noise, $\\sigma = 0.025$. To build our model, we maximised $P(\\Theta \\mid D)$ as before.\n\nThe dependent model is shown in Figure 3 along with an independent control model. The dependent model has filled in output 2 where samples are missing. Again, the control model cannot achieve such infilling as it consists of two independent Gaussian processes.\n\n3    Time Series Forecasting\n\nConsider the observation of multiple time series, where some of the series lead or predict the others. We simulated a set of three time series for 100 steps each (Figure 4) where series 3 was positively coupled to a lagged version of series 1 (lag = 0.5) and negatively coupled to a lagged version of series 2 (lag = 0.6). Given the 300 observations, we built a dependent GP model of the three time series and compared it with independent GP models. 
The dependent GP model incorporated a prior belief that series 3 was coupled to series 1 and 2, with the lags unknown. The independent GP model assumed no coupling between its outputs, and consisted of three independent GP models. We queried the models for forecasts of the future 10 values of series 3. It is clear from Figure 4 that the dependent GP model does a far better job at forecasting the dependent series 3. The independent model becomes inaccurate after just a few time steps into the future. This inaccuracy is expected as knowledge of series 1 and 2 is required to accurately predict series 3. The dependent GP model performs well as it has learned that series 3 is positively coupled to a lagged version of series 1 and negatively coupled to a lagged version of series 2.\n\nFigure 3: Strongly dependent outputs where output 2 is simply a copy of output 1, with independent Gaussian noise. (top) Independent model of the two outputs. (bottom) Dependent model. Output 1 is modelled well by both models. Output 2 is modelled well only by the dependent model.\n\n4    Multiple Outputs and Non-stationary Kernels\n\nThe convolution framework described here for constructing GPs can be extended to build models capable of modelling $N$ outputs, each defined over a p-dimensional input space. In general, we can define a model where we assume $M$ independent Gaussian white noise processes $X_1(s) \\ldots X_M(s)$, $N$ outputs $U_1(s) \\ldots U_N(s)$, and $M \\times N$ kernels $\\{\\{k_{mn}(s)\\}_{m=1}^{M}\\}_{n=1}^{N}$ where $s \\in \\mathbb{R}^p$. 
The autocovariance ($i = j$) and cross-covariance ($i \\neq j$) functions between output processes $i$ and $j$ become\n\n$$C^U_{ij}(d) = \\sum_{m=1}^{M} \\int_{\\mathbb{R}^p} k_{mi}(s)\\, k_{mj}(s + d)\\, ds \\tag{3}$$\n\nand the matrix defined by equation 2 is extended in the obvious way.\n\nThe kernels used in (3) need not be Gaussian, and need not be spatially invariant, or stationary. We require kernels that are absolutely integrable, $\\int_{-\\infty}^{\\infty} \\cdots \\int_{-\\infty}^{\\infty} |k(s)|\\, d^p s < \\infty$. This provides a large degree of flexibility, and is an easy condition to uphold. It would seem that an absolutely integrable kernel would be easier to define and parameterise than a positive definite function. On the other hand, we require a closed form of $C^Y_{ij}(d)$ and this may not be attainable for some non-Gaussian kernels.\n\n[Figure 4 image. Three stacked panels labelled Series 1, Series 2 and Series 3. Horizontal axis runs from 0 to 8 and vertical axes from -3 to 2.]\n\nFigure 4: Three coupled time series, where series 1 and series 2 predict series 3. Forecasting for series 3 begins after 100 time steps where $t = 7.8$. The dependent model forecast is shown with a solid line, and the independent (control) forecast is shown with a broken line. 
The dependent model does a far better job at forecasting the next 10 steps of series 3 (black dots).\n\n5    Conclusion\n\nWe have shown how the Gaussian Process framework can be extended to multiple output variables without assuming them to be independent. Multiple processes can be handled by inferring convolution kernels instead of covariance functions. This makes it easy to construct the required positive definite covariance matrices for covarying outputs.\n\nOne application of this work is to learn the spatial translations between outputs. However, the framework developed here is more general than this, as it can model data that arises from multiple sources, only some of which are shared. Our examples show the infilling of sparsely sampled regions that becomes possible in a model that permits coupling between outputs. Another application is the forecasting of dependent time series. Our example shows how learning couplings between multiple time series may aid in forecasting, particularly when the series to be forecast is dependent on previous or current values of other series.\n\nDependent Gaussian processes should be particularly valuable in cases where one output is expensive to sample, but covaries strongly with a second that is cheap. By inferring both the coupling and the independent aspects of the data, the cheap observations can be used as a proxy for the expensive ones.\n\nReferences\n\n[1] ABRAHAMSEN, P. A review of gaussian random fields and correlation functions. Tech. Rep. 917, Norwegian Computing Center, Box 114, Blindern, N-0314 Oslo, Norway, 1997.\n\n[2] BOYLE, P., AND FREAN, M. Multiple-output gaussian process regression. Tech. rep., Victoria University of Wellington, 2004.\n\n[3] CRESSIE, N. Statistics for Spatial Data. Wiley, 1993.\n\n[4] GIBBS, M. Bayesian Gaussian Processes for Classification and Regression. 
PhD thesis, University of Cambridge, Cambridge, U.K., 1997.\n\n[5] GIBBS, M., AND MACKAY, D. J. Efficient implementation of gaussian processes. www.inference.phy.cam.ac.uk/mackay/abstracts/gpros.html, 1996.\n\n[6] GIBBS, M. N., AND MACKAY, D. J. Variational gaussian process classifiers. IEEE Trans. on Neural Networks 11, 6 (2000), 1458-1464.\n\n[7] HIGDON, D. Space and space-time modelling using process convolutions. In Quantitative methods for current environmental issues (2002), C. Anderson, V. Barnett, P. Chatwin, and A. El-Shaarawi, Eds., Springer Verlag, pp. 37-56.\n\n[8] MACKAY, D. J. Gaussian processes: A replacement for supervised neural networks? In NIPS97 Tutorial, 1997.\n\n[9] MACKAY, D. J. Information theory, inference, and learning algorithms. Cambridge University Press, 2003.\n\n[10] NEAL, R. Probabilistic inference using markov chain monte carlo methods. Tech. Report CRG-TR-93-1, Dept. of Computer Science, Univ. of Toronto, 1993.\n\n[11] NEAL, R. Monte carlo implementation of gaussian process models for bayesian regression and classification. Tech. Rep. CRG-TR-97-2, Dept. of Computer Science, Univ. of Toronto, 1997.\n\n[12] PACIOREK, C. Nonstationary Gaussian processes for regression and spatial modelling. PhD thesis, Carnegie Mellon University, Pittsburgh, Pennsylvania, U.S.A., 2003.\n\n[13] PACIOREK, C., AND SCHERVISH, M. Nonstationary covariance functions for gaussian process regression. Submitted to NIPS, 2004.\n\n[14] RASMUSSEN, C., AND KUSS, M. Gaussian processes in reinforcement learning. In Advances in Neural Information Processing Systems (2004), vol. 16.\n\n[15] RASMUSSEN, C. E. Evaluation of Gaussian Processes and other methods for Non-Linear Regression. PhD thesis, Graduate Department of Computer Science, University of Toronto, 1996.\n\n[16] TIPPING, M. E., AND BISHOP, C. M. Bayesian image super-resolution. 
In Advances in Neural Information Processing Systems (2002), S. Becker, S. Thrun, and K. Obermayer, Eds., vol. 15, pp. 1303-1310.\n\n[17] WILLIAMS, C. K., AND BARBER, D. Bayesian classification with gaussian processes. IEEE Trans. Pattern Analysis and Machine Intelligence 20, 12 (1998), 1342-1351.\n\n[18] WILLIAMS, C. K., AND RASMUSSEN, C. E. Gaussian processes for regression. In Advances in Neural Information Processing Systems (1996), D. Touretzky, M. Mozer, and M. Hasselmo, Eds., vol. 8.\n", "award": [], "sourceid": 2561, "authors": [{"given_name": "Phillip", "family_name": "Boyle", "institution": null}, {"given_name": "Marcus", "family_name": "Frean", "institution": null}]}