{"title": "Bayesian Alignments of Warped Multi-Output Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 6995, "page_last": 7004, "abstract": "We propose a novel Bayesian approach to modelling nonlinear alignments of time series based on latent shared information. We apply the method to the real-world problem of finding common structure in the sensor data of wind turbines introduced by the underlying latent and turbulent wind field. The proposed model allows for both arbitrary alignments of the inputs and non-parametric output warpings to transform the observations. This gives rise to multiple deep Gaussian process models connected via latent generating processes. We present an efficient variational approximation based on nested variational compression and show how the model can be used to extract shared information between dependent time series, recovering an interpretable functional decomposition of the learning problem. We show results for an artificial data set and real-world data of two wind turbines.", "full_text": "Bayesian Alignments of Warped Multi-Output\n\nGaussian Processes\n\nMarkus Kaiser\nSiemens AG\n\nTechnical University of Munich\nmarkus.kaiser@siemens.com\n\nClemens Otte\nSiemens AG\n\nclemens.otte@siemens.com\n\nThomas Runkler\n\nSiemens AG\n\nTechnical University of Munich\nthomas.runkler@siemens.com\n\nCarl Henrik Ek\n\nUniversity of Bristol\n\ncarlhenrik.ek@bristol.ac.uk\n\nAbstract\n\nWe propose a novel Bayesian approach to modelling nonlinear alignments of time\nseries based on latent shared information. We apply the method to the real-world\nproblem of \ufb01nding common structure in the sensor data of wind turbines introduced\nby the underlying latent and turbulent wind \ufb01eld. The proposed model allows\nfor both arbitrary alignments of the inputs and non-parametric output warpings\nto transform the observations. 
This gives rise to multiple deep Gaussian process\nmodels connected via latent generating processes. We present an ef\ufb01cient varia-\ntional approximation based on nested variational compression and show how the\nmodel can be used to extract shared information between dependent time series,\nrecovering an interpretable functional decomposition of the learning problem. We\nshow results for an arti\ufb01cial data set and real-world data of two wind turbines.\n\n1 Introduction\n\nMany real-world systems are inherently hierarchical and connected. Ideally, a machine learning\nmethod should model and recognize such dependencies. Take wind power production, which is\none of the major providers for renewable energy today, as an example: To optimize the ef\ufb01ciency\nof a wind turbine the speed and pitch have to be controlled according to the local wind conditions\n(speed and direction). In a wind farm turbines are typically equipped with sensors for wind speed and\ndirection. The goal is to use these sensor data to produce accurate estimates and forecasts of the wind\nconditions at every turbine in the farm. For the ideal case of a homogeneous and very slowly changing\nwind \ufb01eld, the wind conditions at each geometrical position in a wind farm can be estimated using\nthe propagation times (time warps) computed from geometry, wind speed, and direction [21, 4, 18].\nIn the real world, however, wind \ufb01elds are not homogeneous, exhibit global and local turbulences,\nand interfere with the turbines and the terrain inside and outside the farm and further, sensor faults\nmay lead to data loss. This makes it extremely dif\ufb01cult to construct accurate analytical models of\nwind propagation in a farm. Also, standard approaches for extracting such information from data,\ne.g. 
generalized time warping [24], fail at this task because they rely on a high signal to noise ratio.\nInstead, we want to construct Bayesian nonlinear dynamic data based models for wind conditions\nand warpings which handle the stochastic nature of the system in a principled manner.\nIn this paper, we look at a generalization of this type of problem and propose a novel Bayesian\napproach to \ufb01nding nonlinear alignments of time series based on latent shared information. We view\nthe power production of different wind turbines as the outputs of a multi-output Gaussian process\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\f(MO-GP) [1] which models the latent wind fronts. We embed this model in a hierarchy, adding a layer\nof non-linear alignments on top and a layer of non-linear warpings [19, 14] below which increases\n\ufb02exibility and encodes the original generative process. We show how the resulting model can be\ninterpreted as a group of deep Gaussian processes with the added bene\ufb01t of covariances between\ndifferent outputs. The imposed structure is used to formulate prior knowledge in a principled manner,\nrestrict the representational power to physically plausible models and recover the desired latent wind\nfronts and relative alignments. The presented model can be interpreted as a group of D deep GPs all\nof which share one layer which is a MO-GP. This MO-GP acts as an interface to share information\nbetween the different GPs which are otherwise conditionally independent.\nThe paper has the following contributions: In Section 2, we propose a hierarchical, warped and aligned\nmulti-output Gaussian process (AMO-GP). In Section 3, we present an ef\ufb01cient learning scheme\nvia an approximation to the marginal likelihood which allows us to fully exploit the regularization\nprovided by our structure, yielding highly interpretable results. 
We show these properties for an artificial data set and for real-world data of two wind turbines in Section 4.

2 Model Definition

We are interested in formulating shared priors over a set of functions $\{f_d\}_{d=1}^D$ using GPs, thereby directly parameterizing their interdependencies. In a traditional GP setting, multiple outputs are considered conditionally independent given the inputs, which significantly reduces the computational cost but also prevents the utilization of shared information. Such interdependencies can be formulated via convolution processes (CPs) as proposed by Boyle and Frean [5], a generalization of the linear model of coregionalization (LMC) [13, 7]. In the CP framework, the output functions are the result of a convolution of the latent processes $w_r$ with smoothing kernel functions $T_{d,r}$ for each output $f_d$, defined as

$$f_d(x) = \sum_{r=1}^{R} \int T_{d,r}(x - z) \cdot w_r(z) \,\mathrm{d}z. \tag{1}$$

In this model, the convolutions of the latent processes generating the different outputs are all performed around the same point $x$. We generalize this by allowing different alignments of the observations which depend on the position in the input space. This allows us to model the changing relative interaction times for the different latent wind fronts as described in the introduction. We also assume that the dependent functions $f_d$ are latent themselves and that the data we observe are generated via independent noisy nonlinear transformations of their values. Every function $f_d$ is augmented with an alignment function $a_d$ and a warping $g_d$ on which we place independent GP priors.
For simplicity, we assume that all outputs are evaluated at the same positions $X = \{x_n\}_{n=1}^N$. This can easily be generalized to different input sets for every output. In our application, the $x_n$ are one-dimensional time indices.
However, since the model can be generalized to multi-dimensional inputs, we do not restrict ourselves to the one-dimensional case. We note that in the multi-dimensional case, reasoning about priors on alignments can be challenging. We call the observations associated with the d-th function $y_d$ and use the stacked vector $y = (y_1, \dots, y_D)$ to collect the data of all outputs. The final model is then given by

$$y_d = g_d(f_d(a_d(X))) + \epsilon_d, \tag{2}$$

where $\epsilon_d \sim \mathcal{N}(0, \sigma_{y,d}^2 I)$ is a noise term. The functions are applied element-wise. This encodes the generative process described above: for every turbine $y_d$, observations at positions $X$ are generated by first aligning to the latent wind fronts using $a_d$, applying the front in $f_d$, imposing turbine-specific components $g_d$ and adding noise in $\epsilon_d$.
We assume independence between $a_d$ and $g_d$ across outputs and apply GP priors of the form $a_d \sim \mathcal{GP}(\mathrm{id}, k_{a,d})$ and $g_d \sim \mathcal{GP}(\mathrm{id}, k_{g,d})$. By setting the prior mean to the identity function $\mathrm{id}(x) = x$, the standard CP model is our default assumption. During learning, the model can choose the different $a_d$ and $g_d$ in a way to reveal the independent shared latent processes $\{w_r\}_{r=1}^R$ on which we also place GP priors $w_r \sim \mathcal{GP}(0, k_{u,r})$. Similar to Boyle and Frean [5], we assume the latent processes to be independent white noise processes by setting $\operatorname{cov}[w_r(z), w_{r'}(z')] = \delta_{rr'}\delta_{zz'}$. Under this prior, the $f_d$ are also GPs with zero mean and

$$\operatorname{cov}[f_d(x), f_{d'}(x')] = \sum_{r=1}^{R} \int T_{d,r}(x - z)\, T_{d',r}(x' - z) \,\mathrm{d}z.$$

Figure 1: The graphical model of AMO-GP with variational parameters (blue). A CP, informed by R latent processes, models shared information between multiple data sets with nonlinear alignments and warpings.
This CP connects multiple deep GPs\nthrough a shared layer.\n\nFigure 2: An arti\ufb01cial example of hierarchi-\ncal composite data with multiple observa-\ntions of shared latent information. This\nhierarchy generates two data sets using\na dampened sine function which is never\nobserved directly.\n\nUsing the squared exponential kernel for all Td,r, the integral can be shown to have a closed form\nsolution. With {\u03c3d,r, (cid:96)d,r} denoting the kernel hyper parameters associated with Td,r, it is given by\n\n(cid:32)\n\n(cid:33)\n\nK(cid:88)\n\nk=1\n\n(xk \u2212 x(cid:48)\n\u02c6(cid:96)2\nd,d(cid:48),r,k\n\nk)2\n\n,\n\n(3)\n\ncov[fd(x), fd(cid:48)(x(cid:48))] =\n\n(2\u03c0) K\n\n2 \u03c3d,r\u03c3d(cid:48),r\n\nr=1\n\nk=1\n\n\u02c6(cid:96)\u22121\nd,d(cid:48),r,k\n\nexp\n\n\u2212 1\n2\n\nwhere x is K-dimensional and \u02c6(cid:96)d,d(cid:48),r,k =\n\n(cid:96)2\nd,r,k + (cid:96)2\n\nd(cid:48),r,k.\n\nR(cid:88)\n\n(cid:81)K\n(cid:113)\n\n3 Variational Approximation\n\nSince exact inference in this model is intractable, we present a variational approximation to the\nmodel\u2019s marginal likelihood in this section. A detailed derivation of the variational bound can be\nfound in Appendix A. Analogously to y, we denote the random vectors which contain the function\nvalues of the respective functions and outputs as a and f. The joint probability distribution of the\ndata can then be written as\np(y, f , a| X) =\n\nD(cid:89)\n\nd=1\n\np(f | a)\n\np(yd | fd) p(ad | X),\n\nad | X \u223c N (X, Ka,d + \u03c32\nf | a \u223c N (0, Kf + \u03c32\nf I),\nyd | fd \u223c N (fd, Kg,d + \u03c32\n\na,dI),\n\ny,dI).\n\n(4)\n\nHere, we use K to refer to the Gram matrices corresponding to the respective GPs. 
All but the convolution processes factorize over the different levels of the model as well as the different outputs.

3.1 Variational Lower Bound

To approximate a single deep GP, that is a single string of GPs stacked on top of each other, Hensman and Lawrence [11] proposed nested variational compression in which every GP in the hierarchy is handled independently. In order to arrive at their lower bound they make two variational approximations. First, they consider a variational approximation $q(\hat{a}, u) = p(\hat{a} \mid u)\, q(u)$ to the true posterior of a single GP first introduced by Titsias [22]. In this approximation, the original model is augmented with inducing variables $u$ together with their inducing points $Z$ which are assumed to be latent observations of the same function and are thus jointly Gaussian with the observed data. In contrast to [22], the distribution $q(u)$ is not chosen optimally but parameterized in closed form as a Gaussian $q(u) = \mathcal{N}(u \mid m, S)$ and optimized. This gives rise to the Scalable Variational GP presented in [10]. Second, in order to apply this variational bound for the individual GPs recursively, uncertainties have to be propagated through subsequent layers and inter-layer cross-dependencies are avoided using another variational approximation.
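The explicitly parameterized $q(u) = \mathcal{N}(m, S)$ yields the standard sparse-GP predictive that serves as the building block of the bound. A sketch under an RBF kernel; the function names and values are ours, not the paper's:

```python
import numpy as np

def rbf(a, b, var=1.0, ls=1.0):
    """Squared exponential kernel matrix between 1-D input arrays."""
    d = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * d**2 / ls**2)

def svgp_predict(x_star, Z, m, S, var=1.0, ls=1.0, jitter=1e-8):
    """Predictive q(f*) of a sparse GP with inducing points Z and
    variational posterior q(u) = N(m, S):
        mean = K_{*u} K_{uu}^{-1} m
        cov  = K_{**} - K_{*u} K_{uu}^{-1} (K_{uu} - S) K_{uu}^{-1} K_{u*}
    """
    Kuu = rbf(Z, Z, var, ls) + jitter * np.eye(len(Z))
    Ksu = rbf(x_star, Z, var, ls)
    Kss = rbf(x_star, x_star, var, ls)
    A = np.linalg.solve(Kuu, Ksu.T).T          # K_{*u} K_{uu}^{-1}
    mean = A @ m
    cov = Kss - A @ (Kuu - S) @ A.T
    return mean, cov
```

Setting $m = 0$ and $S = K_{uu}$ recovers the prior, while $S \to 0$ collapses the predictive variance at the inducing points, which is a useful sanity check on any implementation.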
The variational lower bound for the AMO-GP is given by

$$\begin{aligned}
\log p(y \mid X, Z, u) \ge{} & \sum_{d=1}^{D} \log \mathcal{N}\left(y_d \,\middle|\, \Psi_{g,d} K_{u_{g,d} u_{g,d}}^{-1} m_{g,d},\, \sigma_{y,d}^2 I\right) \\
& - \frac{1}{2\sigma_f^2} \left(\psi_f - \operatorname{tr}\left(\Phi_f K_{u_f u_f}^{-1}\right)\right) - \sum_{d=1}^{D} \frac{1}{2\sigma_{y,d}^2} \left(\psi_{g,d} - \operatorname{tr}\left(\Phi_{g,d} K_{u_{g,d} u_{g,d}}^{-1}\right)\right) \\
& - \sum_{d=1}^{D} \operatorname{KL}(q(u_{a,d}) \,\|\, p(u_{a,d})) - \operatorname{KL}(q(u_f) \,\|\, p(u_f)) - \sum_{d=1}^{D} \operatorname{KL}(q(u_{y,d}) \,\|\, p(u_{y,d})) \\
& - \sum_{d=1}^{D} \frac{1}{2\sigma_{a,d}^2} \operatorname{tr}(\Sigma_{a,d}) - \frac{1}{2\sigma_f^2} \operatorname{tr}\left(\left(\Phi_f - \Psi_f^{\mathrm{T}} \Psi_f\right) K_{u_f u_f}^{-1} \left(m_f m_f^{\mathrm{T}} + S_f\right) K_{u_f u_f}^{-1}\right) \\
& - \sum_{d=1}^{D} \frac{1}{2\sigma_{y,d}^2} \operatorname{tr}\left(\left(\Phi_{g,d} - \Psi_{g,d}^{\mathrm{T}} \Psi_{g,d}\right) K_{u_{g,d} u_{g,d}}^{-1} \left(m_{g,d} m_{g,d}^{\mathrm{T}} + S_{g,d}\right) K_{u_{g,d} u_{g,d}}^{-1}\right),
\end{aligned} \tag{5}$$

where KL denotes the KL-divergence. A detailed derivation can be found in Appendix A. The bound contains one Gaussian fit term per output dimension and a series of regularization terms for every GP in the hierarchy. The KL-divergences connect the variational approximations to the prior and the different trace terms regularize the variances of the different GPs (for a detailed discussion see [11]). This bound depends on the hyper parameters of the kernel and likelihood $\{\ell, \sigma\}$ and the variational parameters $\{Z_{l,d}, m_{l,d}, S_{l,d} \mid l \in \{a, f, g\},\, d \in [D]\}$.
The bound can be calculated in $\mathcal{O}(NM^2)$ time and factorizes along the data points, which enables stochastic optimization. Since every one of the $N$ data points is associated with one of the $D$ outputs, the computational cost of the model is independent of $D$.
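The KL regularizers in (5) are divergences between Gaussians, $\operatorname{KL}(\mathcal{N}(m, S) \,\|\, \mathcal{N}(0, K_{uu}))$, which have the usual closed form. A sketch (function name ours):

```python
import numpy as np

def gauss_kl(m, S, K):
    """KL( N(m, S) || N(0, K) ) for M-dimensional Gaussians, as used for
    the inducing-variable regularizers in the bound:
        0.5 * ( tr(K^-1 S) + m^T K^-1 m - M + log|K| - log|S| )
    """
    M = len(m)
    trace_term = np.trace(np.linalg.solve(K, S))
    maha = m @ np.linalg.solve(K, m)
    _, logdet_K = np.linalg.slogdet(K)
    _, logdet_S = np.linalg.slogdet(S)
    return 0.5 * (trace_term + maha - M + logdet_K - logdet_S)
```

The divergence vanishes exactly when the variational posterior equals the prior and is strictly positive otherwise, which is what makes these terms act as regularizers pulling $q(u)$ towards the prior.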
Information is only shared between the different outputs using the inducing points in $f$. As the different outputs share a common function, increasing $D$ allows us to reduce the number of variational parameters per output, because the shared function can still be represented completely.
A central component of this bound is the set of expectations over kernel matrices, the three $\Psi$-statistics $\psi_f = \mathrm{E}_{q(a)}[\operatorname{tr}(K_{ff})]$, $\Psi_f = \mathrm{E}_{q(a)}[K_{fu}]$ and $\Phi_f = \mathrm{E}_{q(a)}[K_{uf} K_{fu}]$. Closed form solutions for these statistics depend on the choice of kernel and are known for specific kernels, such as linear or RBF kernels, for example shown in [8]. In the following subsection we will give closed form solutions for these statistics required in the shared CP-layer of our model.

3.2 Convolution Kernel Expectations

The uncertainty about the first layer is captured by the variational distribution of the latent alignments $a$ given by $q(a) = \mathcal{N}(\mu_a, \Sigma_a)$. Every aligned point in $a$ corresponds to one output of $f$ and ultimately to one of the $y_d$. Since the closed form of the multi-output kernel depends on the choice of outputs, we will use the notation $\hat{f}(a_n)$ to denote $f_d(a_n)$ such that $a_n$ is associated with output $d$.
For simplicity, we only consider one single latent process $w_r$. Since the latent processes are independent, the results can easily be generalized to multiple processes. Then, $\psi_f$ is given by

$$\psi_f = \mathrm{E}_{q(a)}[\operatorname{tr}(K_{ff})] = \sum_{n=1}^{N} \hat{\sigma}_{nn}^2. \tag{6}$$

Similar to the notation $\hat{f}(\cdot)$, we use the notation $\hat{\sigma}_{nn'}$ to mean the variance term associated with the covariance function $\operatorname{cov}[\hat{f}(a_n), \hat{f}(a_{n'})]$ as shown in (3). The expectation $\Psi_f = \mathrm{E}_{q(a)}[K_{fu}]$ connecting the alignments and the pseudo inputs is given by

$$(\Psi_f)_{ni} = \hat{\sigma}_{ni}^2 \sqrt{\frac{(\Sigma_a)_{nn}^{-1}}{\hat{\ell}_{ni} + (\Sigma_a)_{nn}^{-1}}} \exp\left(-\frac{1}{2} \frac{(\Sigma_a)_{nn}^{-1}\, \hat{\ell}_{ni}}{(\Sigma_a)_{nn}^{-1} + \hat{\ell}_{ni}} \left((\mu_a)_n - Z_i\right)^2\right), \tag{7}$$

where $\hat{\ell}_{ni}$ is the combined length scale corresponding to the same kernel as $\hat{\sigma}_{ni}$. Lastly, $\Phi_f = \mathrm{E}_{q(a)}[K_{uf} K_{fu}]$ connects alignments and pairs of pseudo inputs with the closed form

$$(\Phi_f)_{ij} = \sum_{n=1}^{N} \hat{\sigma}_{ni}^2 \hat{\sigma}_{nj}^2 \sqrt{\frac{(\Sigma_a)_{nn}^{-1}}{\hat{\ell}_{ni} + \hat{\ell}_{nj} + (\Sigma_a)_{nn}^{-1}}} \exp\left(-\frac{1}{2} \frac{\hat{\ell}_{ni} \hat{\ell}_{nj}}{\hat{\ell}_{ni} + \hat{\ell}_{nj}} (Z_i - Z_j)^2 - \frac{1}{2} \frac{(\Sigma_a)_{nn}^{-1} (\hat{\ell}_{ni} + \hat{\ell}_{nj})}{(\Sigma_a)_{nn}^{-1} + \hat{\ell}_{ni} + \hat{\ell}_{nj}} \left((\mu_a)_n - \frac{\hat{\ell}_{ni} Z_i + \hat{\ell}_{nj} Z_j}{\hat{\ell}_{ni} + \hat{\ell}_{nj}}\right)^2\right). \tag{8}$$

The $\Psi$-statistics factorize along the data and we only need to consider the diagonal entries of $\Sigma_a$. If all the data belong to the same output, the $\Psi$-statistics of the squared exponential kernel can be recovered as a special case. This case is used for the output-specific warpings $g$.

3.3 Model Interpretation

The graphical model shown in Figure 1 illustrates that the presented model can be interpreted as a group of D deep GPs all of which share one layer which is a CP. This CP acts as an interface to share information between the different GPs which are otherwise conditionally independent.
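Returning briefly to the $\Psi$-statistics of Section 3.2: the single-output squared-exponential special case mentioned there can be checked against a Monte Carlo estimate. A sketch with simplified scalar notation (the function name and parameterization are ours, not the paper's):

```python
import numpy as np

def psi1_rbf(mu, s2, Z, var=1.0, ls=1.0):
    """Closed-form E_{a ~ N(mu, s2)}[k_RBF(a, Z)] for a scalar alignment
    distribution -- one row of the Psi_1 statistic in the RBF special case.
    """
    denom = ls**2 + s2  # kernel length scale plus alignment variance
    return var * np.sqrt(ls**2 / denom) * np.exp(-0.5 * (mu - Z) ** 2 / denom)

# Monte Carlo check of the closed form for illustrative parameter values.
rng = np.random.default_rng(1)
mu, s2, z, var, ls = 0.3, 0.25, 1.0, 2.0, 0.7
closed = psi1_rbf(mu, s2, z, var, ls)
samples = rng.normal(mu, np.sqrt(s2), 400_000)
mc = np.mean(var * np.exp(-0.5 * (samples - z) ** 2 / ls**2))
```

Averaging the kernel over samples of the alignment reproduces the closed form up to Monte Carlo error; the same strategy extends to checking (7) and (8).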
This\nmodelling-choice introduces a new quality to the model when compared to standard deep GPs with\nmultiple output dimensions, since the latter are not able in principle to learn dependencies between the\ndifferent outputs. Compared to standard multi-output GPs, the AMO-GP introduces more \ufb02exibility\nwith respect to the shared information. CPs make strong assumptions about the relative alignments\nof the different outputs, that is, they assume constant time-offsets. The AMO-GP extends this by\nintroducing a principled Bayesian treatment of general nonlinear alignments ad on which we can\nplace informative priors derived from the problem at hand. Together with the warping layers gd, our\nmodel can learn to share knowledge in an informative latent space learnt from the data.\nAlternatively, this model can be interpreted as a shared and warped latent variable model with a\nvery speci\ufb01c prior: The indices X are part of the prior for the latent space ad(X) and specify a\nsense of order for the different data points y which is augmented with uncertainty by the alignment\nfunctions. Using this order, the convolution processes enforce the covariance structure for the different\ndatapoints speci\ufb01ed by the smoothing kernels.\nIn order to derive an inference scheme, we need the ability to propagate uncertainties about the\ncorrect alignments and latent shared information through subsequent layers. We adapted the approach\nof nested variational compression by Hensman and Lawrence [11], which is originally concerned\nwith a single deep GP. The approximation is expanded to handle multiple GPs at once, yielding\nthe bound in (5). The bound re\ufb02ects the dependencies of the different outputs as the sharing of\ninformation between the different deep GPs is approximated through the shared inducing variables\nuf,d. 
Our main contribution for the inference scheme is the derivation of a closed-form solution for the $\Psi$-statistics of the convolution kernel in (6) to (8).

4 Experiments

In this section we show how to apply the AMO-GP to the task of finding common structure in time series observations. In this setting, we observe multiple time series $\mathcal{T}_d = (X_d, y_d)$ and assume that there exist latent time series which determine the observations.
We will first apply the AMO-GP to an artificial data set in which we define a decomposed system of dependent time series by specifying a shared latent function generating the observations together with relative alignments and warpings for the different time series. We will show that our model is able to recover this decomposition from the training data and compare the results to other approaches of modeling the data.

(a) Shallow GP with RBF kernel. (b) Multi-Output GP with dependent RBF kernel. (c) Deep GP with RBF kernels. (d) AMO-GP with (dependent) RBF kernels.

Figure 3: A comparison of the AMO-GP with other GP models. The plots show mean predictions and a shaded area of two standard deviations. If available, the ground truth is displayed as a dashed line. Additional lines are noiseless samples drawn from the model. The shallow and deep GPs in Figures 3a and 3c model the data independently and revert back to the prior in y2. Because of the nonlinear alignment, a multi-output GP cannot model the data in Figure 3b. The AMO-GP in Figure 3d recovers the alignment and warping and shares information between the two outputs.
Then we focus on a real-world data set of a neighbouring pair of wind turbines in a wind farm, where the model is able to recover a representation of the latent prevailing wind condition and the relative timings of wind fronts at the two turbines.

4.1 Artificial data set

Our data set consists of two time series $\mathcal{T}_1$ and $\mathcal{T}_2$ generated by a dampened sine function. We choose the alignment of $\mathcal{T}_1$ and the warping of $\mathcal{T}_2$ to be the identity in order to prevent us from directly observing the latent function and apply a sigmoid warping to $\mathcal{T}_1$. The alignment of $\mathcal{T}_2$ is selected to be a quadratic function. Figure 2 shows a visualization of this decomposed system of dependent time series.

Table 1: Test-log-likelihoods for the models presented in Section 4.

Experiment   Test set              GP      MO-GP    DGP     AMO-GP (Ours)
Artificial   [0.7, 0.8]   ⊆ T1    -0.12   -0.053    0.025    1.54
             [0.35, 0.65] ⊆ T2    -0.19   -5.66    -0.30     0.72
Wind         [40, 45]     ⊆ T2    -4.42   -2.31    -1.80    -1.43
             [65, 75]     ⊆ T2    -7.26   -0.73    -1.93    -0.69

To obtain training data we uniformly sampled 500 points from the two time series and added Gaussian noise. We subsequently removed parts of the training sets to explore the generalization behaviour of our model, resulting in $|\mathcal{T}_1| = 450$ and $|\mathcal{T}_2| = 350$.
We use this setup to train our model using squared exponential kernels both in the conditionally independent GPs $a_d$ and $g_d$ and as smoothing kernels in $f$. We can always choose one alignment and one warping to be the identity function in order to constrain the shared latent spaces $a$ and $f$ and provide a reference which the other alignments and warpings are relative to.
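The decomposition of the artificial data set can be sketched generatively. The exact constants of the dampened sine, the sigmoid warping and the quadratic alignment are not given in the paper, so the values below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared latent function: a dampened sine, never observed directly.
# The damping rate and frequency are assumed for illustration.
latent = lambda x: np.exp(-1.5 * x) * np.sin(8.0 * np.pi * x)

# Decomposition used for the artificial data set:
#   T1: identity alignment, sigmoid output warping
#   T2: quadratic alignment, identity warping
a1 = lambda x: x
a2 = lambda x: x**2
g1 = lambda f: 1.0 / (1.0 + np.exp(-4.0 * f))  # sigmoid warping (slope assumed)
g2 = lambda f: f

# 500 uniformly sampled inputs with additive Gaussian noise, as in the paper.
X = rng.uniform(0.0, 1.0, 500)
noise = 0.05  # noise level assumed
y1 = g1(latent(a1(X))) + noise * rng.normal(size=X.size)
y2 = g2(latent(a2(X))) + noise * rng.normal(size=X.size)
```

Removing intervals from `y1` and `y2` then reproduces the generalization experiment with $|\mathcal{T}_1| = 450$ and $|\mathcal{T}_2| = 350$ training points.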
Since we assume\nour arti\ufb01cial data simulates a physical system, we apply the prior knowledge that the alignment\nand warping processes have slower dynamics compared to the shared latent function which should\ncapture most of the observed dynamics. To this end we applied priors to the ad and gd which prefer\nlonger length scales and smaller variances compared to f. Otherwise, the model could easily get\nstuck in local minima like choosing the upper two layers to be identity functions and model the time\nseries independently in the gd. Additionally, our assumption of identity mean functions prevents\npathological cases in which the complete model collapses to a constant function.\nFigure 3d shows the AMO-GP\u2019s recovered function decomposition and joint predictions. The model\nsuccessfully recovered a shared latent dampened sine function, a sigmoid warping for the \ufb01rst time\nseries and an approximate quadratic alignment function for the second time series. In Figures 3a\nto 3c, we show the training results of a standard GP, a multi-output GP and a three-layer deep GP\non the same data. For all of these models, we used RBF kernels and, in the case of the deep GP,\napplied priors similar to our model in order to avoid pathological cases. In Table 1 we report test\nlog-likelihoods for the presented models, which illustrate the qualitative differences between the\nmodels. Because all models are non-parametric and converge well, repeating the experiments with\ndifferent initializations leads to very similar likelihoods.\nBoth the standard GP and deep GP cannot learn dependencies between time series and revert back to\nthe prior where no data is available. The deep GP has learned that two layers are enough to model the\ndata and the resulting model is essentially a Bayesian warped GP which has identi\ufb01ed the sigmoid\nwarping for T1. 
Uncertainties in the deep GP are placed in the middle layer areas where no data are available for the respective time series, as sharing information between the two outputs is impossible. In contrast to the other two models, the multi-output GP can and must share information between the two time series. As discussed in Section 2 however, it is constrained to constant time-offsets and cannot model the nonlinear alignment in the data. Because of this, the model cannot recover the latent sine function and can only model one of the two outputs.

4.2 Pairs of wind turbines

This experiment is based on real data recorded from a pair of neighbouring wind turbines in a wind farm. The two time series $\mathcal{T}_1$ and $\mathcal{T}_2$ shown in gray in Figure 4 record the respective power generation of the two turbines over the course of one and a half hours, which was smoothed slightly using a rolling average over 60 seconds. There are 5400 data points for the first turbine (blue) and 4622 data points for the second turbine (green). We removed two intervals (drawn as dashed lines) from the second turbine's data set to inspect the behaviour of the model with missing data. This allows us to evaluate and compare the generative properties of our model in Figure 5.
The power generated by a wind turbine is mainly dependent on the speed of the wind fronts interacting with the turbine. For system identification tasks concerned with the behaviour of multiple wind turbines, associating the observations on different turbines due to the same wind fronts is an important task. However, it is usually not possible to directly measure these correspondences or wind propagation speeds between turbines, which means that there is no ground truth available. An additional problem is that the shared latent wind conditions are superimposed by turbine-specific local turbulences. Since these local effects are of comparable amplitude to short-term changes of wind speed, it is challenging to decide which parts of the signal to explain away as noise and which part to identify as the underlying shared process.

Figure 4: The joint posterior for two time series y1 and y2 of power production for a pair of wind turbines. The top and bottom plots show the two observed time series with training data and dashed missing data. The AMO-GP recovers an uncertain relative alignment of the two time series shown in the middle plot. High uncertainty about the alignment is placed in areas where multiple explanations are plausible due to the high amount of noise or missing data.

(a) Samples from a GP. (b) Samples from a MO-GP. (c) Samples from a DGP. (d) Samples from the AMO-GP.

Figure 5: A comparison of noiseless samples drawn from a GP, a MO-GP, a DGP and the AMO-GP. The separation of uncertainties implied by the model structure of the AMO-GP gives rise to an informative model. Since the uncertainty in the generative process is mainly placed in the relative alignment shown in Figure 4, all samples in Figure 5d resemble the underlying data in structure.

Our goal is the simultaneous learning of the uncertain alignment in time $a$ and of the shared latent wind condition $f$. Modelling the turbine-specific parts of the signals is not the objective, so they need to be explained by the Gaussian noise term. We use a squared exponential kernel as a prior for the alignment functions $a_d$ and as smoothing kernels in $f$. For the given data set we can assume the output warpings $g_d$ to be linear functions because there is only one dimension, the power generation, which in this data set is of similar shape for both turbines.
Again we encode a preference for alignments with slow dynamics with a prior on the length scales of $a_d$. As the signal has turbine-specific autoregressive components, plausible alignments are not unique. To constrain the AMO-GP, we want it to prefer alignments close to the identity function, which we chose as a prior mean function.
Figure 4 shows the joint model learned from the data in which $a_1$ is chosen to be the identity function. The possible alignments identified match the physical conditions of the wind farm. For the given turbines, time offsets of up to six minutes are plausible and for most wind conditions, the offset is expected to be close to zero. For areas where the alignment is quite certain however, the two time series are explained with comparable detail. The model is able to recover unambiguous associations well and successfully places high uncertainty on the alignment in areas where multiple explanations are plausible due to the noisy signal.
As expected, the uncertainty about the alignment also grows where data for the second time series is missing. This uncertainty is propagated through the shared function and results in higher predictive variances for the second time series. Because of the factorization in the model however, we can recover the uncertainties about the alignment and the shared latent function separately. Figure 5 compares samples drawn from our model with samples drawn from a GP, a MO-GP and a DGP. The GP reverts to its prior when data is missing, while the MO-GP does not handle short-term dynamics and smoothens the signal enough such that the nonlinear alignment can be approximated as constant. Samples drawn from a DGP model showcase the complexity of a DGP prior.
Unconstrained composite GPs are hard to reason about and make the model very \ufb02exible in\nterms of representable functions. Since the model\u2019s evidence is very broad, the posterior is uninformed\nand inference is hard. Additionally, as discussed in Appendix B and [11], the nested variational\ncompression bound tends to loosen with high uncertainties. AMO-GP shows richer structure: Due\nto the constraints imposed by the model, more robust inference leads to a more informed model.\nSamples show that it has learned that a maximum which is missing in the training data has to exist\nsomewhere, but the uncertainty about the correct alignment due to the local turbulence means that\ndifferent samples place the maximum at different locations in X-direction.\n\n5 Conclusion\n\nWe have proposed the warped and aligned multi-output Gaussian process (AMO-GP), in which MO-\nGPs are embedded in a hierarchy to \ufb01nd shared structure in latent spaces. We extended convolution\nprocesses [5] with conditionally independent Gaussian processes on both the input and output sides,\ngiving rise to a highly structured deep GP model. This structure can be used to both regularize the\nmodel and encode expert knowledge about speci\ufb01c parts of the system. By applying nested variational\ncompression [11] to inference in these models, we presented a variational lower bound which\ncombines Bayesian treatment of all parts of the model with scalability via stochastic optimization.\nWe compared the model with GPs, deep GPs and multi-output GPs on an arti\ufb01cial data set and showed\nhow the richer model-structure allows the AMO-GP to pick up on latent structure which the other\napproaches cannot model. We then applied the AMO-GP to real world data of two wind turbines\nand used the proposed hierarchy to model wind propagation in a wind farm and recover information\nabout the latent non homogeneous wind \ufb01eld. 
With uncertainties decomposed along the hierarchy, our approach handles ambiguities introduced by the stochasticity of the wind in a principled manner. This indicates that the AMO-GP is a good approach for these kinds of dynamical systems, where multiple misaligned sensors measure the same latent effect.

6 Acknowledgement

The project this report is based on was supported with funds from the German Federal Ministry of Education and Research under project number 01IB15001. The sole responsibility for the report's contents lies with the authors.

References
[1] Mauricio A. Alvarez, Lorenzo Rosasco, and Neil D. Lawrence. “Kernels for Vector-Valued Functions: a Review”. In: arXiv:1106.6251 [cs, math, stat] (June 2011). arXiv: 1106.6251.
[2] Mauricio A. Alvarez et al. “Efficient Multioutput Gaussian Processes through Variational Inducing Kernels”. In: AISTATS. Vol. 9. 2010, pp. 25–32.
[3] Mauricio Alvarez and Neil D. Lawrence. “Sparse convolved Gaussian processes for multi-output regression”. In: Advances in Neural Information Processing Systems. 2009, pp. 57–64.
[4] Eilyan Bitar and Pete Seiler. “Coordinated control of a wind turbine array for power maximization”. In: American Control Conference (ACC), 2013. IEEE, 2013, pp. 2898–2904.
[5] Phillip Boyle and Marcus R. Frean. “Dependent Gaussian Processes”. In: NIPS. Vol. 17. 2004, pp. 217–224.
[6] Phillip Boyle et al. Multiple Output Gaussian Process Regression. Tech. rep. 2005.
[7] Timothy C. Coburn. Geostatistics for Natural Resources Evaluation. Taylor & Francis Group, 2000.
[8] Andreas C. Damianou and Neil D. Lawrence. “Deep Gaussian Processes”. In: arXiv:1211.0358 [cs, math, stat] (Nov. 2012). arXiv: 1211.0358.
[9] David Duvenaud et al. Avoiding Pathologies in Very Deep Networks. 2014.
[10] James Hensman, Nicolo Fusi, and Neil D. Lawrence. “Gaussian Processes for Big Data”. In: arXiv:1309.6835 [cs, stat] (Sept. 2013).
[11] James Hensman and Neil D. Lawrence. “Nested Variational Compression in Deep Gaussian Processes”. In: arXiv:1412.1370 [stat] (Dec. 2014). arXiv: 1412.1370.
[12] James Hensman, Alex Matthews, and Zoubin Ghahramani. “Scalable Variational Gaussian Process Classification”. In: arXiv:1411.2005 [stat] (Nov. 2014). arXiv: 1411.2005.
[13] Andre G. Journel and Ch. J. Huijbregts. Mining Geostatistics. Academic Press, 1978.
[14] Miguel Lázaro-Gredilla. “Bayesian warped Gaussian processes”. In: Advances in Neural Information Processing Systems. 2012, pp. 1619–1627.
[15] Alexander G. de G. Matthews et al. “GPflow: A Gaussian process library using TensorFlow”. In: Journal of Machine Learning Research 18.40 (2017), pp. 1–6.
[16] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2006.
[17] Hugh Salimbeni and Marc Deisenroth. “Doubly Stochastic Variational Inference for Deep Gaussian Processes”. In: arXiv:1705.08933 [stat] (May 2017). arXiv: 1705.08933.
[18] J. G. Schepers and S. P. Van der Pijl. “Improved modelling of wake aerodynamics and assessment of new farm control strategies”. In: Journal of Physics: Conference Series. Vol. 75. IOP Publishing, 2007, p. 012039.
[19] Edward Snelson, Carl Edward Rasmussen, and Zoubin Ghahramani. “Warped Gaussian Processes”. In: MIT Press, 2004, pp. 337–344.
[20] Jasper Snoek et al. “Input Warping for Bayesian Optimization of Non-stationary Functions”. In: arXiv:1402.0929 [cs, stat] (Feb. 2014). arXiv: 1402.0929.
[21] Maryam Soleimanzadeh and Rafael Wisniewski. “Controller design for a wind farm, considering both power and load aspects”. In: Mechatronics 21.4 (2011), pp. 720–727.
[22] Michalis K. Titsias. “Variational Learning of Inducing Variables in Sparse Gaussian Processes”. In: AISTATS. Vol. 5. 2009, pp. 567–574.
[23] Michalis K. Titsias and Neil D. Lawrence. “Bayesian Gaussian process latent variable model”. In: International Conference on Artificial Intelligence and Statistics. 2010, pp. 844–851.
[24] Feng Zhou and Fernando De la Torre. “Generalized time warping for multi-modal alignment of human motion”. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 1282–1289.