{"title": "Multi-task Learning for Aggregated Data using Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 15076, "page_last": 15086, "abstract": "Aggregated data is commonplace in areas such as epidemiology and demography. For example, census data for a population is usually given as averages defined over time periods or spatial resolutions (cities, regions or countries). In this paper, we present a novel multi-task learning model based on Gaussian processes for joint learning of variables that have been aggregated at different input scales. Our model represents each task as the linear combination of the realizations of latent processes that are integrated at a different scale per task. We are then able to compute the cross-covariance between the different tasks either analytically or numerically. We also allow each task to have a potentially different likelihood model and provide a variational lower bound that can be optimised in a stochastic fashion making our model suitable for larger datasets. We show examples of the model in a synthetic example, a fertility dataset and an air pollution prediction application.", "full_text": "Multi-task Learning for Aggregated Data using\n\nGaussian Processes\n\nFariba Youse\ufb01\n\nMichael Thomas Smith\n\nMauricio A. \u00c1lvarez\n\nDepartment of Computer Science, University of Shef\ufb01eld\n\n{f.yousefi, m.t.smith, mauricio.alvarez}@sheffield.ac.uk\n\nAbstract\n\nAggregated data is commonplace in areas such as epidemiology and demography.\nFor example, census data for a population is usually given as averages de\ufb01ned over\ntime periods or spatial resolutions (cities, regions or countries). In this paper, we\npresent a novel multi-task learning model based on Gaussian processes for joint\nlearning of variables that have been aggregated at different input scales. 
Our model represents each task as the linear combination of the realizations of latent processes that are integrated at a different scale per task. We are then able to compute the cross-covariance between the different tasks either analytically or numerically. We also allow each task to have a potentially different likelihood model and provide a variational lower bound that can be optimised in a stochastic fashion, making our model suitable for larger datasets. We show examples of the model in a synthetic example, a fertility dataset and an air pollution prediction application.

1 Introduction

Many datasets in fields like ecology, epidemiology, remote sensing, sensor networks and demography appear naturally aggregated, that is, variables in these datasets are measured or collected in intervals, areas or supports of different shapes and sizes. For example, census data are usually sampled or collected as aggregated at different administrative divisions, e.g. borough, town, postcode or city levels. In sensor networks, correlated variables are measured at different resolutions or scales. In the near future, air pollution monitoring across cities and regions could be done using a combination of a few high-quality low time-resolution sensors and several low-quality (low-cost) high time-resolution sensors. Joint modelling of the variables registered in the census data, or of the variables measured using different sensor configurations at different scales, can improve predictions at the point or support levels.

In this paper, we are interested in providing a general framework for multi-task learning on these types of datasets. Our motivation is to use multi-task learning to jointly learn models for different tasks where each task is defined at (potentially) a different support of any shape and size and has a (potentially) different nature, i.e. it is a continuous, binary, categorical or count variable.
We appeal to the flexibility of Gaussian processes (GPs) for developing a prior over such types of datasets, and we also provide a scalable approach for variational Bayesian inference.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Gaussian processes have been used before for aggregated data [Smith et al., Law et al., 2018, Tanaka et al., 2019] and also in the context of the related field of multiple instance learning [Kim and De la Torre, 2010, Kandemir et al., 2016, Haußmann et al., 2017]. In multiple instance learning, each instance in the dataset consists of a set (or bag) of inputs with only one output (or label) for that whole set. The aim is to provide predictions at the level of individual inputs. Smith et al. provide a new kernel function to handle single regression tasks defined at different supports. They use cross-validation for hyperparameter selection. Law et al. [2018] use the weighted sum of a latent function evaluated at different inputs as the prior for the rate of a Poisson likelihood. The latent function follows a GP prior. The authors use stochastic variational inference (SVI) for approximating the posterior distributions. Tanaka et al. [2019] mainly use GPs for creating data from different auxiliary sources. Furthermore, they only consider Gaussian regression and they do not include inducing variables. While Smith et al. and Law et al. [2018] perform the aggregation at the latent prior stage, Kim and De la Torre [2010], Kandemir et al. [2016] and Haußmann et al. [2017] perform the aggregation at the likelihood level. These three approaches target a binary classification problem. Both Kim and De la Torre [2010] and Haußmann et al. [2017] focus on the case for which the label of the bag corresponds to the maximum of the (unobserved) individual labels of each input. Kim and De la Torre [2010] approximate the maximum using a softmax function computed using a latent GP prior evaluated across the individual elements of the bag. They use the Laplace approximation for computing the approximate posterior [Rasmussen and Williams, 2006]. Haußmann et al. [2017], on the other hand, approximate the maximum using the so-called bag label likelihood, introduced by the authors, which is similar to a Bernoulli likelihood with soft labels given by a convex combination between the bag labels and the maximum of the (latent) individual labels. The latent individual labels in turn follow Bernoulli likelihoods with parameters given by a GP. The authors provide a variational bound and include inducing inputs for scalable Bayesian inference. Kandemir et al. [2016] follow a similar approach to Law et al. [2018], equivalent to setting all the weights in Law et al.'s model to one. The sum is then used to modulate the parameter of a Bernoulli likelihood that models the bag labels. They use a Fully Independent Training Conditional approximation for the latent GP prior [Snelson and Ghahramani, 2006]. In contrast to these previous works, we provide a multi-task learning model for aggregated data that scales to large datasets and allows for heterogeneous outputs.

At the time of submission of this paper, the idea of using multi-task learning for aggregated datasets was simultaneously proposed by Hamelijnck et al. and Tanaka et al., two additional models to the one we propose in this paper. In our work, we allow heterogeneous likelihoods, which is different from both Hamelijnck et al. and Tanaka et al. We also allow an exact solution to the integration of the latent function through the kernel in Smith et al., which is different from Hamelijnck et al. Also, we use inducing inputs to reduce the computational complexity, another difference from the work in Tanaka et al.
Other relevant work is described in Section 3.

For building the multi-task learning model we appeal to the linear model of coregionalisation [Journel and Huijbregts, 1978, Goovaerts, 1997] that has gained popularity in the multi-task GP literature in recent years [Bonilla et al., 2008, Alvarez et al., 2012]. We also allow different likelihood functions [Moreno-Muñoz et al., 2018] and different input supports per individual task. Moreover, we introduce inducing inputs at the level of the underlying common set of latent functions, an idea initially proposed in Alvarez and Lawrence [2009]. We then use stochastic variational inference for GPs [Hensman et al., 2013], leading to an approximation similar to the one obtained in Moreno-Muñoz et al. [2018]. Empirical results show that the multi-task learning approach developed here provides accurate predictions in different challenging datasets where tasks have different supports.

2 Multi-task learning for aggregated data at different scales

In this section we first define the basic model in the single-task setting. We then extend the model to the multi-task setting and finally provide details for the stochastic variational formulation for approximate Bayesian inference.

2.1 Change of support using Gaussian processes

Change of support has been studied in geostatistics before [Gotway and Young, 2002]. In this paper, we use a formulation similar to Kyriakidis [2004]. We start by defining a stochastic process over the input interval (x_a, x_b) using

f(x_a, x_b) = (1/Δx) ∫_{x_a}^{x_b} u(z) dz,

where u(z) is a latent stochastic process that we assume follows a Gaussian process with zero mean and covariance k(z, z'), and Δx = |x_b − x_a|. Dividing by Δx helps to keep the proportionality between the length of the interval and the area under u(z) in the interval.
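As a quick numerical illustration (our own sketch, not part of the paper; the grid size and EQ hyperparameters are arbitrary choices), the interval average above can be approximated by the mean of u over a uniform grid, which makes f(x_a, x_b) a linear functional of u whose moments follow directly from the latent covariance:

```python
import numpy as np

def eq_kernel(z1, z2, variance=1.0, lengthscale=10.0):
    """Exponentiated-quadratic kernel k(z, z') = s^2 exp(-(z - z')^2 / l^2)."""
    d = z1[:, None] - z2[None, :]
    return variance * np.exp(-d**2 / lengthscale**2)

def interval_average_moments(xa, xb, n_grid=200, **kern_args):
    """Approximate f(xa, xb) = (1/dx) int_{xa}^{xb} u(z) dz by a grid mean.

    Writing f ~= w @ u with uniform weights w_i = 1/n_grid, the zero-mean
    latent GP gives E[f] = 0 and var[f] = w @ K @ w."""
    z = np.linspace(xa, xb, n_grid)
    w = np.full(n_grid, 1.0 / n_grid)
    K = eq_kernel(z, z, **kern_args)
    return 0.0, float(w @ K @ w)

mean_f, var_f = interval_average_moments(0.0, 5.0, lengthscale=10.0)
# Averaging smooths u, so var[f] never exceeds the point variance of u,
# and a wider interval averages away more variance.
```

Note that the variance of the average shrinks as the interval grows relative to the length-scale, which is the behaviour the density formulation is designed to give.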
In other words, the process f(·) is modeled as a density, meaning that inputs with widely differing supports will behave in a similar way. The first two moments of f(x_a, x_b) are given as E[f(x_a, x_b)] = 0 and E[f(x_a, x_b)f(x'_a, x'_b)] = (1/(ΔxΔx')) ∫_{x_a}^{x_b} ∫_{x'_a}^{x'_b} E[u(z)u(z')] dz' dz. The covariance for f(x_a, x_b) follows as cov[f(x_a, x_b), f(x'_a, x'_b)] = (1/(ΔxΔx')) ∫_{x_a}^{x_b} ∫_{x'_a}^{x'_b} k(z, z') dz' dz, since E[u(z)] = 0. Let us use k(x_a, x_b, x'_a, x'_b) to refer to cov[f(x_a, x_b), f(x'_a, x'_b)]. We can now use these mean and covariance functions for representing the Gaussian process prior for f(x_a, x_b) ∼ GP(0, k(x_a, x_b, x'_a, x'_b)). For some forms of k(z, z') it is possible to obtain an analytical expression for k(x_a, x_b, x'_a, x'_b). For example, if k(z, z') follows an Exponentiated-Quadratic (EQ) covariance form, k(z, z') = σ² exp{−(z − z')²/ℓ²}, where σ² is the variance of the kernel and ℓ is the length-scale, it can be shown that k(x_a, x_b, x'_a, x'_b) follows as

k(x_a, x_b, x'_a, x'_b) = (σ²ℓ²/(2ΔxΔx')) [h((x_b − x'_a)/ℓ) + h((x_a − x'_b)/ℓ) − h((x_b − x'_b)/ℓ) − h((x_a − x'_a)/ℓ)],

where h(z) = √π z erf(z) + e^{−z²}, with erf(z) the error function defined as erf(z) = (2/√π) ∫_0^z e^{−r²} dr. Other kernels for k(z, z') also lead to analytical expressions for k(x_a, x_b, x'_a, x'_b); see for example Smith et al.

So far, we have restricted the exposition to one-dimensional intervals. However, we can define the stochastic process f over a general support υ, with measure |υ|, using

f(υ) = (1/|υ|) ∫_{z∈υ} u(z) dz.

The support υ generally refers to an area or volume of any shape or size. Following similar assumptions to the ones we used for f(x_a, x_b), we can build a GP prior to represent f(υ) with covariance k(υ, υ') defined as k(υ, υ') = (1/(|υ||υ'|)) ∫_{z∈υ} ∫_{z'∈υ'} k(z, z') dz' dz. Let z ∈ R^p. If the support υ has a regular shape, e.g. a hyperrectangle, then assumptions on u(z) such as additivity or factorization across input dimensions will lead to kernels that can be expressed as additions of kernels or products of kernels acting over a single dimension. For example, let u(z) = ∏_{i=1}^p u_i(z_i), where z = [z_1, …, z_p]⊤, and place a GP over each u_i(z_i) ∼ GP(0, k(z_i, z'_i)). If each k(z_i, z'_i) is an EQ kernel, then k(υ, υ') = ∏_{i=1}^p k(x_{i,a}, x_{i,b}, x'_{i,a}, x'_{i,b}), where (x_{i,a}, x_{i,b}) and (x'_{i,a}, x'_{i,b}) are the intervals across each input dimension. If the support υ does not follow a regular shape, i.e. it is a polytope, then we can approximate the double integration by numerical integration inside the support.

2.2 Multi-task learning setting

Our inspiration for multi-task learning is the linear model of coregionalisation (LMC) [Journel and Huijbregts, 1978]. This model has connections with other multi-task learning approaches that use kernel methods. For example, Teh et al.
[2005] and Bonilla et al. [2008] are particular cases of the LMC. A detailed review is available in Alvarez et al. [2012]. In the LMC, each output (or task in our case) is represented as a linear combination of a common set of latent Gaussian processes. Let {u_q(z)}_{q=1}^Q be a set of Q GPs with zero means and covariance functions k_q(z, z'). Each GP u_q(z) is sampled independently and identically R_q times to produce {u^i_q(z)}_{i=1,q=1}^{R_q,Q} realizations that are used to represent the outputs. Let {f_d(υ)}_{d=1}^D be a set of tasks where each task is defined at a different support υ. We use the set of realizations u^i_q(z) to represent each task f_d(υ) as

f_d(υ) = Σ_{q=1}^Q Σ_{i=1}^{R_q} (a^i_{d,q}/|υ|) ∫_{z∈υ} u^i_q(z) dz,   (1)

where the coefficients a^i_{d,q} weight the contribution of each integral term to f_d(υ). Since cov[u^i_q(z), u^{i'}_{q'}(z')] = k_q(z, z') δ_{q,q'} δ_{i,i'}, with δ_{α,β} the Kronecker delta between α and β, the cross-covariance k_{f_d,f_{d'}}(υ, υ') between f_d(υ) and f_{d'}(υ') is then given as

k_{f_d,f_{d'}}(υ, υ') = Σ_{q=1}^Q (b^q_{d,d'}/(|υ||υ'|)) ∫_{z∈υ} ∫_{z'∈υ'} k_q(z, z') dz' dz,

where b^q_{d,d'} = Σ_{i=1}^{R_q} a^i_{d,q} a^i_{d',q}. Following the discussion in Section 2.1, the double integral can be solved analytically for some options of υ, υ' and k_q(z, z'). Generally, a numerical approximation can be obtained.

It is also worth mentioning at this point that the model does not require all the tasks to be defined at the area level. Some of the tasks could also be defined at the point level. Say for example that f_d is defined at the support level υ, f_d(υ), whereas f_{d'} is defined at the point level, say x ∈ R^p, f_{d'}(x). In this case, f_{d'}(x) = Σ_{q=1}^Q Σ_{i=1}^{R_q} a^i_{d',q} u^i_q(x). We can still compute the cross-covariance between f_d(υ) and f_{d'}(x), k_{f_d,f_{d'}}(υ, x), leading to k_{f_d,f_{d'}}(υ, x) = Σ_{q=1}^Q (b^q_{d,d'}/|υ|) ∫_{z∈υ} k_q(z, x) dz. For the case Q = 1 and p = 1 (the dimensionality of the input space), that is, z, z', x ∈ R, υ = (x_a, x_b), and an EQ kernel for k(z, z'), we get

k_{f_d,f_{d'}}(υ, x) = (b_{d,d'}/Δx) ∫_{x_a}^{x_b} k(z, x) dz = b_{d,d'} (ℓ√π/(2Δx)) [erf((x_b − x)/ℓ) + erf((x − x_a)/ℓ)]

(we used σ² = 1 to avoid an overparameterization for the variance). Again, if υ does not have a regular shape, we can approximate the integral numerically.

Let us define the vector-valued function f(υ) = [f_1(υ), …, f_D(υ)]⊤. A GP prior over f(υ) can use the kernel defined above so that

f(υ) ∼ GP(0, Σ_{q=1}^Q (1/(|υ||υ'|)) B_q ∫_{z∈υ} ∫_{z'∈υ'} k_q(z, z') dz' dz),

where each B_q ∈ R^{D×D} is known as a coregionalisation matrix. The scalar term ∫_{z∈υ} ∫_{z'∈υ'} k_q(z, z') dz' dz modulates B_q as a function of υ and υ'.

The prior above can be used for modulating the parameters of likelihood functions that model the observed data.
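To make these closed-form expressions concrete, the following sketch (our own illustration rather than the paper's code; we take σ² = 1 and pick arbitrary intervals) implements the interval-integrated EQ kernel through h(z) = √π z erf(z) + e^(−z²) and the area-to-point cross-covariance through the error function, checking both against brute-force quadrature:

```python
import numpy as np
from scipy.special import erf
from scipy.integrate import dblquad, quad

def h(z):
    """h(z) = sqrt(pi) z erf(z) + exp(-z^2), the antiderivative building block."""
    return np.sqrt(np.pi) * z * erf(z) + np.exp(-z**2)

def k_area_area(xa, xb, xpa, xpb, l=1.0):
    """cov[f(xa,xb), f(xpa,xpb)] for k(z,z') = exp(-(z-z')^2 / l^2), sigma^2 = 1."""
    dx, dxp = abs(xb - xa), abs(xpb - xpa)
    return (l**2 / (2 * dx * dxp)) * (
        h((xb - xpa) / l) + h((xa - xpb) / l)
        - h((xb - xpb) / l) - h((xa - xpa) / l))

def k_area_point(xa, xb, x, l=1.0):
    """cov[f(xa,xb), u(x)]: a single integral of the EQ kernel over (xa, xb)."""
    dx = abs(xb - xa)
    return (l * np.sqrt(np.pi) / (2 * dx)) * (
        erf((xb - x) / l) + erf((x - xa) / l))

# Brute-force checks of both formulas by numerical integration.
kern = lambda z, zp, l=1.0: np.exp(-(z - zp)**2 / l**2)
num_aa, _ = dblquad(lambda zp, z: kern(z, zp), 0.0, 2.0, 1.0, 4.0)
num_aa /= (2.0 * 3.0)   # divide by dx * dx'
num_ap, _ = quad(lambda z: kern(z, 1.5), 0.0, 2.0)
num_ap /= 2.0           # divide by dx
```

The quadrature values agree with the closed forms to numerical precision, and the area-to-area covariance is symmetric in its two supports, as it must be.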
The simplest case corresponds to the multi-task regression problem, which can be modeled using a multivariate Gaussian distribution. Let y(υ) = [y_1(υ), …, y_D(υ)]⊤ be a random vector modeling the observed data as a function of υ. In the multi-task regression problem y(υ) ∼ N(µ(υ), Σ), where µ(υ) = [µ_1(υ), …, µ_D(υ)]⊤ is the mean vector and Σ is a diagonal matrix with entries {σ²_{y_d}}_{d=1}^D. We can use the GP prior f(υ) as the prior over the mean vector, µ(υ) ∼ f(υ). Since both the likelihood and the prior are Gaussian, both the marginal distribution for y(υ) and the posterior distribution of f(υ) given y(υ) can be computed analytically. For example, the marginal distribution for y(υ) is given as y(υ) ∼ N(0, Σ_{q=1}^Q (1/(|υ||υ'|)) B_q ∫_{z∈υ} ∫_{z'∈υ'} k_q(z, z') dz' dz + Σ). Moreno-Muñoz et al. [2018] introduced the idea of allowing each task to have a different likelihood function and modulated the parameters of that likelihood function using one or more elements in the vector-valued GP prior. For that general case, the marginal likelihood and the posterior distribution cannot be computed in closed form.

2.3 Stochastic variational inference

Let D = {Υ, y} be a dataset of multiple tasks with potentially different supports per task, where Υ = {υ_d}_{d=1}^D, with υ_d = [υ_{d,1}, …, υ_{d,N_d}]⊤, and y = [y_1, …, y_D]⊤, with y_d = [y_{d,1}, …, y_{d,N_d}]⊤ and y_{d,j} = y_d(υ_{d,j}). Notice that y without υ refers to the output vector for the dataset.
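Before moving to the variational treatment, the Gaussian-marginal case can be sketched numerically. This is our own minimal illustration, under the simplifying assumptions that all D tasks share the same set of supports and Q = 1; the supports, length-scale and noise level are made up. The covariance of the stacked output vector is then B_q ⊗ K̄_q plus diagonal noise, where K̄_q holds the doubly-integrated kernel, approximated here by grid averaging:

```python
import numpy as np

rng = np.random.default_rng(0)

def agg_kernel(supports, l=20.0, n_grid=50):
    """K[i, j] ~= (1/(|v_i||v_j|)) double-integral of the EQ kernel over the
    supports, approximated by averaging the kernel over a grid in each one."""
    grids = [np.linspace(a, b, n_grid) for a, b in supports]
    K = np.empty((len(supports), len(supports)))
    for i, gi in enumerate(grids):
        for j, gj in enumerate(grids):
            d = gi[:, None] - gj[None, :]
            K[i, j] = np.exp(-d**2 / l**2).mean()
    return K

# Two tasks (D = 2), one latent process (Q = 1), shared supports for illustration.
supports = [(0, 10), (10, 20), (20, 40)]
Kq = agg_kernel(supports)
Lq = np.array([[1.0, 0.0], [0.5, 0.8]])   # Cholesky factor of B_q
Bq = Lq @ Lq.T                            # coregionalisation matrix, PSD by construction
Sigma_noise = 0.1 * np.eye(2 * len(supports))
Cy = np.kron(Bq, Kq) + Sigma_noise        # marginal covariance of the stacked y
y_draw = rng.multivariate_normal(np.zeros(Cy.shape[0]), Cy)  # one draw from the marginal
```

With different supports per task the Kronecker structure no longer applies and the blockwise cross-covariances of Section 2.2 are assembled directly, but the PSD argument (B_q PSD, aggregated kernel PSD) is the same.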
We are interested in computing the posterior distribution p(f|y) = p(y|f)p(f)/p(y), where f = [f_1, …, f_D]⊤, with f_d = [f_{d,1}, …, f_{d,N_d}]⊤ and f_{d,j} = f_d(υ_{d,j}). In this paper, we use stochastic variational inference to compute a deterministic approximation of the posterior distribution, p(f|y) ≈ q(f), by means of the well-known idea of inducing variables. In contrast to the use of SVI for traditional Gaussian processes, where the inducing variables are defined at the level of the process f, we follow Álvarez et al. [2010] and Moreno-Muñoz et al. [2018] and define the inducing variables at the level of the latent processes u_q(z). For simplicity in the notation, we assume R_q = 1. Let u = {u_q}_{q=1}^Q be the set of inducing variables, where u_q = [u_q(z_1), …, u_q(z_M)]⊤, with Z = {z_m}_{m=1}^M the inducing inputs. Notice also that we have used a common set of inducing inputs Z for all latent functions, but we can easily define a set Z_q per inducing vector u_q.

For the multi-task regression case, it is possible to compute an analytical expression for the Gaussian posterior distribution over the inducing variables u, q(u), following a similar approach to Álvarez et al. [2010]. However, such an approximation is only valid for the case in which the likelihood model p(y|f) is Gaussian, and the variational bound obtained is not amenable to stochastic optimisation. An alternative for finding q(u) also establishes a lower bound L for the log-marginal likelihood log p(y), but uses numerical optimisation for maximising the bound with respect to the mean parameters, µ, and the covariance parameters, S, of the Gaussian distribution q(u) ∼ N(µ, S) [Moreno-Muñoz et al., 2018]. Such a numerical procedure can be used for any likelihood model p(y|f), and the optimisation can be performed using mini-batches. We follow this approach.

Lower-bound The lower bound for the log-marginal likelihood follows as

log p(y) ≥ ∫∫ q(f, u) log [p(y|f)p(f|u)p(u)/q(f, u)] df du = L,

where q(f, u) = p(f|u)q(u), p(f|u) ∼ N(K_{fu}K_{uu}^{-1}u, K_{ff} − K_{fu}K_{uu}^{-1}K_{fu}⊤), and p(u) ∼ N(0, K_{uu}) is the prior over the inducing variables. Here K_{fu} is a blockwise matrix with matrices K_{f_d,u_q}. In turn, each of these matrices has entries given by k_{f_d,u_q}(υ, z') = (a_{d,q}/|υ|) ∫_{z∈υ} k_q(z, z') dz. Similarly, K_{uu} is a block-diagonal matrix with blocks K_q whose entries are computed using k_q(z, z'). The optimal q(u) is chosen by numerically maximising L with respect to the parameters µ and S. To ensure a valid covariance matrix S, we optimise a Cholesky factor L_S, where S = L_S L_S⊤. See Appendix A.1 for more details on the lower bound. The computational complexity is similar to that of the model in Moreno-Muñoz et al. [2018], O(QM³ + JNQM²), where J depends on the types of likelihoods used for the different tasks. For example, if we model all the outputs using Gaussian likelihoods, then J = D. For details, see Moreno-Muñoz et al. [2018].

Hyperparameter learning When using the multi-task learning method, we need to optimise the hyperparameters associated with the LMC: the coregionalisation matrices B_q, the hyperparameters of the kernels k_q(z, z'), and any other hyperparameter associated with the likelihood functions p(y|f) that has not been considered as a member of the latent vector f(υ). Hyperparameter optimisation is done using the lower bound L as the objective function. First, L is maximised with respect to the variational distribution q(u), and then with respect to the hyperparameters. The two steps are repeated one after the other until reaching convergence.
This style of optimisation is known as variational EM (Expectation-Maximization) when using the full dataset [Beal, 2003], or as its stochastic version when employing mini-batches [Hoffman et al., 2013]. In the Expectation step we compute a variational posterior distribution, and in the Maximization step we use the variational lower bound to find point estimates of the hyperparameters. For optimising the hyperparameters in B_q, we also use a Cholesky decomposition for each matrix to ensure positive definiteness. So instead of optimising B_q directly, we optimise L_q, where B_q = L_q L_q⊤. For the experimental section, we use the EQ kernel for k_q(z, z'), so we fix the variance of k_q(z, z') to one (the variance per output is already contained in the matrices B_q) and optimise the length-scales ℓ_q.

Predictive distribution Given a new set of test inputs Υ∗, the predictive distribution p(y∗|y, Υ∗) is computed using p(y∗|y, Υ∗) = ∫ p(y∗|f∗)q(f∗) df∗, where y∗ and f∗ refer to the vector-valued functions y and f evaluated at Υ∗. Notice that q(f∗) ≈ p(f∗|y). Even though y does not appear explicitly in the expression for q(f∗), it has been used to compute the posterior q(u) through the optimisation of L, where y is explicitly taken into account. We are usually interested in the mean prediction E[y∗] and the predictive variance var[y∗]. Both can be computed by exchanging integrals in the double integration over y∗ and f∗.
See Appendix A.1 for more details on this.

3 Related work

Machine learning methods for different forms of aggregated datasets are also known under the names of multiple instance learning, learning from label proportions or weakly supervised learning on aggregate outputs [Kück and de Freitas, 2005, Musicant et al., 2007, Quadrianto et al., 2009, Patrini et al., 2014, Kotzias et al., 2015, Bhowmik et al., 2015]. Law et al. [2018] provided a summary of these different approaches. Typically these methods start with the following setting: each instance in the dataset is in the form of a set of inputs for which there is only one corresponding output (e.g. the proportion of class labels, an average or a sample statistic). The prediction problem then usually consists in predicting the individual outputs for the individual inputs in the set. The setting we present in this paper is slightly different in the sense that, in general, for each instance, the input corresponds to a support of any shape and size and the output corresponds to a vector-valued output. Moreover, each task can have its own support. Similarly, while most of these ML approaches have been developed for either regression or classification, our model is built on top of Moreno-Muñoz et al. [2018], allowing each task to have a potentially different likelihood.

As mentioned in the introduction, Gaussian processes have also been used for multiple instance learning or aggregated data [Kim and De la Torre, 2010, Kandemir et al., 2016, Haußmann et al., 2017, Smith et al., Law et al., 2018, Tanaka et al., 2019, Hamelijnck et al., Tanaka et al.]. Compared to most of these previous approaches, our model goes beyond the single-task problem and allows learning multiple tasks simultaneously. Each task can have its own support at training and test time. Compared to other multi-task approaches, we allow for heterogeneous outputs. Although our model was formulated for a continuous support x ∈ υ_{d,j}, we can also define it in terms of a finite set of (previously defined) inputs in the support, e.g. a set {x_{d,j,1}, …, x_{d,j,K_{d,j}}} ∈ υ_{d,j}, which is more akin to the bag formulations in these previous works. This would require changing the integral (1/|υ_{d,j}|) ∫_{z∈υ_{d,j}} u^i_q(z) dz in (1) for the sum (1/K_{d,j}) Σ_{∀x∈υ_{d,j}} u^i_q(x_{d,j,k}).

In geostatistics, a similar problem has been studied under the names of downscaling or spatial disaggregation [Zhang et al., 2014], particularly using different forms of kriging [Goovaerts, 1997]. It is also closely related to the problem of change of support, described in detail in Gotway and Young [2002]. Block-to-point kriging (or area-to-point kriging if the support is defined on a surface) is a common method for downscaling, that is, providing predictions at the point level given data at the block level [Kyriakidis, 2004, Goovaerts, 2010]. We extend the approach introduced in Kyriakidis [2004], later revisited by Goovaerts [2010] for count data, to the multi-task setting, including also a stochastic variational EM algorithm for scalable inference.

If we consider the high-resolution outputs as high-fidelity outputs and the low-resolution outputs as low-fidelity outputs, our work also falls under the umbrella of multi-fidelity models, where co-kriging using the linear model of coregionalisation has also been used as an alternative [Peherstorfer et al., 2018, Fernández-Godino et al.].

4 Experiments

In this section, we apply the multi-task learning model for prediction in three different datasets: a synthetic example with two tasks that each have a Poisson likelihood, a two-dimensional input dataset of fertility rates aggregated by year of conception and ages in Canada, and an air-pollution sensor network where one task corresponds to a high-accuracy, low-frequency particulate matter sensor and another task corresponds to a low-cost, low-accuracy, high-resolution sensor. In these examples, we use k-means clustering over the input data, with k = M, to initialise the values of the inducing inputs, Z, which are also kept fixed during optimisation. We assume the inducing inputs are points, but they could be defined as intervals or supports. For standard optimisation we used the L-BFGS-B algorithm, and when SVI was needed, the Adam optimiser, included in the climin library, was used for the optimisation of the variational distribution (variational E-step) and the hyperparameters (variational M-step). The implementation is based on the GPy framework and is available on GitHub: https://github.com/frb-yousefi/aggregated-multitask-gp.

Synthetic data In this section we evaluate our model with a synthetic dataset.
For all of the experiments we use Q = 1 with an EQ covariance for the latent function u_1(z). We set up a toy problem with D = 2 tasks, where both likelihood functions are Poisson. We sample from the latent vector-valued GP and use those samples to modulate the Poisson likelihoods using exp(f_1(·)) and exp(f_2(·)) as the respective rates. The first task is generated using intervals of υ_1 = 1 unit, whereas the second task is generated using intervals of υ_2 = 2 units. All the inputs are uniformly distributed in the range [0, 250]. We generated 250 observations for task 1 and 125 for task 2. For training the multi-task model, we select N_1 = 200 from the 250 observations for task 1 and use all N_2 = 125 for the second task. The other 50 data points for task 1 correspond to a gap in the interval [130, 180] that we use as the test set. In this experiment, we evaluated our model's capability in predicting one task, sampled more frequently, using the training information from a second task with a larger support. In Figure 1 we show that the data in the second task, with a larger support, helps predicting the test data in the gap present in the first task, with a smaller support (right panel). However, this is not the case in the single-task learning scenario, where the predictions are basically constant and equal to 1 (left panel). Both models predict the training data equally well.

Figure 1: Counts for the Poisson likelihoods and predictions using the single-task (a, left) vs multi-task (b, right) models. Predictions are shown only for the first task (the one with support of υ_1 = 1). The blue bars are the original one-unit support data, the green bars are the predicted training count data and the red bars are the predicted test results in the gap [130, 180]. We did not include the two-unit support data (the second task) for clarity in the visualisation. The multi-task figure on the right (b) is shown again in Appendix A.4, Figure 7, for better visualisation.

SMSE (standardized mean squared error) and SNLP (standardized negative log probability density) are calculated for five independent runs. For the multi-task scenario they are 0.464 ± 0.136 and −0.822 ± 0.109, and for the single-task case they are 0.9699 ± 0.016 and −0.095 ± 0.036, respectively.

Fertility rates from a Canadian census In this experiment, a subset of the Canadian fertility dataset is used from the Human Fertility Database (HFD)¹. The dataset consists of live births' statistics by year, age of mother and birth order. The ages of the mother are between [15, 54] and the years are between [1944, 2009]. It contains 2640 data points of fertility rate per birth order (the output variable), and there are four birth orders. We used the 2640 data points of the first birth only. The dataset was randomly split into 1640 training points and 1000 test points. We consider two tasks: the first task consists of a varying number of data observations randomly taken from the 1640 training points. The second task consists of all the training data aggregated at two different resolutions, 5 × 5 and 2 × 2 (we wanted to test the predictive performance when the relation of high-resolution data to low-resolution data was 1² to 5² and, in a second experiment, 1² to 2²). The aggregated data for the 5 × 5 case (a squared support of 5 years for the input age times 5 years for the input years of the study) is reduced to 104 data points, and the aggregated data for the 2 × 2 case is reduced to 660 points.

Figure 2: SMSE plots for the fertility dataset for 5 × 5 (left panel) and 2 × 2 (right panel) aggregated data.
The figure shows the performance as a function of the number of training instances used for the data sampled at the higher resolution. The test set always contains 1000 instances. We plot the mean and standard deviation over five repetitions of the experiment with different sets of training and test data. Appendix A.2 shows the same plots for SNLP. Appendix A.3 illustrates other experimental baselines compared with each other under the same metrics. Further experiments considering more tasks are also included in Appendix A.4.

In the experiments, we train this multi-task model by slowly increasing the amount of original-resolution training data, while keeping fixed the number of training points, mentioned before, for the aggregated second task. The output variable (fertility rate for the first birth order) was assumed to be Gaussian, so both tasks follow a Gaussian likelihood. We use Q = 1 with an EQ kernel k1(z, z′) with z ∈ R², where the two input variables are age of mother and birth year. We used 100 fixed inducing variables and mini-batches of 50 samples. The prediction task consists of predicting the 1000 original-resolution test data points with the help of the second task, which consists of the aggregated data (5 × 5 or 2 × 2, in two separate experiments).

1 https://www.humanfertility.org

Figure 2 shows the SMSE for five random selections of data points in the training and test sets. We notice that the multi-task learning model outperforms the single-task GP when there are few observations in the task with the original-resolution data. This pattern holds below 500 observations.
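The block aggregation used to build the second task is a plain block average over the age–year grid (40 ages × 66 years); a minimal numpy sketch, with illustrative names, is shown below. Trimming the years that do not divide evenly by the block size is our assumption about the preprocessing, not something stated in the text, but it reproduces the reported counts of 104 and 660 aggregated points.

```python
import numpy as np

def block_aggregate(grid, block):
    """Average a 2-D field over non-overlapping block-x-block supports.
    Rows/columns that do not divide evenly are trimmed (an assumption
    about the preprocessing, not stated in the paper)."""
    h = (grid.shape[0] // block) * block
    w = (grid.shape[1] // block) * block
    trimmed = grid[:h, :w]
    return trimmed.reshape(h // block, block, w // block, block).mean(axis=(1, 3))

# Hypothetical fertility-rate field: 40 ages (15-54) x 66 years (1944-2009).
rates = np.random.default_rng(0).random((40, 66))
print(block_aggregate(rates, 5).shape)  # (8, 13) -> 104 aggregated points
print(block_aggregate(rates, 2).shape)  # (20, 33) -> 660 aggregated points
```

Each aggregated value is the mean of the high-resolution values inside its support, matching the 5 × 5 and 2 × 2 resolutions used in the experiment.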
At that point, both models perform equally well, since the single-task GP now has enough training data. With respect to the two different resolutions, the performance of the multi-task model is better when the second task has a 2 × 2 resolution rather than a 5 × 5 resolution, as one might expect.

Air pollution monitoring network
Particulate air pollution can be measured accurately with high temporal precision using a β-attenuation (BAM) sensor or similar. Unfortunately, these are often prohibitively expensive. We propose instead to combine the measurements from a low-cost optical particle counter (OPC), which gives good temporal resolution but is often badly biased, with the results of cascade impactors (CIs), which are a cheaper method for assessing the mass of particulate species but integrate over many hours (e.g. 6 or 24 hours).

Figure 3: Upper plot: a (biased) OPC low-accuracy high-frequency measurement of PM2.5 air pollution. Lower plot: the high-precision low-frequency training data (black rectangles), the test data from the same instrument (red) and the posterior prediction for this output variable, predicting over the same 15-minute periods as the test data (blue, with pale blue indicating 95% confidence intervals). The ticks at the bottom of the lower plot indicate the positions of the inducing inputs. We have deliberately cut the higher peaks of the samples in the upper plot, which can reach 500 µg/m³, to better visualise the samples in other parts of the plot.

One can formulate the problem as observations of integrals of a latent function: the CI integrates over 6-hour periods, while the OPC sensor integrates over short 5-minute periods. We used data from two fine particulate matter (PM) sensors, measuring particles of less than 2.5 micrometres in diameter (PM2.5), colocated in Kampala, Uganda at 0.3073°N, 32.6205°E.
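This integral formulation is where the model's cross-covariances between aggregated tasks come from: the covariance between two interval-averaged observations of a latent GP is the double integral of the latent kernel over the two supports, which can be approximated numerically when no closed form is used. A one-dimensional sketch with an EQ kernel follows; the function names and quadrature rule are illustrative, not the paper's implementation.

```python
import numpy as np

def eq_kernel(s, t, variance=1.0, lengthscale=1.0):
    # Exponentiated-quadratic (EQ) covariance k(s, t).
    return variance * np.exp(-0.5 * ((s - t) / lengthscale) ** 2)

def avg_cov(a, b, n=200, **kern):
    """Cov[mean of u over interval a, mean of u over interval b] for a GP u,
    approximated by averaging the kernel over an n x n quadrature grid."""
    s = np.linspace(a[0], a[1], n)
    t = np.linspace(b[0], b[1], n)
    return eq_kernel(s[:, None], t[None, :], **kern).mean()

# Supports of width 1 and 2, as in the toy experiment earlier.
c12 = avg_cov((130.0, 131.0), (130.0, 132.0), lengthscale=5.0)
c11 = avg_cov((130.0, 131.0), (130.0, 131.0), lengthscale=5.0)
# Averaging shrinks covariance: widening one support lowers it further.
print(c11 < 1.0, c12 < c11)  # True True
```

The same double-integral construction, taken over 2-D supports and combined through coregionalisation, gives the cross-covariances between tasks with different support sizes.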
The data were collected between 2019-03-13 and 2019-03-22. We used the average of six-hour periods from a calibrated, MCERTS-verified Osiris (Turnkey) particulate air pollution monitoring system as the low-resolution data, and then compared the prediction results to the original measurements (available at a 15-minute resolution). We used a PMS 5003 (Plantower) low-cost OPC to provide the high-resolution data. Typically we found these values to be biased. We simply normalised (scaled) the data to emphasise that the absolute values of these variables are not of interest in this model.

Our multi-task model consists of a single latent function, Q = 1, with covariance k1(z, z′) that follows an EQ form. We assume both outputs follow Gaussian likelihoods. In our model, one task represents the high-accuracy low-resolution samples and the second task represents the low-accuracy high-resolution samples. The posterior GP aims both to fulfil the 6-hour-long integrals of the high-accuracy data (from the Osiris instrument) and to remain correlated with the high-frequency biased data from the OPC. We used 2000 iterations of the variational EM algorithm, with 200 evenly spaced inducing points and a fixed lengthscale of 0.75 hours. We only optimise the parameters of the coregionalisation matrix B1 ∈ R2×2 and the variance of the noise of each Gaussian likelihood.

Figure 3 illustrates the results for a 24-hour period. The training data consist of readings from the high-resolution low-accuracy sensor and the low-frequency high-accuracy sensor. The aim is to reconstruct the underlying level of pollution that both sensors are measuring.
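The preprocessing described here, 6-hour block averages of the 15-minute reference series followed by normalisation, amounts to simple block means and scaling. A minimal sketch under the assumption of a gap-free series (the names and the gap-free assumption are ours, for illustration only):

```python
import numpy as np

def six_hour_means(x, samples_per_period=24):
    """Collapse a 15-minute series into 6-hour means (24 readings per period).
    Assumes len(x) is a multiple of the period; real data with gaps would
    need masking before averaging."""
    return x.reshape(-1, samples_per_period).mean(axis=1)

def normalise(x):
    # Zero-mean, unit-variance scaling; absolute values are not of interest.
    return (x - x.mean()) / x.std()

rng = np.random.default_rng(1)
day = rng.gamma(2.0, 5.0, size=96)  # one day of 15-minute readings
low_res = six_hour_means(day)       # four 6-hour averages per day
print(low_res.shape)  # (4,)
```

Each 6-hour mean then enters the model as an integral observation over its period, while the OPC series enters at its native resolution.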
To test whether the additional high-frequency data improve the accuracy, we ran the coregionalisation both with and without this additional training data. We found that the SMSE for the predictions over the 9 days tested was substantially smaller with multi-task learning than with only the low-resolution samples: 0.439 ± 0.114 and 0.657 ± 0.100, respectively (the difference is statistically significant under a paired t-test, with a p-value of 0.0008). In summary, the model was able to incorporate this additional data and use it to improve the estimates while still ensuring that the long integrals were largely satisfied.

5 Conclusion

In this paper, we have introduced a powerful framework for working with aggregated datasets that allows the user to combine observations from disparate data types with varied supports. This allows us to produce both finely resolved and accurate predictions, exploiting the accuracy of low-resolution data and the fidelity of high-resolution side information. We chose our inducing points to lie in the latent space, a distinction that allows us to estimate multiple tasks with different likelihoods. SVI and variational EM with mini-batches make the framework scalable and tractable for potentially very large problems. A potential extension would be to consider how the "mixing" achieved through coregionalisation could vary across the domain by extending, for example, the Gaussian Process Regression Network model [Wilson et al., 2012] to deal with aggregated data. Such a model would allow latent functions with different lengthscales to be relevant at different locations in the domain.
In summary, this framework provides a versatile toolkit, allowing a mixture of likelihoods, kernels and tasks, and paves the way to the analysis of a very common and widely used data structure: that of values defined over a variety of supports on the domain.

6 Acknowledgements

MTS and MAA have been financed by the Engineering and Physical Sciences Research Council (EPSRC) Research Project EP/N014162/1. MAA has also been financed by the EPSRC Research Project EP/R034303/1.

References

Mauricio Álvarez, David Luengo, Michalis Titsias, and Neil Lawrence. Efficient multioutput Gaussian processes through variational inducing kernels. In AISTATS, pages 25–32, 2010.

Mauricio A. Alvarez and Neil D. Lawrence. Sparse convolved Gaussian processes for multi-output regression. In NIPS, pages 57–64, 2009.

Mauricio A. Alvarez, Lorenzo Rosasco, and Neil D. Lawrence. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, pages 195–266, 2012.

Matthew J. Beal. Variational algorithms for approximate Bayesian inference. PhD thesis, University College London, 2003.

Avradeep Bhowmik, Joydeep Ghosh, and Oluwasanmi Koyejo. Generalized linear models for aggregated data. In AISTATS, pages 93–101, 2015.

Edwin V. Bonilla, Kian M. Chai, and Christopher Williams. Multi-task Gaussian process prediction. In NIPS, pages 153–160, 2008.

Phillip Boyle and Marcus Frean. Dependent Gaussian processes. In NIPS, pages 217–224, 2005.

M. Giselle Fernández-Godino, Chanyoung Park, Nam-Ho Kim, and Raphael T. Haftka. Review of multi-fidelity models. arXiv:1609.07196, 2016.

Pierre Goovaerts. Geostatistics for natural resources evaluation. Oxford University Press, 1997.

Pierre Goovaerts. Combining areal and point data in geostatistical interpolation: Applications to soil science and medical geography. Mathematical Geosciences, pages 535–554, 2010.

Carol A.
Gotway and Linda J. Young. Combining incompatible spatial data. Journal of the American Statistical Association, pages 632–648, 2002.

Oliver Hamelijnck, Theodoros Damoulas, Kangrui Wang, and Mark Girolami. Multi-resolution multi-task Gaussian processes. arXiv:1906.08344, 2019.

Manuel Haußmann, Fred A. Hamprecht, and Melih Kandemir. Variational Bayesian multiple instance learning with Gaussian processes. In CVPR, pages 810–819, 2017.

James Hensman, Nicolo Fusi, and Neil D. Lawrence. Gaussian processes for big data. In UAI, pages 282–290, 2013.

James Hensman, Alexander Matthews, and Zoubin Ghahramani. Scalable variational Gaussian process classification. In AISTATS, pages 351–360, 2015.

Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, pages 1303–1347, 2013.

Andre G. Journel and Charles J. Huijbregts. Mining Geostatistics. Academic Press, 1978.

Melih Kandemir, Manuel Haussmann, Ferran Diego, Kumar T. Rajamani, Jeroen van der Laak, and Fred A. Hamprecht. Variational weakly supervised Gaussian processes. In Proceedings of the British Machine Vision Conference (BMVC), 2016.

Minyoung Kim and Fernando De la Torre. Gaussian processes multiple instance learning. In ICML, pages 535–542, 2010.

Dimitrios Kotzias, Misha Denil, Nando de Freitas, and Padhraic Smyth. From group to individual labels using deep features. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 597–606, 2015.

Hendrik Kück and Nando de Freitas. Learning about individuals from group statistics. In UAI, pages 332–339, 2005.

Phaedon C. Kyriakidis. A geostatistical framework for area-to-point spatial interpolation. Geographical Analysis, pages 259–289, 2004.

Ho Chung Leon Law, Dino Sejdinovic, Ewan Cameron, Tim C.D.
Lucas, Seth Flaxman, Katherine Battle, and Kenji Fukumizu. Variational learning on aggregate outputs with Gaussian processes. In NeurIPS, pages 6084–6094, 2018.

Pablo Moreno-Muñoz, Antonio Artés, and Mauricio A. Álvarez. Heterogeneous multi-output Gaussian process prediction. In NeurIPS, pages 6711–6720, 2018.

David R. Musicant, Janara M. Christensen, and Jamie F. Olson. Supervised learning by training on aggregate outputs. In Seventh IEEE International Conference on Data Mining (ICDM), pages 252–261, 2007.

Giorgio Patrini, Richard Nock, Paul Rivera, and Tiberio Caetano. (Almost) no label no cry. In NIPS, pages 190–198, 2014.

Benjamin Peherstorfer, Karen Willcox, and Max Gunzburger. Survey of multifidelity methods in uncertainty propagation, inference, and optimization. SIAM Review, 60(3):550–591, 2018.

Novi Quadrianto, Alex J. Smola, Tiberio S. Caetano, and Quoc V. Le. Estimating labels from label proportions. Journal of Machine Learning Research, pages 2349–2374, 2009.

Carl E. Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. MIT Press, 2006.

Alan D. Saul, James Hensman, Aki Vehtari, and Neil D. Lawrence. Chained Gaussian processes. In AISTATS, pages 1431–1440, 2016.

Michael T. Smith, Mauricio A. Álvarez, and Neil D. Lawrence. Gaussian process regression for binned data. arXiv:1809.02010, 2018.

Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In NIPS, pages 1257–1264, 2006.

Yusuke Tanaka, Toshiyuki Tanaka, Tomoharu Iwata, Takeshi Kurashima, Maya Okawa, Yasunori Akagi, and Hiroyuki Toda. Spatially aggregated Gaussian processes with multivariate areal outputs. arXiv:1907.08350, 2019.

Yusuke Tanaka, Tomoharu Iwata, Toshiyuki Tanaka, Takeshi Kurashima, Maya Okawa, and Hiroyuki Toda. Refining coarse-grained spatial data using auxiliary spatial data sets with various granularities.
In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5091–5099, 2019.

Yee-Whye Teh, Matthias Seeger, and Michael I. Jordan. Semiparametric latent factor models. In AISTATS, pages 333–340, 2005.

Andrew G. Wilson, David A. Knowles, and Zoubin Ghahramani. Gaussian process regression networks. In ICML, 2012.

Jingxiong Zhang, Peter Atkinson, and Michael F. Goodchild. Scale in Spatial Information and Analysis. CRC Press, 2014.