{"title": "Spatially Aggregated Gaussian Processes with Multivariate Areal Outputs", "book": "Advances in Neural Information Processing Systems", "page_first": 3005, "page_last": 3015, "abstract": "We propose a probabilistic model for inferring the multivariate function from multiple areal data sets with various granularities. Here, the areal data are observed not at location points but at regions. Existing regression-based models can only utilize the sufficiently fine-grained auxiliary data sets on the same domain (e.g., a city). With the proposed model, the functions for respective areal data sets are assumed to be a multivariate dependent Gaussian process (GP) that is modeled as a linear mixing of independent latent GPs. Sharing of latent GPs across multiple areal data sets allows us to effectively estimate the spatial correlation for each areal data set; moreover it can easily be extended to transfer learning across multiple domains. To handle the multivariate areal data, we design an observation model with a spatial aggregation process for each areal data set, which is an integral of the mixed GP over the corresponding region. By deriving the posterior GP, we can predict the data value at any location point by considering the spatial correlations and the dependences between areal data sets, simultaneously. 
Our experiments on real-world data sets demonstrate that our model can 1) accurately refine coarse-grained areal data, and 2) offer performance improvements by using the areal data sets from multiple domains.", "full_text": "Spatially Aggregated Gaussian Processes\n\nwith Multivariate Areal Outputs\n\nYusuke Tanaka1,3, Toshiyuki Tanaka3, Tomoharu Iwata2, Takeshi Kurashima1,\n\nMaya Okawa1, Yasunori Akagi1, Hiroyuki Toda1\n\n1NTT Service Evolution Labs., 2NTT Communication Science Labs., 3Kyoto University\n\n{yusuke.tanaka.rh,tomoharu.iwata.gy,takeshi.kurashima.uf, maya.ookawa.af,\n\nyasunori.akagi.cu,hiroyuki.toda.xb}@hco.ntt.co.jp, tt@i.kyoto-u.ac.jp\n\nAbstract\n\nWe propose a probabilistic model for inferring the multivariate function from\nmultiple areal data sets with various granularities. Here, the areal data are observed\nnot at location points but at regions. Existing regression-based models can only\nutilize the suf\ufb01ciently \ufb01ne-grained auxiliary data sets on the same domain (e.g.,\na city). With the proposed model, the functions for respective areal data sets are\nassumed to be a multivariate dependent Gaussian process (GP) that is modeled as\na linear mixing of independent latent GPs. Sharing of latent GPs across multiple\nareal data sets allows us to effectively estimate the spatial correlation for each areal\ndata set; moreover it can easily be extended to transfer learning across multiple\ndomains. To handle the multivariate areal data, we design an observation model\nwith a spatial aggregation process for each areal data set, which is an integral of\nthe mixed GP over the corresponding region. By deriving the posterior GP, we can\npredict the data value at any location point by considering the spatial correlations\nand the dependences between areal data sets, simultaneously. 
Our experiments on\nreal-world data sets demonstrate that our model can 1) accurately re\ufb01ne coarse-\ngrained areal data, and 2) offer performance improvements by using the areal data\nsets from multiple domains.\n\n1\n\nIntroduction\n\nGovernments and other organizations are now collecting data from cities on items such as poverty\nrate, air pollution, crime, energy consumption, and traf\ufb01c \ufb02ow. These data play a crucial role in\nimproving the life quality of citizens in many aspects including socio-economics [23, 24], public\nsecurity [2, 32], public health [12], and urban planning [38]. For instance, the spatial distribution of\npoverty is helpful in identifying key regions that require intervention in a city; it makes it easier to\noptimize resource allocation for remedial action.\nIn practice, the data collected from cities are often spatially aggregated, e.g.,\naveraged over a region; thus only areal data are available; observations are not\nassociated with location points but with regions. Figure 1 shows an example\nof areal data, which is the distribution of poverty rate in New York City, where\ndarker hues represent regions with higher rates. This poverty rate data set was\nactually obtained via household surveys taken over the whole city. The survey\nresults are aggregated over prede\ufb01ned regions [24]. The problem addressed\nherein is to infer the function from the areal data; once we have the function\nwe can predict data values at any location point. Solving this problem allows\nus to obtain spatially-speci\ufb01c information about cities; it is useful for \ufb01nding\nkey pin-point regions ef\ufb01ciently.\n\nFigure 1: Areal data\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fOne promising approach to address this problem is to utilize a wide variety of data sets from the\nsame domain (e.g., a city). 
Existing regression-based models learn relationships between target data\nand auxiliary data sets [14, 18, 21, 24, 28]. These models, however, assume that the auxiliary data\nsets have suf\ufb01ciently \ufb01ne spatial granularity (e.g., 1 km \u21e5 1 km grid cells); unfortunately, many areal\ndata sets are actually associated with large regions (e.g., zip code and police precinct). These models\ncannot then make full use of the coarse-grained auxiliary data sets. Another important drawback of\nall the prior works is that their performance in re\ufb01ning the areal data is suspect if we have only a few\ndata sets available for the domain.\nIn this paper, we propose a probabilistic model, called Spatially Aggregated Gaussian Processes\n(SAGP) herein after, that can infer the multivariate function from multiple areal data sets with various\ngranularities. In SAGP, the functions for the areal data sets are assumed to be a multivariate dependent\nGaussian process (GP) that is modeled as a linear mixing of independent latent GPs. The latent GPs\nare shared among all areal data sets in the target domain, which is expected to effectively learn the\nspatial correlation for each data set even if the number of observations in a data set is small; that is, a\ndata set associated with a coarse-grained region. Since the areal data are identi\ufb01ed by regions, not by\nlocation points, we introduce an observation model with the spatial aggregation process, in which\nareal observations are assumed to be calculated by integrating the mixed GP over the corresponding\nregion; then the covariance between regions is given by the double integral of the covariance function\nover the corresponding pair of regions. This allows us to accurately evaluate the covariance between\nregions from a consideration of region shape. 
Thus the proposal is very helpful if there are irregularly shaped regions (e.g., extremely elongated) in the input data.\nThe mechanism adopted in SAGP for sharing latent processes is also advantageous in that it makes it straightforward to utilize data sets from multiple domains. This allows our model to learn the spatial correlation for each data set by sharing the latent GPs among all areal data sets from multiple domains; SAGP remains applicable even if we have only a few data sets available for a single domain.\nThe inference of SAGP is based on a Bayesian inference procedure. The model parameters can be estimated by maximizing the marginal likelihood, in which all the GPs are analytically integrated out. By deriving the posterior GP, we can predict the data value at any location point considering the spatial correlations and the dependences between areal data sets, simultaneously.\nThe major contributions of this paper are as follows:\n\n\u2022 We propose SAGP, a novel multivariate GP model that is defined by mixed latent GPs; it incorporates aggregation processes for handling multivariate areal data.\n\n\u2022 We develop a parameter estimation procedure based on the marginal likelihood in which latent GPs are analytically integrated out. This is the first explicit derivation of the posterior GP given multivariate areal data; it allows the prediction of values at any location point.\n\n\u2022 We conduct experiments on multiple real-world data sets from urban cities; the results show that our model can 1) accurately refine coarse-grained areal data, and 2) improve refinement performances by utilizing areal data sets from multiple cities.\n\n2 Related Work\n\nRelated works can be roughly categorized into two approaches: 1) regression-based model and 2) multivariate model. 
The major difference between them is as follows: Denoting $y_t$ and $y$ as target data and auxiliary data, respectively, the aim of the first approach is to design a conditional distribution $p(y_t \\mid y)$; the second approach designs a joint distribution $p(y_t, y)$.\nRegression-based models. A related problem has been addressed in the spatial statistics community under the name of downscaling, spatial disaggregation, areal interpolation, or fine-scale modeling [8], and this has attracted great interest in many disciplines such as socio-economics [2, 24], agricultural economics [11, 36], epidemiology [27], meteorology [33, 35], and geographical information systems (GIS) [7]. The problem of predicting point-referenced data from areal observations is also related to the change of support problem in geostatistics [8]. Regression-based models have been developed for refining coarse-grained target data via the use of multiple auxiliary data sets that have fine granularity (e.g., 1 km \u00d7 1 km grid cells) [18, 21]. These models learn the regression coefficients for the auxiliary data sets under the spatial aggregation constraints that encourage consistency between fine- and coarse-grained target data. The aggregation constraints have been incorporated via block kriging [5] or transformations of Gaussian process (GP) priors [19, 25]. There have been a number of advanced models that offer a fully Bayesian inference [13, 29, 34] or a variational inference [14] for model parameters. The task addressed in these works is to refine the coarse-grained target data on the assumption that the fine-grained auxiliary data are available; however, the areal data available on a city are actually associated with various geographical partitions (e.g., police precinct), thus one might not be able to obtain the fine-grained auxiliary data. 
In that case, these models cannot make full use\nof the auxiliary data sets with various granularities, which contain the coarse-grained auxiliary data.\nA GP-based model was recently proposed for re\ufb01ning coarse-grained areal data by utilizing auxiliary\ndata sets with various granularities [28]. In this model, GP regression is \ufb01rst applied to each auxiliary\ndata set for deriving a predictive distribution de\ufb01ned on the continuous space; this conceptually\ncorresponds to spatial interpolation. By hierarchically incorporating the predictive distributions\ninto the model, the regression coef\ufb01cients can be learned on the basis of not only the strength of\nrelationship with the target data but also the level of spatial granularity. A major disadvantage of\nthis model is that the spatial interpolation is separately conducted for each auxiliary data set, which\nmakes it dif\ufb01cult to accurately interpolate the coarse-grained auxiliary data due to the data sparsity\nissue; this model fails to fully use the coarse-grained data.\nIn addition, these regression-based models (e.g., [14, 21, 28]) do not consider the spatial aggregation\nconstraints for the auxiliary data sets. This is a critical issue in estimating the multivariate function\nfrom multiple areal data sets, the problem focused in this paper.\nDifferent from the regression-based models, we design a joint distribution that incorporates the\nspatial aggregation process for all areal data sets (i.e., for both target and auxiliary data sets). The\nproposed model infers the multivariate function while considering the spatial aggregation constraints\nfor respective areal data sets. This allows us to effectively utilize all areal data sets with various\ngranularities for the data re\ufb01nement even if some auxiliary data sets have coarse granularity.\nMultivariate models. 
The proposed model builds closely upon recent studies in multivariate spatial\nmodeling, which model the joint distribution of multiple outputs. Many geostatistics studies use the\nclassical method of co-kriging for predicting multivariate spatial data [20]; this method is, however,\nproblematic in that it is unclear how to de\ufb01ne cross-covariance functions that determine the depen-\ndences between data sets [6]. In the machine learning community, there has been growing interest in\nmultivariate GPs [22], in which dependences between data sets are introduced via methodologies\nsuch as process convolution [4, 10], latent factor modeling [16, 30], and multi-task learning [3, 17].\nThe linear model of coregionalization (LMC) is one of the most widely-used approaches for con-\nstructing a multivariate function; the outputs are expressed as linear combinations of independent\nlatent functions [39]. The semiparametric latent factor model (SLFM) is an instance of LMC, in\nwhich latent functions are de\ufb01ned by GPs [30]. Unfortunately, these multivariate models cannot be\nstraightforwardly used for modeling the areal data we focus on, because they assume that the data\nsamples are observed at location points; namely they do not have an essential mechanism, i.e., the\nspatial aggregation constraints, for handling data that has been aggregated over regions.\nThe proposed model is an extension of SLFM. To handle the multivariate areal data, we newly\nintroduce an observation model with the spatial aggregation process for all areal data sets; this is\nrepresented by the integral of the mixed GP over each corresponding region, as in block kriging. We\nalso derive the posterior GP, which enables us to obtain the multivariate function from the observed\nareal data sets. 
Furthermore, the sharing of key information (i.e., covariance function) can be used for transfer learning across a wide variety of areal data sets; this allows our model to robustly estimate the spatial correlations for areal data sets and to support areal data sets from multiple domains.\nMulti-task GP models have recently and independently been proposed for addressing similar problems [9, 37]. Main differences of our work from them are as follows: 1) Explicit derivation of the posterior GP given multivariate areal data; 2) transfer learning across multiple domains; 3) extensive experiments on real-world data sets defined on the two-dimensional input space.\n\n3 Proposed Model\n\nWe propose SAGP (Spatially Aggregated Gaussian Processes), a probabilistic model for inferring the multivariate function from areal data sets with various granularities. We first consider a formulation in the case of a single domain, then we mention an extension to the case of multiple domains.\n\nFigure 2: Graphical model representation of SAGP. (a) The case of a single domain. (b) The case of two domains.\n\nAreal data. We start by describing the areal data this study focuses on. For simplicity, let us consider the case of a single domain (e.g., a city). Assume that we have a wide variety of areal data sets from the same domain and each data set is associated with one of the geographical partitions that have various granularities. Let $S$ be the number of kinds of areal data sets. Let $\\mathcal{X} \\subset \\mathbb{R}^2$ denote an input space (e.g., a total region of a city), and $x \\in \\mathcal{X}$ denote an input variable, represented by its coordinates (e.g., latitude and longitude). For $s = 1, \\ldots, S$, the partition $\\mathcal{P}_s$ of $\\mathcal{X}$ is a collection of disjoint subsets, called regions, of $\\mathcal{X}$. Let $|\\mathcal{P}_s|$ be the number of regions in $\\mathcal{P}_s$. For $n = 1, \\ldots, |\\mathcal{P}_s|$, let $\\mathcal{R}_{s,n} \\in \\mathcal{P}_s$ denote the $n$-th region in $\\mathcal{P}_s$. 
Each areal observation is represented by the pair $(\\mathcal{R}_{s,n}, y_{s,n})$, where $y_{s,n} \\in \\mathbb{R}$ is a value associated with the $n$-th region $\\mathcal{R}_{s,n}$. Suppose that we have $S$ areal data sets $\\{(\\mathcal{R}_{s,n}, y_{s,n}) \\mid s = 1, \\ldots, S;\\ n = 1, \\ldots, |\\mathcal{P}_s|\\}$.\nFormulation for the case of a single domain. In the proposed model, the functions for the respective areal data sets on the continuous space are assumed to be the dependent Gaussian process (GP) with multivariate outputs. We first construct the multivariate dependent GP by linearly mixing some independent latent GPs. Consider $L$ independent GPs,\n\n$$g_l(x) \\sim \\mathcal{GP}(\\nu_l(x), \\gamma_l(x, x')), \\quad l = 1, \\ldots, L, \\tag{1}$$\n\nwhere $\\nu_l(x) : \\mathcal{X} \\to \\mathbb{R}$ and $\\gamma_l(x, x') : \\mathcal{X} \\times \\mathcal{X} \\to \\mathbb{R}$ are a mean function and a covariance function, respectively, for the $l$-th latent GP $g_l(x)$, both of which are assumed integrable. Defining $f_s(x)$ as the $s$-th GP, the $S$-dimensional dependent GP $f(x) = (f_1(x), \\ldots, f_S(x))^\\top$ is assumed to be modeled as a linear mixing of the $L$ independent latent GPs, then $f(x)$ is given by\n\n$$f(x) = W g(x) + n(x), \\tag{2}$$\n\nwhere $g(x) = (g_1(x), \\ldots, g_L(x))^\\top$, $W$ is an $S \\times L$ weight matrix whose $(s, l)$-entry $w_{s,l} \\in \\mathbb{R}$ is the weight of the $l$-th latent GP in the $s$-th data set, and $n(x) \\sim \\mathcal{GP}(0, \\Lambda(x, x'))$ is an $S$-dimensional zero-mean Gaussian noise process. Here, $0$ is a column vector of 0\u2019s and $\\Lambda(x, x') = \\mathrm{diag}(\\lambda_1(x, x'), \\ldots, \\lambda_S(x, x'))$ with $\\lambda_s(x, x') : \\mathcal{X} \\times \\mathcal{X} \\to \\mathbb{R}$ being a covariance function for the $s$-th Gaussian noise process. By integrating out $g(x)$, the multivariate GP $f(x)$ is given by\n\n$$f(x) \\sim \\mathcal{GP}(m(x), K(x, x')), \\tag{3}$$\n\nwhere the mean function $m(x) : \\mathcal{X} \\to \\mathbb{R}^S$ is given by $m(x) = W \\nu(x)$. The covariance function $K(x, x') : \\mathcal{X} \\times \\mathcal{X} \\to \\mathbb{R}^{S \\times S}$ is given by $K(x, x') = W \\Gamma(x, x') W^\\top + \\Lambda(x, x')$. Here, $\\nu(x) = (\\nu_1(x), \\ldots, \\nu_L(x))^\\top$ and $\\Gamma(x, x') = \\mathrm{diag}(\\gamma_1(x, x'), \\ldots, \\gamma_L(x, x'))$. The derivation of (3) is described in Appendix A of Supplementary Material. 
The $(s, s')$-entry of $K(x, x')$ is given by\n\n$$k_{s,s'}(x, x') = \\delta_{s,s'} \\lambda_s(x, x') + \\sum_{l=1}^{L} w_{s,l} w_{s',l} \\gamma_l(x, x'), \\tag{4}$$\n\nwhere $\\delta_{\\bullet,\\bullet}$ represents Kronecker\u2019s delta; $\\delta_{Z,Z'} = 1$ if $Z = Z'$ and $\\delta_{Z,Z'} = 0$ otherwise. The covariance function (4) for the multivariate GP $f(x)$ is represented by the linear combination of the covariance functions $\\{\\gamma_l(x, x')\\}_{l=1}^{L}$ for the latent GPs. The covariance functions for latent GPs are shared among all areal data sets, which allows us to effectively learn the spatial correlation for each data set by considering the dependences between data sets; this is advantageous in the case where the number of observations is small, that is, the spatial granularity of the areal data is coarse. In this paper we focus on the case $L < S$, with the aim of reducing the number of free parameters as this helps to avoid overfitting [30].\nThe areal data are not associated with location points but with regions, and their observations are obtained by spatially aggregating the original data. To handle the multivariate areal data, we design an observation model with a spatial aggregation process for each of the areal data sets. Let $y_s = (y_{s,1}, \\ldots, y_{s,|\\mathcal{P}_s|})$ be a $|\\mathcal{P}_s|$-dimensional vector consisting of the areal observations for the $s$-th areal data set. Let $y = (y_1, y_2, \\ldots, y_S)^\\top$ denote an $N$-dimensional vector consisting of the observations for all areal data sets, where $N = \\sum_{s=1}^{S} |\\mathcal{P}_s|$ is the total number of areal observations. Each areal observation is assumed to be obtained by integrating the mixed GP $f(x)$ over the corresponding region; $y$ is generated from the Gaussian distribution$^1$,\n\n$$y \\mid f(x) \\sim \\mathcal{N}\\left(y \\,\\middle|\\, \\int_{\\mathcal{X}} A(x) f(x)\\, dx,\\ \\Sigma\\right), \\tag{5}$$\n\nwhere $A(x) : \\mathcal{X} \\to \\mathbb{R}^{N \\times S}$ is represented by\n\n$$A(x) = \\begin{pmatrix} a_1(x) & 0 & \\cdots & 0 \\\\\\\\ 0 & a_2(x) & \\cdots & 0 \\\\\\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\\\\\ 0 & 0 & \\cdots & a_S(x) \\end{pmatrix}, \\tag{6}$$\n\nin which $a_s(x) = (a_{s,1}(x), \\ldots, a_{s,|\\mathcal{P}_s|}(x))^\\top$, whose entry $a_{s,n}(x)$ is a nonnegative weight function for spatial aggregation over region $\\mathcal{R}_{s,n}$. 
This formulation does not depend on the particular choice of $\\{a_{s,n}(x)\\}$, provided that they are integrable. If one takes, for region $\\mathcal{R}_{s,n}$,\n\n$$a_{s,n}(x) = \\frac{\\mathbb{1}(x \\in \\mathcal{R}_{s,n})}{\\int_{\\mathcal{X}} \\mathbb{1}(x' \\in \\mathcal{R}_{s,n})\\, dx'}, \\tag{7}$$\n\nwhere $\\mathbb{1}(\\bullet)$ is the indicator function; $\\mathbb{1}(Z) = 1$ if $Z$ is true and $\\mathbb{1}(Z) = 0$ otherwise, then $y_{s,n}$ is the average of $f_s(x)$ over $\\mathcal{R}_{s,n}$. We may also consider other aggregation processes to suit the property of the areal observations, including simple summation and population-weighted averaging over $\\mathcal{R}_{s,n}$. $\\Sigma = \\mathrm{diag}(\\sigma_1^2 I, \\ldots, \\sigma_S^2 I)$ in (5) is an $N \\times N$ block diagonal matrix, where $\\sigma_s^2$ is the noise variance for the $s$-th GP, and $I$ is the identity matrix. Figure 2(a) shows a graphical model representation of SAGP, where shaded and unshaded nodes indicate observed and latent variables, respectively.\nExtension to the case of multiple domains. It is possible to apply SAGP to areal data sets from multiple domains by assuming that observations are conditionally independent given the latent GPs $\\{g_l(x)\\}_{l=1}^{L}$. The graphical model representation of SAGP shown in Figure 2(b) is for the case of two domains. The superscript in Figure 2(b) is the domain index, and $\\mathcal{X}^u$ is the union of the input spaces for both domains. Although $y^{(1)}$ and $y^{(2)}$ in Figure 2(b) are not directly correlated across domains, the shared covariance functions $\\{\\gamma_l(x, x')\\}_{l=1}^{L}$ for the latent GPs can be learned by transfer learning based on the data sets from multiple domains; thus the spatial correlation for each data set could be more appropriately output via the covariance functions, even if we have only a few data sets available for a single domain. SAGP can be extended to the case of more domains in a similar fashion.\n\n4 Inference\n\nGiven the areal data sets, we aim to derive the posterior GP on the basis of a Bayesian inference procedure. 
The posterior GP can be used for predicting data values at any location point in the continuous space. The model parameters, $W$, $\\Lambda(x, x')$, $\\Sigma$, $\\nu(x)$, $\\Gamma(x, x')$, are estimated by maximizing the marginal likelihood, in which multivariate GP $f(x)$ is analytically integrated out; we then construct the posterior GP by using the estimated parameters.\nMarginal likelihood. Consider the case of a single domain. Given the areal data $y$, the marginal likelihood is given by\n\n$$p(y) = \\mathcal{N}(y \\mid \\mu, C), \\tag{8}$$\n\nwhere we analytically integrate out the GP prior $f(x)$. Here, $\\mu$ is an $N$-dimensional mean vector represented by\n\n$$\\mu = \\int_{\\mathcal{X}} A(x) m(x)\\, dx, \\tag{9}$$\n\nwhich is the integral of the mean function $m(x)$ over the respective regions for all areal data sets. $C$ is an $N \\times N$ covariance matrix represented by\n\n$$C = \\iint_{\\mathcal{X} \\times \\mathcal{X}} A(x) K(x, x') A(x')^\\top\\, dx\\, dx' + \\Sigma. \\tag{10}$$\n\nIt is an $S \\times S$ block matrix whose $(s, s')$-th block $C_{s,s'}$ is a $|\\mathcal{P}_s| \\times |\\mathcal{P}_{s'}|$ matrix represented by\n\n$$C_{s,s'} = \\iint_{\\mathcal{X} \\times \\mathcal{X}} k_{s,s'}(x, x') a_s(x) a_{s'}(x')^\\top\\, dx\\, dx' + \\delta_{s,s'} \\sigma_s^2 I. \\tag{11}$$\n\nEquation (11) provides the region-to-region covariance matrices in the form of the double integral of the covariance function $k_{s,s'}(x, x')$ over the respective pairs of regions in $\\mathcal{P}_s \\times \\mathcal{P}_{s'}$; this conceptually corresponds to aggregation of the covariance function values that are calculated at the infinite pairs of location points in the corresponding regions. Since the integrals over regions cannot be calculated analytically, in practice we use a numerical approximation of these integrals. Details are provided at the end of this section. 
This formulation allows for accurately evaluating the covariance between regions considering their shapes; this is extremely helpful as some input data are likely to originate from irregularly shaped regions (e.g., extremely elongated). By maximizing the logarithm of the marginal likelihood (8), we can estimate the parameters of SAGP.\n\n$^1$ We here assume that the integral appearing in (5) is well-defined. It should be noted that without additional assumptions sample paths of a Gaussian process are in general not integrable. See Appendix C of Supplementary Material for discussion on the conditions under which the integral is well-defined.\n\nTransfer learning across multiple domains. Consider the case of $V$ domains. Let $\\{y^{(v)}\\}_{v=1}^{V}$ denote the collection of data sets for the $V$ domains. In SAGP, the observations for different domains are assumed to be conditionally independent given the shared latent GPs $\\{g_l(x)\\}_{l=1}^{L}$; the marginal likelihood for $V$ domains is thus given by the product of those for the $V$ domains:\n\n$$p\\left(y^{(1)}, y^{(2)}, \\ldots, y^{(V)}\\right) = \\prod_{v=1}^{V} \\mathcal{N}\\left(y^{(v)} \\mid \\mu^{(v)}, C^{(v)}\\right), \\tag{12}$$\n\nwhere $\\mu^{(v)}$ and $C^{(v)}$ are the mean vector and the covariance matrix for the $v$-th domain, respectively. Estimation of model parameters based on (12) allows for transfer learning across the areal data sets from multiple domains via the shared covariance functions.\nPosterior GP. We have only to consider the case of a single domain, because the derivation of the posterior GP can be conducted independently for each domain. Given the areal data $y$ and the estimated parameters, the posterior GP $f^*(x)$ is given by\n\n$$f^*(x) \\sim \\mathcal{GP}(m^*(x), K^*(x, x')), \\tag{13}$$\n\nwhere $m^*(x) : \\mathcal{X} \\to \\mathbb{R}^S$ and $K^*(x, x') : \\mathcal{X} \\times \\mathcal{X} \\to \\mathbb{R}^{S \\times S}$ are the mean function and the covariance function for $f^*(x)$, respectively. Defining $H(x) : \\mathcal{X} \\to \\mathbb{R}^{N \\times S}$ as\n\n$$H(x) = \\int_{\\mathcal{X}} A(x') K(x', x)\\, dx', \\tag{14}$$\n\nwhich consists of the point-to-region covariances, which are the covariances between any location point $x$ and the respective regions in all areal data sets, the mean function $m^*(x)$ and the covariance function $K^*(x, x')$ are given by\n\n$$m^*(x) = m(x) + H(x)^\\top C^{-1} (y - \\mu), \\tag{15}$$\n$$K^*(x, x') = K(x, x') - H(x)^\\top C^{-1} H(x'), \\tag{16}$$\n\nrespectively. 
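
As a minimal sketch of (15) and (16) in finite dimensions, assume the integrals have already been approximated, so that C, mu, and H evaluated at a set of query points are plain NumPy arrays. The function name posterior_gp and the toy data are illustrative assumptions and do not come from the paper; the Cholesky-based solve is one common way to apply the inverse of C, not necessarily the authors' implementation.

```python
import numpy as np

def posterior_gp(m_q, K_q, H_q, C, y, mu):
    # m_q: prior mean at the query points, shape (P,)
    # K_q: prior covariance between query points, shape (P, P)
    # H_q: region-to-point covariances, shape (N, P)
    # C:   region-to-region covariance including noise, shape (N, N)
    chol = np.linalg.cholesky(C)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y - mu))
    V = np.linalg.solve(chol, H_q)
    mean = m_q + H_q.T @ alpha  # posterior mean, eq. (15)
    cov = K_q - V.T @ V         # posterior covariance, eq. (16)
    return mean, cov

# Toy check on a random positive-definite joint covariance (hypothetical data).
rng = np.random.default_rng(0)
N, P = 6, 3
B = rng.normal(size=(N + P, N + P))
J = B @ B.T + 1e-3 * np.eye(N + P)
C, H_q, K_q = J[:N, :N], J[:N, N:], J[N:, N:]
y, mu, m_q = rng.normal(size=N), np.zeros(N), np.zeros(P)
mean, cov = posterior_gp(m_q, K_q, H_q, C, y, mu)
```

The diagonal of cov gives the prediction variance at each query point, which is the quantity used for the uncertainty estimates.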
We can predict the data value at any location point by using the mean function (15). The second term in (15) shows that the predictions are calculated by considering the spatial correlations and the dependences between areal data sets, simultaneously. By using the covariance function (16), we can also evaluate the prediction uncertainty. Derivation of the posterior GP is detailed in Appendix B of Supplementary Material.\nApproximation of the integral over regions. The integrals over regions in (9), (11), and (14) cannot be performed analytically; thus we approximate these integrals by using sufficiently fine-grained square grid cells. We divide input space $\\mathcal{X}$ into square grid cells, and take $\\mathcal{G}_{s,n}$ to be the set of grid points that are contained in region $\\mathcal{R}_{s,n}$. Let us consider the approximation of the integral in the covariance matrix (11). The $(n, n')$-entry $C_{s,s'}(n, n')$ of $C_{s,s'}$ is approximated as follows:\n\n$$C_{s,s'}(n, n') = \\iint_{\\mathcal{X} \\times \\mathcal{X}} k_{s,s'}(x, x') a_{s,n}(x) a_{s',n'}(x')\\, dx\\, dx' + \\delta_{s,s'} \\delta_{n,n'} \\sigma_s^2 \\tag{17}$$\n$$\\approx \\frac{1}{|\\mathcal{G}_{s,n}|} \\frac{1}{|\\mathcal{G}_{s',n'}|} \\sum_{i \\in \\mathcal{G}_{s,n}} \\sum_{j \\in \\mathcal{G}_{s',n'}} k_{s,s'}(i, j) + \\delta_{s,s'} \\delta_{n,n'} \\sigma_s^2, \\tag{18}$$\n\nwhere we use the formulation of the region-average-observation model (7). The integrals in (9) and (14) can be approximated in a similar way. Letting $|\\mathcal{G}|$ denote the number of all grid points, the computational complexity of $C_{s,s'}$ (11) is $O(|\\mathcal{G}|^2)$; assuming the constant weight $a_{s,n}(x) = a_{s,n}$ (e.g., region average), the computational complexity can be reduced to $O(|\\mathcal{P}_s||\\mathcal{P}_{s'}||\\mathcal{D}|)$, where $|\\mathcal{D}|$ is the cardinality of the set of distinct distance values between grid points. Here, we use the property that $k_{s,s'}(i, j)$ in (18) depends only on the distance between $i$ and $j$. This is useful for reducing the computation time and the memory requirement. 
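
The grid approximation (18) can be sketched as follows. This is an illustrative example under stated assumptions: unit-variance squared-exponential latent kernels, the region-average weights (7), and hypothetical helper names (se_kernel, cross_kernel, region_cov, square_grid) and toy square regions that do not come from the paper. It uses the direct double sum over grid-point pairs and does not implement the distance-based speed-up.

```python
import numpy as np

def se_kernel(a, b, scale):
    # Squared-exponential kernel with unit signal variance between point sets.
    diff = a[:, None, :] - b[None, :, :]
    return np.exp(-(diff ** 2).sum(axis=-1) / (2.0 * scale ** 2))

def cross_kernel(a, b, w_s, w_t, scales):
    # Mixing term of eq. (4): sum over latent GPs of w_{s,l} * w_{s',l} * kernel_l.
    # The point-wise noise covariance term of (4) is omitted in this sketch.
    return sum(ws * wt * se_kernel(a, b, sc)
               for ws, wt, sc in zip(w_s, w_t, scales))

def region_cov(grid_a, grid_b, w_s, w_t, scales, noise_var=0.0):
    # Eq. (18): average the kernel over all grid-point pairs of the two regions.
    # Pass noise_var only when both regions are the same observation.
    return cross_kernel(grid_a, grid_b, w_s, w_t, scales).mean() + noise_var

def square_grid(x0, y0, side, step):
    # Centres of square grid cells covering a square region (toy geometry).
    xs = np.arange(x0 + step / 2, x0 + side, step)
    ys = np.arange(y0 + step / 2, y0 + side, step)
    gx, gy = np.meshgrid(xs, ys)
    return np.column_stack([gx.ravel(), gy.ravel()])

# Two adjacent 3 x 3 regions with 0.3-wide cells (hypothetical units).
region_a = square_grid(0.0, 0.0, 3.0, 0.3)
region_b = square_grid(3.0, 0.0, 3.0, 0.3)
w_s = np.array([1.0, 0.5])     # weights of data set s on L = 2 latent GPs
w_t = np.array([0.8, -0.2])    # weights of data set s'
scales = np.array([1.0, 2.5])  # kernel scale parameters

c_ab = region_cov(region_a, region_b, w_s, w_t, scales)
c_ba = region_cov(region_b, region_a, w_t, w_s, scales)
c_aa = region_cov(region_a, region_a, w_s, w_s, scales, noise_var=0.1)
```

Because the average runs over every pair of grid points, elongated or irregular regions are handled by their actual point sets rather than by a single centroid.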
The average computation times for inference\nwere 1728.2 and 115.1 seconds for the data sets from New York City and Chicago, respectively; the\nexperiments were conducted on a 3.1 GHz Intel Core i7.\n\n5 Experiments\n\nData. We evaluated SAGP using 10 and 3 real-world areal data sets from two cities, New York City\nand Chicago, respectively. They were obtained from NYC Open Data 2 and Chicago Data Portal 3.\nWe used a variety of areal data sets including poverty rate, air pollution rate, and crime rate. Each data\nset is associated with one of the prede\ufb01ned geographical partitions with various granularities: UHF42\n(42), community district (59), police precinct (77), and zip code (186) in New York City; police\nprecinct (25) and community district (77) in Chicago, where each number in parentheses denotes the\nnumber of regions in the corresponding partition. In the experiments, the data were normalized so\nthat each variable in each city has zero mean and unit variance. Details about the real-world data sets\nare provided in Appendix D of Supplementary Material.\nRe\ufb01nement task. We examined the task of re\ufb01ning coarse-grained areal data by using multiple areal\ndata sets with various granularities. To evaluate the performance in predicting the \ufb01ne-grained areal\ndata, we \ufb01rst picked up one target data set and used its coarser version for learning model parameters;\nthen we predicted the original \ufb01ne-grained target data by using the learned model. Note that the\n\ufb01ne-grained target data was used only for evaluating the re\ufb01nement performance; we did not use\nthem in the inference process. The target data sets were poverty rate (5, 59), PM2.5 (5, 42), crime (5,\n77) in New York City and poverty rate (9, 77) in Chicago, where each pair of numbers in parentheses\ndenotes the numbers of regions in the coarse- and the \ufb01ne-grained partitions, respectively. 
Defining $s'$ as the index of the target data set, the evaluation metric is the mean absolute percentage error (MAPE) of the fine-grained target values, $\\frac{1}{|\\mathcal{P}_{s'}|} \\sum_{n \\in \\mathcal{P}_{s'}} \\left| (y^{\\mathrm{true}}_{s',n} - y^*_{s',n}) / y^{\\mathrm{true}}_{s',n} \\right|$, where $y^{\\mathrm{true}}_{s',n}$ is the true value associated with the $n$-th region in the target fine-grained partition; $y^*_{s',n}$ is its predicted value, obtained by integrating the $s'$-th function $f^*_{s'}(x)$ of the posterior GP $f^*(x)$ (13) over the corresponding target fine-grained region.\nSetup of the proposed model. In our experiments, we used zero-mean Gaussian processes as the latent GPs $\\{g_l(x)\\}_{l=1}^{L}$, i.e., $\\nu_l(x) = 0$ for $l = 1, \\ldots, L$. We used the following squared-exponential kernel as the covariance function for the latent GPs, $\\gamma_l(x, x') = \\alpha_l^2 \\exp\\left(-\\|x - x'\\|^2 / (2\\beta_l^2)\\right)$, where $\\alpha_l^2$ is a signal variance that controls the magnitude of the covariance, $\\beta_l$ is a scale parameter that determines the degrees of spatial correlation, and $\\|\\cdot\\|$ is the Euclidean norm. Here, we set $\\alpha_l^2 = 1$ because the variance can already be represented by scaling the columns of $W$. For simplicity, the covariance function for the Gaussian noise process $n(x)$ is set to $\\Lambda(x, x') = \\mathrm{diag}(\\lambda_1^2 \\delta(x - x'), \\ldots, \\lambda_S^2 \\delta(x - x'))$, where $\\delta(\\bullet)$ is Dirac\u2019s delta function. The model parameters, $W$, $\\{\\lambda_s\\}_{s=1}^{S}$, $\\Sigma$, $\\{\\beta_l\\}_{l=1}^{L}$, were learned by maximizing the logarithm of the marginal likelihood (8) or (12) using the L-BFGS method [15] implemented in SciPy (https://www.scipy.org/). For approximating the integral over regions (see (18)), we divided a total region of each city into sufficiently fine-grained square grid cells, the size of which was 300 m \u00d7 300 m for both cities; the resulting sets of grid points $\\mathcal{G}$ for New York City and Chicago consisted of 9,352 and 7,400 grid points, respectively. 
The number $L$ of the latent GPs was chosen from $\\{1, \\ldots, S\\}$ via leave-one-out cross-validation [1]; the validation error was obtained using each held-out coarse-grained data value. Here, the validation was conducted on the basis of the coarse-grained target areal data; namely we did not use the fine-grained target data in the validation process.\n\n$^2$ https://opendata.cityofnewyork.us\n$^3$ https://data.cityofchicago.org/\n\nTable 1: MAPE and standard errors for the prediction of fine-grained areal data (a single city). The numbers in parentheses denote the number $L$ of the latent GPs determined by the validation procedure. The single star (*) and the double star (**) indicate significant difference between SAGP and other models at the levels of P values of < 0.05 and < 0.01, respectively.\n\n| Model | New York City: Poverty rate | New York City: PM2.5 | New York City: Crime | Chicago: Poverty rate |\n| --- | --- | --- | --- | --- |\n| GPR | 0.344 \u00b1 0.046 (\u2013) | 0.072 \u00b1 0.010 (\u2013) | 0.860 \u00b1 0.102 (\u2013) | 0.599 \u00b1 0.099 (\u2013) |\n| 2-stage GP | 0.210 \u00b1 0.022 (\u2013) | 0.042 \u00b1 0.005 (\u2013) | 0.454 \u00b1 0.075 (\u2013) | 0.380 \u00b1 0.060 (\u2013) |\n| SLFM | 0.207 \u00b1 0.025 (4) | 0.036 \u00b1 0.005 (6) | 0.401 \u00b1 0.053 (2) | 0.335 \u00b1 0.052 (2) |\n| SAGP | 0.177 \u00b1 0.019** (3) | 0.030 \u00b1 0.005* (5) | 0.379 \u00b1 0.055** (3) | 0.278 \u00b1 0.032** (2) |\n\nFigure 3: (a,b) Refined poverty rate data in NYC by (a) SAGP and (b) SLFM, and (c) visualization of the estimated parameters $W$ and $\\{\\beta_l\\}_{l=1}^{L}$ when predicting the poverty rate data in NYC. The radii of green and blue circles equal the values of $\\beta_l$ estimated by SAGP and SLFM, respectively. The edge widths are proportional to the absolute weights $|w_{s,l}|$ estimated by the respective models. Here, we omitted those edges whose absolute weights were lower than a threshold.\n\nBaselines. 
We compared the proposed model, SAGP, with naive Gaussian process regression (GPR) [22], a two-stage GP-based model (2-stage GP) [28], and the semiparametric latent factor model (SLFM) [30]. GPR predicts the fine-grained target data using only the coarse-grained target data. 2-stage GP is one of the latest regression-based models. SLFM is a multivariate GP model; SAGP can be regarded as an extension of SLFM. GPR and SLFM assume that data samples are observed at location points. We thus associate each areal observation with the centroid of the region. This simplification is also used for modeling the auxiliary data sets in [28].\nResults for the case of a single city. Table 1 shows MAPE and standard errors for GPR, 2-stage GP, SLFM, and SAGP. For all data sets, SAGP achieved better performance in refining coarse-grained areal data; the differences between SAGP and the baselines were statistically significant (Student\u2019s t-test). These results show that SAGP can utilize the areal data sets with various granularities from the same city to accurately predict the refined data. The results for all data sets from both cities are shown in Appendix E of Supplementary Material.\nFigures 3(a) and 3(b) show the refinement results of SAGP and SLFM for the poverty rate data in New York City. Here, the predictive values of each model were normalized to the range [0, 1], and darker hues represent regions with higher values. Compared with the true data in Figure 1, SAGP yielded more accurate fine-grained data than SLFM. Figure 3(c) visualizes the mixing weights $W$ and the scale parameters $\\{\\beta_l\\}_{l=1}^{L}$ estimated by SAGP and SLFM when predicting the fine-grained poverty rate data in New York City, where we selected four areal data sets: poverty rate, PM2.5, unemployment rate, and the number of 311 calls; their observations are also shown. 
One observes that the scale parameters estimated by SAGP are relatively small compared with those estimated by SLFM, presumably because the spatial aggregation process incorporated in SAGP effectively separates intrinsic spatial correlations from the apparent smoothing effects caused by the spatial aggregation that yields areal observations. A comparison of the estimated weights in Figure 3(c) shows that SAGP emphasized the useful dependences between data sets, e.g., the strong correlation between the poverty rate data and the unemployment rate data.\n\n[Figure 3(c) labels: Poverty, PM2.5, Unemployment, 311 call; scale bar: 10 km; legend: SLFM (L=4), SAGP (L=3)]\n\nOne benefit of SAGP is that all predictions associated with the target regions have uncertainty estimates, where the prediction variance can be calculated by integrating the covariance function $K^{*}(x, x')$ (16) of the posterior GP (13) over the corresponding target region. Figure 4 visualizes the variance with SAGP in the prediction of the poverty rate in New York City and Chicago. One observes that the variances at the regions located at the edge of the city tend to be larger than those inside the city. This is reasonable because extrapolation is generally more difficult than interpolation. These uncertainty estimates are useful in that the predictions may help guide policy and planning in a city even if validation of them is difficult.\n\nTable 2: MAPE and standard errors for the prediction of the fine-grained data (two cities).\n\nModel | Chicago: Poverty rate\nSLFM (trans) | 0.328 \u00b1 0.050 (6)\nSAGP (trans) | 0.219 \u00b1 0.023 (4)\n\nFigure 5: Refined poverty rate data in Chicago. (a) True, (b) SAGP (trans), (c) SLFM (trans).\n\nResults for the case of two cities. SLFM and SAGP can be used for transfer learning across multiple cities, which is advantageous when only a few data sets are available for a single city. 
We here show the results of refining the poverty rate data in Chicago while simultaneously utilizing the data sets from New York City. Table 2 shows MAPE and standard errors for SLFM (trans) and SAGP (trans). Comparing Tables 1 and 2, one observes that SAGP (trans) attained improved refinement performance compared with SLFM (trans) and models trained with only the data in a single city (i.e., Chicago). The differences between SAGP (trans) and the other models were statistically significant (Student\u2019s t-test, P value of < 0.01). This result shows that SAGP (trans) transferred knowledge across the cities, and yielded better refinement results even if there are only a few data sets available on the target city. Figure 5 shows the refinement results for the poverty rate data in Chicago. We illustrate the true data on the left in Figure 5, and the predictions attained by SAGP (trans) and SLFM (trans) on the right. As shown, SAGP (trans) better identified the key regions compared with SLFM (trans).\n\nFigure 4: Variance of the posterior GP with SAGP for predicting the poverty rate in New York City (Left) and Chicago (Right), respectively.\n\n6 Conclusion\n\nThis paper has proposed the Spatially Aggregated Gaussian Processes for inferring the multivariate function from multiple areal data sets with various granularities. To handle multivariate areal data, we design an observation model with the spatial aggregation process for each areal data set, which is the integral of the Gaussian process over the corresponding region. We have confirmed that our model can accurately refine the coarse-grained areal data, and improve the refinement performance by using the areal data sets from multiple cities.\nThere are several avenues that can be explored in future work. 
First, we can introduce nonlinear link functions, as in warped GP [26], and/or alternative likelihoods; this might help handle some kinds of observations (e.g., rates). Second, we can use scalable variational inference with inducing points, similar to [31], for large-scale data sets. Finally, our formulation provides a general framework for modeling aggregated data and offers a potential research direction; for instance, it has the ability to consider data aggregated on a higher dimensional input space, e.g., spatio-temporal aggregated data.\n\nReferences\n[1] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.\n[2] A. Bogomolov, B. Lepri, J. Staiano, N. Oliver, F. Pianesi, and A. Pentland. Once upon a crime: Towards crime prediction from demographics and mobile data. In ICMI, pages 427\u2013434, 2014.\n[3] E. Bonilla, K. M. Chai, and C. Williams. Multi-task Gaussian process prediction. In NeurIPS, pages 153\u2013160, 2008.\n[4] P. Boyle and M. Frean. Dependent Gaussian processes. In NeurIPS, pages 217\u2013224, 2005.\n[5] T. M. Burgess and R. Webster. Optimal interpolation and isarithmic mapping of soil properties. Journal of Soil Science, 31(2), 1980.\n[6] M. Gibbs and D. J. C. MacKay. Efficient implementation of Gaussian processes. Technical Report, 1997.\n[7] P. Goovaerts. Combining areal and point data in geostatistical interpolation: Applications to soil science and medical geography. Mathematical Geosciences, 42(5):535\u2013554, 2010.\n[8] C. A. Gotway and L. J. Young. Combining incompatible spatial data. Journal of the American Statistical Association, 97(458):632\u2013648, 2002.\n[9] O. Hamelijnck, T. Damoulas, K. Wang, and M. Girolami. Multi-resolution multi-task Gaussian processes. In NeurIPS, 2019 (to appear).\n[10] D. Higdon. Space and space-time modelling using process convolutions. Quantitative Methods for Current Environmental Issues, pages 37\u201356, 2002.\n[11] R. 
Howitt and A. Reynaud. Spatial disaggregation of agricultural production data using maximum entropy. European Review of Agricultural Economics, 30(2):359\u2013387, 2003.\n[12] M. Jerrett, R. T. Burnett, B. S. Beckerman, M. C. Turner, D. Krewski, and G. Thurston et al. Spatial analysis of air pollution and mortality in California. American Journal of Respiratory and Critical Care Medicine, 188(5):593\u2013599, 2013.\n[13] P. Keil, J. Belmaker, A. M. Wilson, P. Unitt, and W. Jetz. Downscaling of species distribution models: A hierarchical approach. Methods in Ecology and Evolution, 4(1):82\u201394, 2013.\n[14] H. C. L. Law, D. Sejdinovic, E. Cameron, T. C. D. Lucas, S. Flaxman, K. Battle, and K. Fukumizu. Variational learning on aggregate outputs with Gaussian processes. In NeurIPS, pages 6084\u20136094, 2018.\n[15] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1\u20133):503\u2013528, 1989.\n[16] J. Luttinen and A. Ilin. Variational Gaussian-process factor analysis for modeling spatio-temporal data. In NeurIPS, pages 1177\u20131185, 2009.\n[17] C. A. Micchelli and M. Pontil. Kernels for multi-task learning. In NeurIPS, pages 921\u2013928, 2004.\n[18] D. Murakami and M. Tsutsumi. A new areal interpolation technique based on spatial econometrics. Procedia-Social and Behavioral Sciences, 21:230\u2013239, 2011.\n[19] R. Murray-Smith and B. A. Pearlmutter. Transformations of Gaussian process priors. In DSMML, pages 110\u2013123, 2004.\n[20] D. E. Myers. Co-kriging \u2013 new developments. In G. Verly, M. David, A. G. Journel, and A. Marechal, editors, Geostatistics for Natural Resources Characterization: Part 1, volume 122 of NATO ASI Series C: Mathematical and Physical Sciences, pages 295\u2013305. D. Reidel Publishing, Dordrecht, 1984.\n[21] N.-W. Park. 
Spatial downscaling of TRMM precipitation using geostatistics and fine scale environmental variables. Advances in Meteorology, 2013:1\u20139, 2013.\n[22] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.\n[23] A. Rupasingha and S. J. Goetz. Social and political forces as determinants of poverty: A spatial analysis. The Journal of Socio-Economics, 36(4):650\u2013671, 2007.\n[24] C.-C. Smith, A. Mashhadi, and L. Capra. Poverty on the cheap: Estimating poverty maps using aggregated mobile communication networks. In CHI, pages 511\u2013520, 2014.\n[25] M. T. Smith, M. A. \u00c1lvarez, and N. D. Lawrence. Gaussian process regression for binned data. In arXiv e-prints, 2018.\n[26] E. Snelson, Z. Ghahramani, and C. E. Rasmussen. Warped Gaussian processes. In NeurIPS, pages 337\u2013344, 2004.\n[27] H. J. W. Sturrock, J. M. Cohen, P. Keil, A. J. Tatem, A. L. Menach, N. E. Ntshalintshali, M. S. Hsiang, and R. D. Gosling. Fine-scale malaria risk mapping from routine aggregated case data. Malaria Journal, 13:421, 2014.\n[28] Y. Tanaka, T. Iwata, T. Tanaka, T. Kurashima, M. Okawa, and H. Toda. Refining coarse-grained spatial data using auxiliary spatial data sets with various granularities. In AAAI, pages 5091\u20135100, 2019.\n[29] B. M. Taylor, R. Andrade-Pacheco, and H. J. W. Sturrock. Continuous inference for aggregated point process data. Journal of the Royal Statistical Society: Series A (Statistics in Society), page 12347, 2018.\n[30] Y. W. Teh, M. Seeger, and M. I. Jordan. Semiparametric latent factor models. In AISTATS, pages 333\u2013340, 2005.\n[31] M. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In AISTATS, pages 567\u2013574, 2009.\n[32] H. Wang, D. Kifer, C. Graif, and Z. Li. Crime rate inference with big data. In KDD, pages 635\u2013644, 2016.\n[33] R. L. Wilby, S. P. Zorita, E. Timbal, B. 
Whetton, and L. O. Mearns. Guidelines for Use of Climate Scenarios Developed from Statistical Downscaling Methods, 2004.\n[34] K. Wilson and J. Wakefield. Pointless spatial modeling. Biostatistics, 2018.\n[35] G. Wotling, C. Bouvier, J. Danloux, and J.-M. Fritsch. Regionalization of extreme precipitation distribution using the principal components of the topographical environment. Journal of Hydrology, 233(1\u20134):86\u2013101, 2000.\n[36] A. Xavier, M. B. C. Freitas, M. D. S. Ros\u00e1rio, and R. Fragoso. Disaggregating statistical data at the field level: An entropy approach. Spatial Statistics, 23:91\u2013103, 2016.\n[37] F. Yousefi, M. T. Smith, and M. A. \u00c1lvarez. Multi-task learning for aggregated data using Gaussian processes. In NeurIPS, 2019 (to appear).\n[38] J. Yuan, Y. Zheng, and X. Xie. Discovering regions of different functions in a city using human mobility and POIs. In KDD, pages 186\u2013194, 2012.\n[39] M. A. \u00c1lvarez, L. Rosasco, and N. D. Lawrence. Kernels for vector-valued functions: A review. Foundations and Trends\u00ae in Machine Learning, 4(3):195\u2013266, 2012.\n", "award": [], "sourceid": 1711, "authors": [{"given_name": "Yusuke", "family_name": "Tanaka", "institution": "NTT"}, {"given_name": "Toshiyuki", "family_name": "Tanaka", "institution": "Kyoto University"}, {"given_name": "Tomoharu", "family_name": "Iwata", "institution": "NTT"}, {"given_name": "Takeshi", "family_name": "Kurashima", "institution": "NTT Corporation"}, {"given_name": "Maya", "family_name": "Okawa", "institution": "NTT"}, {"given_name": "Yasunori", "family_name": "Akagi", "institution": "NTT Service Evolution Laboratories, NTT Corporation"}, {"given_name": "Hiroyuki", "family_name": "Toda", "institution": "NTT Service Evolution Laboratories, NTT Corporation, Japan"}]}