{"title": "Variational Learning on Aggregate Outputs with Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 6081, "page_last": 6091, "abstract": "While a typical supervised learning framework assumes that the inputs and the outputs are measured at the same levels of granularity, many applications, including global mapping of disease, only have access to outputs at a much coarser level than that of the inputs. Aggregation of outputs makes generalization to new inputs much more difficult. We consider an approach to this problem based on variational learning with a model of output aggregation and Gaussian processes, where aggregation leads to intractability of the standard evidence lower bounds. We propose new bounds and tractable approximations, leading to improved prediction accuracy and scalability to large datasets, while explicitly taking uncertainty into account. We develop a framework which extends to several types of likelihoods, including the Poisson model for aggregated count data. We apply our framework to a challenging and important problem, the fine-scale spatial modelling of malaria incidence, with over 1 million observations.", "full_text": "Variational Learning on Aggregate Outputs\n\nwith Gaussian Processes\n\nHo Chung Leon Law\u2217\nUniversity of Oxford\n\nDino Sejdinovic\u2020\nUniversity of Oxford\n\nEwan Cameron\u2021\nUniversity of Oxford\n\nTim CD Lucas\u2021\nUniversity of Oxford\n\nSeth Flaxman\u00a7\n\nImperial College London\n\nKatherine Battle\u2021\nUniversity Of Oxford\n\nKenji Fukumizu\u00b6\n\nInstitute of Statistical Mathematics\n\nAbstract\n\nWhile a typical supervised learning framework assumes that the inputs and the\noutputs are measured at the same levels of granularity, many applications, including\nglobal mapping of disease, only have access to outputs at a much coarser level\nthan that of the inputs. 
Aggregation of outputs makes generalization to new\ninputs much more difficult. We consider an approach to this problem based on\nvariational learning with a model of output aggregation and Gaussian processes,\nwhere aggregation leads to intractability of the standard evidence lower bounds. We\npropose new bounds and tractable approximations, leading to improved prediction\naccuracy and scalability to large datasets, while explicitly taking uncertainty into\naccount. We develop a framework which extends to several types of likelihoods,\nincluding the Poisson model for aggregated count data. We apply our framework\nto a challenging and important problem, the fine-scale spatial modelling of malaria\nincidence, with over 1 million observations.\n\n1 Introduction\n\nA typical supervised learning setup assumes existence of a set of input-output examples {(x_\u2113, y_\u2113)}_\u2113\nfrom which a functional relationship or a conditional probabilistic model of outputs given inputs can be\nlearned. A prototypical use-case is the situation where obtaining outputs y_\u22c6 for new, previously unseen,\ninputs x_\u22c6 is costly, i.e., labelling is expensive and requires human intervention, but measurements\nof inputs are cheap and automated. Similarly, in many applications, due to a much greater cost\nin acquiring labels, they are only available at a much coarser resolution than the level at which\nthe inputs are available and at which we wish to make predictions. This is the problem of weakly\nsupervised learning on aggregate outputs [14, 20], which has been studied in the literature in a\nvariety of forms, with classification and regression notably being developed separately and without\nany unified treatment which can allow more flexible observation models. In this contribution, we\nconsider a framework of observation models of aggregated outputs given bagged inputs, which reside\nin exponential families. 
While we develop a more general treatment, the main focus in the paper is\non the Poisson likelihood for count data, which is motivated by the applications in spatial statistics.\nIn particular, we consider the important problem of fine-scale mapping of diseases. High resolution\nmaps of infectious disease risk can offer a powerful tool for developing National Strategic Plans,\nallowing accurate stratification of intervention types to areas of greatest impact [5]. In low resource\nsettings these maps must be constructed through probabilistic models linking the limited observational\ndata to a suite of spatial covariates (often from remote-sensing images) describing social, economic,\nand environmental factors thought to influence exposure to the relevant infectious pathways. In\nthis paper, we apply our method to the incidence of clinical malaria cases. Point incidence data of\nmalaria is typically available at a high temporal frequency (weekly or monthly), but lacks spatial\nprecision, being aggregated by administrative district or by health facility catchment. The challenge\nfor risk modelling is to produce fine-scale predictions from these coarse incidence data, leveraging the\nremote-sensing covariates and appropriate regularity conditions to ensure a well-behaved problem.\n\n\u2217Department of Statistics, Oxford, UK. <ho.law@stats.ox.ac.uk>\n\u2020Department of Statistics, Oxford, UK. Alan Turing Institute, London, UK. <dino.sejdinovic@stats.ox.ac.uk>\n\u2021Big Data Institute, Oxford, UK. <dr.ewan.cameron@gmail.com, timcdlucas@gmail.com, katherine.battle@bdi.ox.ac.uk>\n\u00a7Department of Mathematics and Data Science Institute, London, UK. <s.flaxman@imperial.ac.uk>\n\u00b6Tachikawa, Tokyo, Japan. <fukumizu@ism.ac.jp>\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nMethodologically, the Poisson distribution is a popular choice for modelling count data. 
In the\nmapping setting, the intensity of the Poisson distribution is modelled as a function of spatial and\nother covariates. We use Gaussian processes (GPs) as a flexible model for the intensity. GPs are a\nwidely used approach in spatial modelling but also one of the pillars of Bayesian machine learning,\nenabling predictive models which explicitly quantify their uncertainty. Recently, we have seen many\nadvances in variational GP posterior approximations, allowing them to couple with more complex\nobservation likelihoods (e.g. binary or Poisson data [21, 17]) as well as a number of effective scalable\nGP approaches [24, 30, 8, 9], extending the applicability of GPs to dataset sizes previously deemed\nprohibitive.\n\nContribution Our contributions can be summarised as follows. A general framework is developed\nfor aggregated observation models using exponential families and Gaussian processes. This\nis novel, as previous work on aggregation or bag models focuses on specific types of output models\nsuch as binary classification. Tractable and scalable variational inference methods are proposed for\nseveral instances of the aggregated observation models, making use of novel lower bounds on the\nmodel evidence. In experiments, it is demonstrated that the proposed methods can scale to dataset\nsizes of more than 1 million observations. We thoroughly investigate an application of the developed\nmethodology to disease mapping from coarse measurements, where the observation model is Poisson,\ngiving encouraging results. Uncertainty quantification, which is explicit in our models, is essential\nfor this application.\n\nRelated Work The framework of learning from aggregate data is believed to have been\nfirst introduced in [20], which considers the two regimes of classification and regression. 
However,\nwhile the task of classification of individuals from aggregate data (also known as learning from\nlabel proportions) has been explored widely in the literature [23, 22, 13, 18, 35, 34, 14], there\nhas been little literature on the analogous regression regime in the machine learning community.\nPerhaps the closest literature available is [13], which considers a general framework for learning\nfrom aggregate data, but also only considers the classification case for experiments. In this\nwork, we will appropriately adjust the framework in [13] and take this to be our baseline. A\nrelated problem arises in the spatial statistics community under the name of \u2018down-scaling\u2019,\n\u2018fine-scale modelling\u2019 or \u2018spatial disaggregation\u2019 [11, 10], in the analysis of disease mapping,\nagricultural data, and species distribution modelling, with a variety of proposed methodologies\n(cf. [33] and references therein), including kriging [6]. However, to the best of our knowledge,\napproaches making use of recent advances in scalable variational inference for GPs have not been considered.\n\nAnother closely related topic is multiple instance learning (MIL), concerned with classification\nwith max-aggregation over labels in a bag, i.e. a bag is positively labelled if at least one individual\nis positive, and it is otherwise negatively labelled. While the task in MIL is typically to predict\nlabels of new unobserved bags, [7] demonstrates that individual labels of a GP classifier can also be\ninferred in the MIL setting with variational inference. Our work parallels that approach, considering\nbag observation models in exponential families and deriving new approximation bounds for some\ncommon generalized linear models. In deriving these bounds, we have taken an approach similar to\n[17], which considers the problem of Gaussian process-modulated Poisson process estimation using\nvariational inference. 
However, our problem is made more complicated by the aggregation of labels, as standard lower bounds to the marginal likelihood used in [17] are also intractable in our model.\nOther related research topics include distribution regression and set regression, as in [28, 15, 16] and [36]. In these regression problems, while the input data for learning is the same as in the current setup, the goal is to learn a function at the bag level, rather than the individual level. The application of these methods in our setting, naively treating single individuals as \u201cdistributions\u201d, may lead to suboptimal performance. An overview of some other approaches for classification using bags of instances is given in [4].\n\n2 Bag observation model: aggregation in mean parameters\n\nSuppose we have a statistical model p(y|\u03b7) for output y \u2208 Y, with parameter \u03b7 given by a function of input x \u2208 X, i.e., \u03b7 = \u03b7(x). Although one can formulate p(y|\u03b7) in an arbitrary fashion, practitioners often only focus on interpretable simple models, hence we restrict our attention to p(y|\u03b7) arising from exponential families. We assume that \u03b7 is the mean parameter of the exponential family.\nAssume that we have a fixed set of points x^a_i \u2208 X such that x^a = {x^a_1, . . . , x^a_{N_a}} is a bag of points with N_a individuals, and we wish to estimate the regression value \u03b7(x^a_i) for each individual. However, instead of the typical setup where we have a paired sample {(x_\u2113, y_\u2113)}_\u2113 of individuals and their outputs to use as a training set, we observe only aggregate outputs y^a for each of the bags. Hence, our training data is of the form\n\n({x^1_i}_{i=1}^{N_1}, y^1), . . . , ({x^n_i}_{i=1}^{N_n}, y^n),   (1)\n\nand the goal is to estimate parameters \u03b7(x^a_i) corresponding to individuals. To relate the aggregate y^a and the bag x^a = (x^a_i)_{i=1}^{N_a}, we use the following bag observation model:\n\ny^a | x^a \u223c p(y | \u03b7^a),   \u03b7^a = \u2211_{i=1}^{N_a} w^a_i \u03b7(x^a_i),   (2)\n\nwhere w^a_i is an optional fixed non-negative weight used to adjust the scales (see Section 3 for an example). Note that the aggregation in the bag observation model is on the mean parameters for individuals, not necessarily on the individual responses y^a_i. This implies that each individual contributes to the mean bag response and that the observation model for bags belongs to the same parametric form as the one for individuals. For tractable and scalable estimation, we will use variational methods, as the aggregated observation model leads to intractable posteriors. We consider the Poisson, normal, and exponential distributions, but devote a special focus to the Poisson model in this paper, and refer readers to Appendix A for other cases and to Appendix H.2 for experimental results for the Normal model.\nIt is also worth noting that we place no restrictions on the collection of the individuals, with the bagging process possibly dependent on covariates x^a_i or any unseen factors. The bags can also be of different sizes. After we obtain our individual model \u03b7(x), we can use it for prediction of in-bag individuals, as well as out-of-bag individuals.\n\n3 Poisson bag model: Modelling aggregate counts\n\nThe Poisson distribution p(y|\u03bb) = \u03bb^y e^{\u2212\u03bb}/(y!) is considered for count observations, and this paper discusses the Poisson regression with intensity \u03bb(x^a_i) multiplied by a \u2018population\u2019 p^a_i, which is a constant assumed to be known for each individual (or \u2018sub-bag\u2019) in the bag. The population for a bag a is given by p^a = \u2211_i p^a_i. An observed bag count y^a is assumed to follow\n\ny^a | x^a \u223c Poisson(p^a \u03bb^a),   \u03bb^a := \u2211_{i=1}^{N_a} (p^a_i / p^a) \u03bb(x^a_i).\n\nNote that, by introducing unobserved counts y^a_i \u223c Poisson(p^a_i \u03bb(x^a_i)), the bag observation y^a has the same distribution as \u2211_{i=1}^{N_a} y^a_i since the Poisson distribution is closed under convolutions. If a bag and its individuals correspond to an area and its partition in geostatistical applications, as in the malaria example in Section 4.2, the population in the above bag model can be regarded as the population of an area or a sub-area. With this formulation, the goal is to estimate the basic intensity function \u03bb(x) from the aggregated observations (1). Assuming independence given {x^a}_a, the negative log-likelihood (NLL) \u2113_0 across bags is\n\n\u2212 log[\u220f_{a=1}^{n} p(y^a | x^a)] =^c \u2211_{a=1}^{n} [p^a \u03bb^a \u2212 y^a log(p^a \u03bb^a)] =^c \u2211_{a=1}^{n} [\u2211_{i=1}^{N_a} p^a_i \u03bb(x^a_i) \u2212 y^a log(\u2211_{i=1}^{N_a} p^a_i \u03bb(x^a_i))],   (3)\n\nwhere =^c denotes an equality up to an additive constant. During training, this term will pass information from the bag level observations {y^a} to the individual basic intensity \u03bb(x^a_i). It is noted that once we have trained an appropriate model for \u03bb(x^a_i), we will be able to make individual level predictions, and also bag level predictions if desired. We will consider baselines with (3) using penalized likelihoods inspired by manifold regularization in semi-supervised learning [2], presented in Appendix B. 
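As a rough illustration of the bag-level NLL in (3), the following NumPy sketch evaluates it for a list of bags of different sizes. The function name bag_poisson_nll and the toy numbers are illustrative assumptions and are not taken from the paper or its released code.

```python
import numpy as np

def bag_poisson_nll(pops, rates, counts):
    # pops[a][i]  : population p_i^a of individual i in bag a (assumed known)
    # rates[a][i] : intensity lambda(x_i^a) predicted for that individual
    # counts[a]   : observed aggregate count y^a for bag a
    # Returns the NLL of (3), dropping the additive constant sum_a log(y^a!).
    total = 0.0
    for p, lam, y in zip(pops, rates, counts):
        # p^a * lambda^a = sum_i p_i^a * lambda(x_i^a)
        bag_mean = float(np.dot(p, lam))
        total += bag_mean - y * np.log(bag_mean)
    return total

# Toy example with two bags of different sizes (made-up numbers).
pops = [np.array([1.0, 2.0]), np.array([0.5, 0.5, 1.0])]
rates = [np.array([0.3, 0.1]), np.array([2.0, 1.0, 0.5])]
counts = [1, 3]
nll = bag_poisson_nll(pops, rates, counts)
```

Only the bag totals enter the loss, so any individual-level model for the rates can be trained through this term as long as it is differentiable in its parameters.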
In the next section, we propose a model for \u03bb based on GPs.\n\n3.1 VBAgg-Poisson: Gaussian processes for aggregate counts\n\nSuppose now that we model f as a Gaussian process (GP); then we have:\n\ny^a | x^a \u223c Poisson(\u2211_{i=1}^{N_a} p^a_i \u03bb^a_i),   \u03bb^a_i = \u03a8(f(x^a_i)),   f \u223c GP(\u03bc, k),   (4)\n\nwhere \u03bc and k are some appropriate mean function and covariance kernel k(x, y). (For implementation, we consider a constant mean function.) Since the intensity is always non-negative, in all models, we will need to use a transformation \u03bb(x) = \u03a8(f(x)), where \u03a8 is a non-negative valued function. We will consider cases \u03a8(f) = f^2 and \u03a8(f) = e^f. A discussion of various choices of this link function in the context of Poisson intensities modulated by GPs is given in [17]. Modelling f with a GP allows us to propagate uncertainty on the predictions to \u03bb^a_i, which is especially important in this weakly supervised problem setting, where we do not directly observe any individual output y^a_i. Since the total number of individuals in our target application of disease mapping is typically in the millions (see Section 4.2), we will approximate the posterior over \u03bb^a_i := \u03bb(x^a_i) using variational inference, with details found in Appendix E.\nFor scalability of the GP method, as in previous literature [7, 17], we use a set of inducing points {u_\u2113}_{\u2113=1}^{m}, which are given by the function evaluations of the Gaussian process f at landmark points W = {w_1, . . . , w_m}; i.e., u_\u2113 = f(w_\u2113). The distribution p(u|W) is thus given by\n\nu \u223c N(\u03bc_W, K_{WW}),   \u03bc_W = (\u03bc(w_\u2113))_\u2113,   K_{WW} = (k(w_s, w_t))_{s,t}.   (5)\n\nThe joint likelihood is given by:\n\np(y, f, u | X, W, \u0398) = \u220f_{a=1}^{n} Poisson(y^a | p^a \u03bb^a) p(f|u) p(u|W),  with f|u \u223c GP(\u03bc\u0303_u, K\u0303),   (6)\n\n\u03bc\u0303_u(z) = \u03bc_z + k_{zW} K_{WW}^{\u22121} (u \u2212 \u03bc_W),   K\u0303(z, z\u2032) = k(z, z\u2032) \u2212 k_{zW} K_{WW}^{\u22121} k_{Wz\u2032},   (7)\n\nwhere \u03bb^a and f depend on i implicitly, k_{zW} = (k(z, w_1), . . . , k(z, w_m))^T, \u03bc_W and \u03bc_z denote the respective evaluations of the mean function \u03bc, and \u0398 denotes the parameters of the mean and kernel functions of the GP. Proceeding similarly to [17], which discusses (non-bag) Poisson regression with GP, we obtain a lower bound of the marginal log-likelihood log p(y|\u0398), introducing a variational distribution q(u) (that we optimise):\n\nlog p(y|\u0398) = log \u222b\u222b p(y, f, u | X, W, \u0398) df du\n  \u2265 \u222b\u222b log{p(y | f, \u0398) p(u|W) / q(u)} p(f | u, \u0398) q(u) df du   (Jensen\u2019s inequality)\n  = \u222b\u222b {\u2211_a y^a log(\u2211_{i=1}^{N_a} p^a_i \u03a8(f(x^a_i))) \u2212 \u2211_a \u2211_{i=1}^{N_a} p^a_i \u03a8(f(x^a_i))} p(f|u) q(u) df du \u2212 \u2211_a log(y^a!) \u2212 KL(q(u)||p(u|W)) =: L(q, \u0398).   (8)\n\nThe general solution to the maximization over q of the evidence lower bound L(q, \u0398) above is given by the posterior of the inducing points p(u|y), which is intractable. We introduce a restriction to the class of q(u) to approximate the posterior p(u|y). Suppose that the variational distribution q(u) is Gaussian, q(u) = N(\u03b7_u, \u03a3_u). 
We then need to maximize the lower bound L(q, \u0398) over the variational parameters \u03b7_u and \u03a3_u.\nThe resulting q(u) gives an approximation to the posterior p(u|y), which also leads to a Gaussian approximation q(f) = \u222b p(f|u) q(u) du to the posterior p(f|y), which we finally transform through \u03a8 to obtain the desired approximate posterior on each \u03bb(x^a_i) (which is either log-normal or non-central \u03c7^2 depending on the form of \u03a8). The approximate posterior on \u03bb will then allow us to make predictions for individuals while, crucially, taking into account the uncertainties in f (note that even the posterior predictive mean of \u03bb will depend on the predictive variance in f due to the nonlinearity \u03a8). We also want to emphasise that the use of inducing variables is essential for scalability in our model: we cannot directly obtain approximations to the posterior of \u03bb(x^a_i) for all individuals, since their number is often large in our problem setting (Section 4.2).\nAs p(u|W) and q(u) are both Gaussian, the last term (KL-divergence) of (8) can be computed explicitly, with the exact form found in Appendix E.3. To consider the first two terms, let q^a(v^a) be the marginal normal distribution of v^a = (f(x^a_1), . . . , f(x^a_{N_a})), where f follows the variational posterior q(f). The distribution of v^a is then N(m^a, S^a), using (7):\n\nm^a = \u03bc_{x^a} + K_{x^a W} K_{WW}^{\u22121} (\u03b7_u \u2212 \u03bc_W),   S^a = K_{x^a, x^a} \u2212 K_{x^a W} (K_{WW}^{\u22121} \u2212 K_{WW}^{\u22121} \u03a3_u K_{WW}^{\u22121}) K_{W x^a}.   (9)\n\nIn the first term of (8), each summand is of the form\n\ny^a \u222b log(\u2211_{i=1}^{N_a} p^a_i \u03a8(v^a_i)) q^a(v^a) dv^a \u2212 \u2211_{i=1}^{N_a} p^a_i \u222b \u03a8(v^a_i) q^a(v^a) dv^a,   (10)\n\nin which the second term is tractable for both \u03a8(f) = f^2 and \u03a8(f) = e^f. 
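The tractability of the second term in (10) rests on two standard Gaussian identities: for v distributed as N(m, s), the second moment is m^2 + s and the mean of exp(v) is exp(m + s/2). The sketch below states both and checks them numerically with Gauss-Hermite quadrature. The function names are hypothetical and the check is an illustration only, not part of the paper's method.

```python
import numpy as np

def expected_intensity_sq(m, s):
    # E[v^2] for v ~ N(m, s): second moment of a Gaussian with variance s.
    return m ** 2 + s

def expected_intensity_exp(m, s):
    # E[exp(v)] for v ~ N(m, s): mean of a log-normal variable.
    return np.exp(m + 0.5 * s)

def gauss_hermite_expectation(fn, m, s, order=40):
    # Numerical check: E[fn(v)] via probabilists' Gauss-Hermite quadrature.
    # The quadrature weights integrate against exp(-x^2 / 2), hence the
    # division by sqrt(2 * pi) to normalise to a standard normal density.
    nodes, weights = np.polynomial.hermite_e.hermegauss(order)
    return np.sum(weights * fn(m + np.sqrt(s) * nodes)) / np.sqrt(2.0 * np.pi)

m, s = 0.3, 0.5
sq_closed = expected_intensity_sq(m, s)
sq_numeric = gauss_hermite_expectation(np.square, m, s)
exp_closed = expected_intensity_exp(m, s)
exp_numeric = gauss_hermite_expectation(np.exp, m, s)
```

Both identities depend on S^a only through its diagonal entries, since each expectation involves a single marginal v^a_i.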
The integral of the first term, however, is not tractable with q^a Gaussian. To solve this, we take different approaches for \u03a8(f) = f^2 and \u03a8(f) = e^f: for the former, an approximation by Taylor expansion is applied, while for the latter, a further lower bound is taken.\nFirst consider the case \u03a8(f) = f^2, and rewrite the first term of (10) as\n\ny^a E log \u2016V^a\u2016^2,  where V^a \u223c N(m\u0303^a, S\u0303^a),\n\nwith P^a = diag(p^a_1, . . . , p^a_{N_a}), m\u0303^a = (P^a)^{1/2} m^a and S\u0303^a = (P^a)^{1/2} S^a (P^a)^{1/2}. By a Taylor series approximation for E log \u2016V^a\u2016^2 (similar to [29]) around E\u2016V^a\u2016^2 = \u2016m\u0303^a\u2016^2 + tr S\u0303^a, we obtain\n\n\u222b log(\u2211_{i=1}^{N_a} p^a_i (v^a_i)^2) q^a(v^a) dv^a \u2248 log((m^a)^\u22a4 P^a m^a + tr(S^a P^a)) \u2212 [2 (m^a)^\u22a4 P^a S^a P^a m^a + tr((S^a P^a)^2)] / ((m^a)^\u22a4 P^a m^a + tr(S^a P^a))^2 =: \u03b6^a,   (11)\n\nwith details in Appendix E.4. An alternative approach, which we use for the case \u03a8(f) = e^f, is to take a further lower bound, which is applicable to a general class of \u03a8 (we provide further details for the analogous approach for \u03a8(v) = v^2 in Appendix E.2). We use the following Lemma (proof found in Appendix E.1):\nLemma 1. Let v = [v_1, . . . , v_N]^\u22a4 be a random vector with probability density q(v) with marginal densities q_i(v), and let w_i \u2265 0, i = 1, . . . , N. 
Then, for any non-negative valued function \u03a8(v),\n\n\u222b log(\u2211_{i=1}^{N} w_i \u03a8(v_i)) q(v) dv \u2265 log(\u2211_{i=1}^{N} w_i e^{\u03be_i}),  where \u03be_i := \u222b log \u03a8(v_i) q_i(v_i) dv_i.\n\nHence we obtain that\n\n\u222b log(\u2211_{i=1}^{N_a} p^a_i e^{v^a_i}) q^a(v^a) dv^a \u2265 log(\u2211_{i=1}^{N_a} p^a_i e^{m^a_i}).   (12)\n\nUsing the above two approximation schemes, our objective (up to constant terms) can be formulated as:\n\n1) \u03a8(v) = v^2:\n\nL^s_1(\u0398, \u03b7_u, \u03a3_u, W) := \u2211_{a=1}^{n} y^a \u03b6^a \u2212 \u2211_{a=1}^{n} \u2211_{i=1}^{N_a} p^a_i {(m^a_i)^2 + S^a_{ii}} \u2212 KL(q(u)||p(u|W)),   (13)\n\n2) \u03a8(v) = e^v:\n\nL^e_1(\u0398, \u03b7_u, \u03a3_u, W) := \u2211_{a=1}^{n} y^a log(\u2211_{i=1}^{N_a} p^a_i e^{m^a_i}) \u2212 \u2211_{a=1}^{n} \u2211_{i=1}^{N_a} p^a_i e^{m^a_i + S^a_{ii}/2} \u2212 KL(q(u)||p(u|W)).   (14)\n\nFigure 1: Left: Random samples on the Swiss roll manifold. Middle, Right: Individual Average NLL on train set for varying number of training bags n and increasing N_mean, over 5 repetitions. Constant prediction within bag gives a NLL of 2.22. The bag-pixel model gives NLL above 2.4 for the varying number of bags experiment.\n\nGiven these objectives, we can now optimise these lower bounds with respect to variational parameters {\u03b7_u, \u03a3_u} and parameters \u0398 of the mean and kernel functions, using stochastic gradient descent (SGD) on bags. Additionally, we might also learn W, the locations for the landmark points. 
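To make the Exp objective (14) more concrete, here is a minimal NumPy sketch of its data term for a single bag, taking the variational marginal means and variances as given. The helper name exp_bag_term and the example inputs are hypothetical, the populations p_i^a follow the bag model of Section 3, and the KL term is omitted.

```python
import numpy as np

def exp_bag_term(y, pops, m, s_diag):
    # y      : observed bag count y^a
    # pops   : populations p_i^a, shape (N_a,)
    # m      : variational marginal means m_i^a, shape (N_a,)
    # s_diag : variational marginal variances S_ii^a, shape (N_a,)
    # Lower bound (12) on the expected log term, scaled by the bag count.
    log_term = y * np.log(np.sum(pops * np.exp(m)))
    # Exact expected bag intensity, using the log-normal mean exp(m + s / 2).
    mean_term = np.sum(pops * np.exp(m + 0.5 * s_diag))
    return log_term - mean_term

# Toy bag with two individuals (made-up numbers).
pops = np.array([1.0, 1.0])
m = np.zeros(2)
no_var = exp_bag_term(2, pops, m, np.zeros(2))
with_var = exp_bag_term(2, pops, m, np.ones(2))
```

For large values of m a log-sum-exp formulation would be numerically safer; the direct form is kept here for readability.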
In this form, we\ncan also see that the bound for \u03a8(v) = e^v has the added computational advantage of not requiring\nthe full computation of the matrix S^a, but only its diagonal, while for \u03a8(v) = v^2 computation of \u03b6^a\ninvolves the full S^a, which may be problematic for extremely large bag sizes.\n\n4 Experiments\n\nWe will now demonstrate various approaches: Variational Bayes with Gaussian Process (VBAgg), a\nMAP estimator of Bayesian Poisson regression with explicit feature maps (Nystr\u00f6m) and a neural\nnetwork (NN), the latter two employing manifold regularisation with RBF kernel (unless stated\notherwise). For additional baselines, we consider a constant within bag model (constant), i.e. \u03bb\u0302^a_i = y^a / p^a,\nand also consider creating \u2018individual\u2019 covariates by aggregation of the covariates within a bag (bag-pixel).\nFor details of all these approaches, see Appendix B. We also denote \u03a8(v) = e^v and v^2 as Exp\nand Sq respectively.\nWe implement our models in TensorFlow\u2076 and use SGD with Adam [12] to optimise their respective\nobjectives, and we split the dataset into 4 parts, namely train, early-stop, validation and test set. Here\nthe early-stop set is used for early stopping for the Nystr\u00f6m, NN and bag-pixel models, while the\nVBAgg approach ignores this partition as it optimises the lower bound to the marginal likelihood.\nThe validation set is used for parameter tuning of any regularisation scaling, as well as learning rate,\nlayer size and multiple initialisations. Throughout, VBAgg and Nystr\u00f6m have access to the same set\nof landmarks for fair comparison. 
It is also important to highlight that we perform early stopping and tuning based on bag level performance on NLL only, as this is the only information available to us.\nFor the VBAgg model, there are two approaches to tuning: one is to choose parameters based on NLL on the validation bag sets; the other is to select all parameters based on the training objective L_1, the lower bound to the marginal likelihood. We denote the latter approach VBAgg-Obj and report its toy experimental results in Appendix H.1.1 for presentation purposes. In general, the results are relatively insensitive to this choice, especially when \u03a8(v) = v^2. To make predictions, we use the mean of our approximated posterior (provided by a log-normal and non-central \u03c7^2 distribution for Exp and Sq). As an additional evaluation, we report mean square error (MSE) and bag performance results in Appendix H.\n\n\u2076 Code is available on https://github.com/hcllaw/VBAgg\n\n4.1 Poisson Model: Swiss Roll\n\nWe first demonstrate our method on the swiss roll dataset\u2077, illustrated in Figure 1 (left). To make this an aggregate learning problem, we first construct n bags with sizes drawn from a negative binomial distribution N_a \u223c NB(N_mean, N_std), where N_mean and N_std represent the respective mean and standard deviation of N_a. We then randomly select \u2211_{a=1}^{n} N_a points from the swiss roll manifold to be the locations, giving us a set of colored locations in R^3. Ordering these random locations by their z-axis coordinate, we group them, filling up each bag in turn as we move along the z-axis. The aim of this is to simulate that in real life the partitioning of locations into bags is often not independent of covariates. Taking the colour of each location as the underlying rate \u03bb^a_i at that location, we simulate y^a_i \u223c Poisson(\u03bb^a_i), and take our observed outputs to be y^a = \u2211_{i=1}^{N_a} y^a_i \u223c Poisson(\u03bb^a), where \u03bb^a = \u2211_{i=1}^{N_a} \u03bb^a_i. Our goal is then to predict the underlying individual rate parameter \u03bb^a_i, given only bag-level observations y^a. To make this problem even more challenging, we embed the data manifold into R^18 by rotating it with a random orthogonal matrix. For the choice of k for VBAgg and Nystr\u00f6m, we use the RBF kernel, with the bandwidth parameter learnt. For landmark locations, we use the K-means++ algorithm, so that landmark points lie evenly across the data manifold.\n\nVarying number of Bags: n To see the effect of increasing the number of bags available for training, we fix N_mean = 150 and N_std = 50, and vary the number of bags n for the training set from 100 to 350 with the same number of bags for early stopping and validation. Each experiment is repeated for 5 runs, and results are shown in Figure 1 for individual NLL on the train set. Again we emphasise that the individual labels are not used in training. We see that all versions of VBAgg outperform all other models, in terms of MSE and NLL, with statistical significance confirmed by a signed rank permutation test (see Appendix H.1.1). We also observe that the bag-pixel model has poor performance, as a result of losing individual level covariate information in training by simply aggregating them.\n\nVarying number of individuals per bag: N_mean To study the effect of increasing bag sizes (with larger bag sizes, we expect \"disaggregation\" to be more difficult), we fix the number of training bags to be 600 with early stopping and validation set to be 150 bags, while varying the number of individuals per bag through N_mean and N_std in the negative binomial distribution. To keep the relative scales between N_mean and N_std the same, we take N_std = N_mean/2. 
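The bag construction used in these toy experiments can be sketched as follows. This is an illustrative reconstruction and not the authors' script: the moment-matching conversion from (N_mean, N_std) to NumPy's negative binomial parameters, the helper names, the clipping of bag sizes to at least one, and the random stand-ins for the swiss roll z-coordinates and colour-derived rates are all assumptions.

```python
import numpy as np

def sample_bag_sizes(rng, n_bags, n_mean, n_std):
    # NumPy parameterises the negative binomial by (r, p) with
    # mean r(1-p)/p and variance r(1-p)/p^2, so we match moments.
    var = n_std ** 2
    assert var > n_mean, 'negative binomial needs variance > mean'
    p = n_mean / var
    r = n_mean ** 2 / (var - n_mean)
    sizes = rng.negative_binomial(r, p, size=n_bags)
    return np.maximum(sizes, 1)  # assumption: avoid empty bags

def make_bags(rng, z, rates, sizes):
    # Order individuals by z-coordinate and fill bags in turn, so that
    # bag membership is not independent of the covariates.
    order = np.argsort(z)
    edges = np.cumsum(sizes)[:-1]
    index_bags = np.split(order, edges)
    # Individual counts y_i^a ~ Poisson(lambda_i^a); only their sum is kept.
    bag_counts = [int(rng.poisson(rates[idx]).sum()) for idx in index_bags]
    return index_bags, bag_counts

rng = np.random.default_rng(0)
sizes = sample_bag_sizes(rng, n_bags=100, n_mean=150, n_std=50)
total = int(sizes.sum())
z = rng.uniform(0.0, 20.0, size=total)     # stand-in for swiss roll z-coordinates
rates = rng.uniform(0.5, 3.0, size=total)  # stand-in for colour-derived rates
index_bags, bag_counts = make_bags(rng, z, rates, sizes)
```

A learner would then see only the covariates of each bag together with bag_counts, while the per-individual rates stay hidden and are used purely for evaluation.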
The results\nare shown in Figure 1, focusing on the best performing methods in the previous experiment. Here, we\nobserve that VBAgg models again perform better than the Nystr\u00f6m and NN models with statistical\nsignificance as reported in Appendix H.1.1, with performance stable as N_mean increases.\n\nDiscussion To gain more insight into the VBAgg model, we look at the calibration of\nour two different Bayesian models: VBAgg-Exp and VBAgg-Square. We compute their respective\nposterior quantiles and observe the ratio of times the true \u03bb^a_i lies in these quantiles. We present\nthese in Appendix H.1.1. The calibration plots reveal an interesting difference between the two\napproximations, e^v versus v^2 for \u03a8(v). While experiments showed that the two\nmodels perform similarly in terms of NLL, their calibration is very different. While\nVBAgg-Square is well calibrated in general, VBAgg-Exp suffers from poor calibration. This is\nnot surprising, as VBAgg-Exp uses an additional lower bound on model evidence. Thus, uncertainty\nestimates given by VBAgg-Exp should be treated with care.\n\n4.2 Malaria Incidence Prediction\n\nWe now demonstrate the proposed methodology on an important real life malaria prediction problem\nfor an endemic country from the Malaria Atlas Project database\u2078. In this problem, we would like\nto predict the underlying malaria incidence rate in each 1km by 1km region (referred to as a pixel),\nwhile having only observed aggregated incidences of malaria y^a at much larger regional levels, which\nare treated as bags of pixels. These bags are non-overlapping administrative units, with N_a pixels per\nbag ranging from 13 to 6,667, with a total of 1,044,683 pixels. 
In total, data is available for 957 bags9.\n\n7The swiss roll manifold function (for sampling) can be found on the Python scikit-learn package.\n8Due to con\ufb01dentiality reasons, we do not report country or plot the full map of our results.\n9We consider 576 bags for train, 95 bags each for validation and early-stop, with 191 bags for testing, with\n\ndifferent splits across different trials, selecting them to ensure distributions of labels are similar across sets.\n\n7\n\n\fFigure 2: Triangle denotes approximate start and end of river location, crosses denotes non-train set\nbags. Malaria incidence rate \u03bba\ni ), with constant model\n(Left), and VBAgg-Obj-Sq (tuned on Ls\n1) (Middle). Right: Standard deviation of the posterior v in\n(9) with VBAgg-Obj-Sq.\n\ni is per 1000 people. Left, Middle: log(\u02c6\u03bba\n\ni , as well as covariates xa\n\ni (per 1000 people) for pixel i in bag a,\nAlong with these pixels, we also have population estimates pa\ni \u2208 R18, collected by remote sensing. Some\nspatial coordinates given by sa\nexamples of covariates includes accessibility, distance to water, mean of land surface temperature\nand stable night lights. It is clear that rather than expecting malaria incidence rate to be constant\nthroughout the entire bag (as in Figure 2), we expect pixel incidence rate to vary, depending on social,\neconomic and environmental factors [32]. Our goal is therefore to build models that can predict\nmalaria incidence rates at a pixel level.\n\nWe assume a Poisson model on each individual pixel, i.e. 
y^a \u223c Poisson(\u2211_i p_i^a \u03bb_i^a), where \u03bb_i^a is the underlying pixel incidence rate of malaria per 1000 people that we are interested in predicting. We consider VBAgg, Nystr\u00f6m and NN as prediction models. For the VBAgg and Nystr\u00f6m methods, we use a kernel given as the sum of an ARD (automatic relevance determination) kernel on covariates and a Mat\u00e9rn kernel on spatial locations, learning all kernel parameters (the kernel expression is provided in Appendix G). We use the same kernel for manifold regularisation in the NN model. This kernel choice incorporates spatial information, while allowing feature selection amongst the other covariates. For the choice of landmarks, we ensure that landmarks are placed evenly throughout space by using one landmark point per training bag (selected by k-means++), so that the uncertainty estimates we obtain are not too sensitive to the choice of landmarks. In this problem, no individual-level labels are available, so we report bag NLL and MSE (on observed incidences) on the test bags in Appendix G over 10 different re-splits of the data. Although Nystr\u00f6m is the best-performing method, its improvement over the VBAgg models is not statistically significant. On the other hand, both the VBAgg and Nystr\u00f6m models outperform NN with statistical significance; NN also shows some instability in its predictions, as discussed in Appendix G.1. However, caution should be exercised when using bag-level performance as a surrogate for individual-level performance: in order to perform well at the bag level, one can simply utilise spatial coordinates and ignore other covariates, as malaria intensity appears to vary smoothly between the bags (Figure 2, left). 
However, we do not expect this to be true at the individual level.\n\nTo investigate this further, we consider a particular region and look at the predicted individual malaria incidence rate, with results shown in Figure 2 and in Appendix G.1 across 3 different data splits, where the behaviour of each of these models can be observed. While the Nystr\u00f6m and VBAgg methods both provide good bag-level performance, Nystr\u00f6m and VBAgg-Exp can sometimes produce overly smooth spatial patterns, which does not seem to be the case for the VBAgg-Sq method (recall that VBAgg-Sq performed best in both prediction and calibration in the toy experiments). In particular, VBAgg-Sq consistently predicts higher intensity along rivers (a known factor [31]; indicated by triangles in Figure 2) using only coarse aggregated intensities. This demonstrates that prediction of (unobserved) pixel-level intensities is possible using fine-scale environmental covariates, especially ones known to be relevant, such as the covariates indicated by the Topographic Wetness Index, a measure of wetness; see Appendix G.2 for more details.\n\nIn summary, by optimising the lower bound to the marginal likelihood, the proposed variational methods are able to learn useful relations between the covariates and pixel-level intensities, while avoiding the issue of overfitting to spatial coordinates. 
Furthermore, they also give uncertainty estimates (Figure 2, right), which are essential for problems like these, where validation of predictions is difficult yet the predictions may guide policy and planning.\n\n5 Conclusion\n\nMotivated by the vitally important problem of malaria, which is the direct cause of around 187 million clinical cases [3] and 631,000 deaths [5] each year in sub-Saharan Africa, we have proposed a general framework of aggregated observation models using Gaussian processes, along with scalable variational methods for inference in those models, making them applicable to large datasets. The proposed method allows learning in situations where outputs of interest are available at a much coarser level than that of the inputs, while explicitly quantifying uncertainty of predictions. The recent uptake of digital health information systems offers a wealth of new data, which is abstracted to the aggregate or regional levels to preserve patient anonymity. The volume of this data, as well as the availability of much more granular covariates provided by remote sensing and other geospatially tagged data sources, makes it possible to probabilistically disaggregate outputs of interest for finer risk stratification, e.g. assisting public health agencies in planning the delivery of disease interventions. This task demands new high-performance machine learning methods, and we see those that we have developed here as an important step in this direction.\n\nAcknowledgement\n\nWe thank Kaspar Martens for useful discussions, and Dougal Sutherland for providing the code base on which this work was based. HCLL is supported by the EPSRC and MRC through the OxWaSP CDT programme (EP/L016710/1). HCLL and KF are supported by JSPS KAKENHI 26280009. EC and KB are supported by OPP1152978, TL by OPP1132415 and the MAP database by OPP1106023. DS is supported in part by the ERC (FP7/617071) and by The Alan Turing Institute (EP/N510129/1). 
The data were provided by the Malaria Atlas Project supported by the Bill and Melinda Gates Foundation.\n\nReferences\n\n[1] LU Ancarani and G Gasaneo. Derivatives of any order of the confluent hypergeometric function 1F1(a, b, z) with respect to the parameter a or b. Journal of Mathematical Physics, 49(6):063508, 2008.\n\n[2] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of machine learning research, 7(Nov):2399\u20132434, 2006.\n\n[3] Samir Bhatt, DJ Weiss, E Cameron, D Bisanzio, B Mappin, U Dalrymple, KE Battle, CL Moyes, A Henry, PA Eckhoff, et al. The effect of malaria control on plasmodium falciparum in africa between 2000 and 2015. Nature, 526(7572):207, 2015.\n\n[4] Veronika Cheplygina, David M.J. Tax, and Marco Loog. On classification with bags, groups and sets. Pattern Recognition Letters, 59:11\u201317, 2015.\n\n[5] Peter W Gething, Daniel C Casey, Daniel J Weiss, Donal Bisanzio, Samir Bhatt, Ewan Cameron, Katherine E Battle, Ursula Dalrymple, Jennifer Rozier, Puja C Rao, et al. Mapping plasmodium falciparum mortality in africa between 1990 and 2015. New England Journal of Medicine, 375(25):2435\u20132445, 2016.\n\n[6] Pierre Goovaerts. Combining areal and point data in geostatistical interpolation: Applications to soil science and medical geography. Mathematical Geosciences, 42(5):535\u2013554, Jul 2010.\n\n[7] Manuel Hau\u00dfmann, Fred A Hamprecht, and Melih Kandemir. Variational bayesian multiple instance learning with gaussian processes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6570\u20136579, 2017.\n\n[8] James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data. 2013.\n\n[9] James Hensman, Alexander Matthews, and Zoubin Ghahramani. Scalable Variational Gaussian Process Classification. In Guy Lebanon and S. V. N. 
Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 351\u2013360, San Diego, California, USA, 09\u201312 May 2015. PMLR.\n\n[10] Richard Howitt and Arnaud Reynaud. Spatial disaggregation of agricultural production data using maximum entropy. European Review of Agricultural Economics, 30(3):359\u2013387, 2003.\n\n[11] Petr Keil, Jonathan Belmaker, Adam M Wilson, Philip Unitt, and Walter Jetz. Downscaling of species distribution models: a hierarchical approach. Methods in Ecology and Evolution, 4(1):82\u201394, 2013.\n\n[12] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n\n[13] Dimitrios Kotzias, Misha Denil, Nando De Freitas, and Padhraic Smyth. From group to individual labels using deep features. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 597\u2013606. ACM, 2015.\n\n[14] H. Kueck and N. de Freitas. Learning about individuals from group statistics. In UAI, pages 332\u2013339, 2005.\n\n[15] H. C. L. Law, C. Yau, and D. Sejdinovic. Testing and learning on distributions with symmetric noise invariance. In NIPS, 2017.\n\n[16] Ho Chung Leon Law, Dougal Sutherland, Dino Sejdinovic, and Seth Flaxman. Bayesian approaches to distribution regression. In International Conference on Artificial Intelligence and Statistics, pages 1167\u20131176, 2018.\n\n[17] Chris Lloyd, Tom Gunter, Michael Osborne, and Stephen Roberts. Variational inference for gaussian process modulated poisson processes. In International Conference on Machine Learning, pages 1814\u20131822, 2015.\n\n[18] Vitalik Melnikov and Eyke H\u00fcllermeier. Learning to aggregate using uninorms. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 756\u2013771. 
Springer, 2016.\n\n[19] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Sch\u00f6lkopf. Kernel mean embedding of distributions: A review and beyond. arXiv preprint arXiv:1605.09522, 2016.\n\n[20] David R Musicant, Janara M Christensen, and Jamie F Olson. Supervised learning by training on aggregate outputs. In Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on, pages 252\u2013261. IEEE, 2007.\n\n[21] H. Nickisch and CE. Rasmussen. Approximations for binary gaussian process classification. Journal of Machine Learning Research, 9:2035\u20132078, October 2008.\n\n[22] Giorgio Patrini, Richard Nock, Tiberio Caetano, and Paul Rivera. (Almost) no label no cry. In NIPS, 2014.\n\n[23] Novi Quadrianto, Alex J Smola, Tiberio S Caetano, and Quoc V Le. Estimating labels from label proportions. JMLR, 10:2349\u20132374, 2009.\n\n[24] Joaquin Qui\u00f1onero Candela and Carl Edward Rasmussen. A unifying view of sparse approximate gaussian process regression. J. Mach. Learn. Res., 6:1939\u20131959, December 2005.\n\n[25] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In NIPS, pages 1177\u20131184, 2007.\n\n[26] Carl Edward Rasmussen and Christopher KI Williams. Gaussian processes for machine learning, 2006.\n\n[27] Alex J Smola and Peter L Bartlett. Sparse greedy gaussian process regression. In Advances in neural information processing systems, pages 619\u2013625, 2001.\n\n[28] Zolt\u00e1n Szab\u00f3, Bharath K Sriperumbudur, Barnab\u00e1s P\u00f3czos, and Arthur Gretton. Learning theory for distribution regression. The Journal of Machine Learning Research, 17(1):5272\u20135311, 2016.\n\n[29] Yee W Teh, David Newman, and Max Welling. A collapsed variational bayesian inference algorithm for latent dirichlet allocation. In Advances in neural information processing systems, pages 1353\u20131360, 2007.\n\n[30] Michalis Titsias. 
Variational learning of inducing variables in sparse gaussian processes. In David van Dyk and Max Welling, editors, Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pages 567\u2013574, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA, 16\u201318 Apr 2009. PMLR.\n\n[31] DA Warrel, T Cox, J Firth, and Jr E Benz. Oxford textbook of medicine, 2017.\n\n[32] Daniel J Weiss, Bonnie Mappin, Ursula Dalrymple, Samir Bhatt, Ewan Cameron, Simon I Hay, and Peter W Gething. Re-examining environmental correlates of plasmodium falciparum malaria endemicity: a data-intensive variable selection approach. Malaria journal, 14(1):68, 2015.\n\n[33] Ant\u00f3nio Xavier, Maria de Bel\u00e9m Costa Freitas, Maria do Socorro Ros\u00e1rio, and Rui Fragoso. Disaggregating statistical data at the field level: An entropy approach. Spatial Statistics, 23:91\u2013108, 2018.\n\n[34] Felix X Yu, Krzysztof Choromanski, Sanjiv Kumar, Tony Jebara, and Shih-Fu Chang. On learning from label proportions. arXiv preprint arXiv:1402.5902, 2014.\n\n[35] Felix X Yu, Dong Liu, Sanjiv Kumar, Tony Jebara, and Shih-Fu Chang. \u221dSVM for learning with label proportions. arXiv preprint arXiv:1306.0886, 2013.\n\n[36] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan Salakhutdinov, and Alexander Smola. Deep sets. 
In NIPS, 2017.\n", "award": [], "sourceid": 2992, "authors": [{"given_name": "Ho Chung", "family_name": "Law", "institution": "University of Oxford"}, {"given_name": "Dino", "family_name": "Sejdinovic", "institution": "University of Oxford"}, {"given_name": "Ewan", "family_name": "Cameron", "institution": null}, {"given_name": "Tim", "family_name": "Lucas", "institution": "University of Oxford"}, {"given_name": "Seth", "family_name": "Flaxman", "institution": "Imperial College London"}, {"given_name": "Katherine", "family_name": "Battle", "institution": "University of Oxford"}, {"given_name": "Kenji", "family_name": "Fukumizu", "institution": "Institute of Statistical Mathematics"}]}