{"title": "Probabilistic Models for Integration Error in the Assessment of Functional Cardiac Models", "book": "Advances in Neural Information Processing Systems", "page_first": 110, "page_last": 118, "abstract": "This paper studies the numerical computation of integrals, representing estimates or predictions, over the output $f(x)$ of a computational model with respect to a distribution $p(\\mathrm{d}x)$ over uncertain inputs $x$ to the model. For the functional cardiac models that motivate this work, neither $f$ nor $p$ possess a closed-form expression and evaluation of either requires $\\approx$ 100 CPU hours, precluding standard numerical integration methods. Our proposal is to treat integration as an estimation problem, with a joint model for both the a priori unknown function $f$ and the a priori unknown distribution $p$. The result is a posterior distribution over the integral that explicitly accounts for dual sources of numerical approximation error due to a severely limited computational budget. This construction is applied to account, in a statistically principled manner, for the impact of numerical errors that (at present) are confounding factors in functional cardiac model assessment.", "full_text": "Probabilistic Models for Integration Error in the\n\nAssessment of Functional Cardiac Models\n\nChris. J. Oates1,5, Steven Niederer2, Angela Lee2, Fran\u00e7ois-Xavier Briol3, Mark Girolami4,5\n\n1Newcastle University, 2King\u2019s College London, 3University of Warwick,\n\n4Imperial College London, 5Alan Turing Institute\n\nAbstract\n\nThis paper studies the numerical computation of integrals, representing estimates\nor predictions, over the output f (x) of a computational model with respect to\na distribution p(dx) over uncertain inputs x to the model. 
For the functional cardiac models that motivate this work, neither f nor p possess a closed-form expression and evaluation of either requires ≈ 100 CPU hours, precluding standard numerical integration methods. Our proposal is to treat integration as an estimation problem, with a joint model for both the a priori unknown function f and the a priori unknown distribution p. The result is a posterior distribution over the integral that explicitly accounts for dual sources of numerical approximation error due to a severely limited computational budget. This construction is applied to account, in a statistically principled manner, for the impact of numerical errors that (at present) are confounding factors in functional cardiac model assessment.

1 Motivation: Predictive Assessment of Computer Models

This paper considers the problem of assessment for computer models [7], motivated by an urgent need to assess the performance of sophisticated functional cardiac models [25]. In concrete terms, the problem that we consider can be expressed as the numerical approximation of integrals

p(f ) = ∫ f (x) p(dx),   (1)

where f (x) denotes a functional of the output from a computer model and x denotes unknown inputs (or ‘parameters’) of the model. The term p(dx) denotes a posterior distribution over model inputs. Although not our focus in this paper, we note that p(dx) is defined based on a prior π0(x) over these inputs and training data y assumed to follow the computer model π(y|x) itself. The integral p(f ), in our context, represents a posterior prediction of actual cardiac behaviour. The computational model can be assessed through comparison of these predictions to test data generated from an experiment. The challenging nature of cardiac models – and indeed computer models in general – is such that a closed-form for both f (x) and p(dx) is precluded [23].
Instead, it is typical to be provided with a finite collection of samples {x_i}_{i=1}^n obtained from p(dx) through Monte Carlo (or related) methods [32]. The integrand f (x) is then evaluated at these n input configurations, to obtain {f (x_i)}_{i=1}^n. Limited computational budgets necessitate that the number n is small and, in such situations, the error of an estimator for the integral p(f ) based on the data {(x_i, f (x_i))}_{i=1}^n is subject to strict information-theoretic lower bounds [26]. The practical consequence is that an unknown (non-negligible) numerical error is introduced in the numerical approximation of p(f ), unrelated to the performance of the model. If this numerical error is ignored, it will constitute a confounding factor in the assessment of predictive performance for the computer model. It is therefore unclear how a fair model assessment can proceed. This motivates an attempt to understand the extent of numerical error in any estimate of p(f ). This is non-trivial; for example, the error distribution of the arithmetic mean (1/n) Σ_{i=1}^n f (x_i) depends on the unknown f and p, and attempts to estimate this distribution solely from data, e.g. via a bootstrap or a central limit approximation, cannot succeed in general when the number of samples n is small [27].

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Our first contribution, in this paper, is to argue that approximation of p(f ) from samples {x_i}_{i=1}^n and function evaluations {f (x_i)}_{i=1}^n can be cast as an estimation task. Our second contribution is to derive a posterior distribution over the unknown value p(f ) of the integral. This distribution provides an interpretable quantification of the extent of numerical integration error that can be reasoned with and propagated through subsequent model assessment. Our third contribution is to establish theoretical properties of the proposed method.
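The failure of purely data-driven error estimates at small n is easy to reproduce. The following minimal sketch (ours, not from the paper; the rare-event construction is hypothetical) computes the arithmetic-mean estimate together with the Student-t interval that the paper later recalls as Eqn. 12:

```python
import math
import random
import statistics

def mean_with_t_interval(fx, t_star):
    """Arithmetic-mean estimate of p(f) with the Student-t interval
    mean +/- t* s / sqrt(n) (cf. Eqn. 12 of the paper)."""
    n = len(fx)
    mean = sum(fx) / n
    s = statistics.stdev(fx)  # sample standard deviation, (n - 1) denominator
    half = t_star * s / math.sqrt(n)
    return mean, (mean - half, mean + half)

# Toy rare-event integrand (hypothetical, not the paper's test bed):
# f(x) = x and p puts mass 0.05 at x = 2 and mass 0.95 at x = 0, so p(f) = 0.1.
random.seed(0)
true_integral = 0.05 * 2.0
fx = [2.0 if random.random() < 0.05 else 0.0 for _ in range(2)]

# t* = 1.0 is the two-sided 50% critical value for n - 1 = 1 degrees of freedom.
mean, (lo, hi) = mean_with_t_interval(fx, t_star=1.0)
# With n = 2 the rare event is usually missed, so s = 0 and the interval
# collapses to a point: zero numerical error is (wrongly) reported.
```

With this seed both draws miss the rare event, the sample standard deviation is zero, and the interval acknowledges no numerical error at all, even though the estimate is off by the full value of the integral.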
The method we present falls within the framework of Probabilistic Numerics and our work can be seen as a contribution to this emerging area [16, 5]. In particular, the method proposed is reminiscent of Bayesian Quadrature (BQ) [9, 28, 29, 15]. In BQ, a Gaussian prior measure is placed on the unknown function f and is updated to a posterior when conditioned on the information {(x_i, f (x_i))}_{i=1}^n. This induces both a prior and a posterior over the value of p(f ) as push-forward measures under the projection operator f ↦ p(f ). Since its introduction, several authors have related BQ to other methods such as the ‘herding’ approach from machine learning [17, 3], random feature approximations used in kernel methods [1], classical quadrature rules [33] and Quasi Monte Carlo (QMC) methods [4]. Most recently, [21] extended theoretical results for BQ to misspecified prior models, while [22] provided efficient matrix algebraic methods for the implementation of BQ. However, as an important point of distinction, notice that BQ pre-supposes p(dx) is known in closed-form – it does not apply in situations where p(dx) is instead sampled. In this latter case p(dx) will be called an intractable distribution and, for model assessment, this scenario is typical.

To extend BQ to intractable distributions, this paper proposes to use a Dirichlet process mixture prior to estimate the unknown distribution p(dx) from Monte Carlo samples {x_i}_{i=1}^n [12]. It will be demonstrated that this leads to a simple expression for the closed-form terms which are required to implement the usual BQ. The overall method, called Dirichlet process mixture Bayesian quadrature (DPMBQ), constructs a (univariate) distribution over the unknown integral p(f ) that can be exploited to tease apart the intrinsic performance of a model from numerical integration error in model assessment. Note that BQ was used to estimate marginal likelihood in e.g. [30].
The present problem is distinct, in that we focus on predictive performance (of posterior expectations) rather than marginal likelihood, and its solution demands a correspondingly different methodological development.

On the computational front, DPMBQ costs O(n^3). However, this cost is de-coupled from the often orders-of-magnitude larger costs involved in both evaluation of f (x) and p(dx), which form the main computational bottleneck. Indeed, in the modern computational cardiac models that motivate this research, the ≈ 100 CPU hours required for a single simulation limit the number n of available samples to ≈ 10^3 [25]. At this scale, numerical integration error cannot be neglected in model assessment. This raises challenges when making assessments or comparisons between models, since the intrinsic performance of models cannot be separated from numerical error that is introduced into the assessment. Moreover, there is an urgent ethical imperative that the clinical translation of such models is accompanied by a detailed quantification of the unknown numerical error component in model assessment. Our contribution explicitly demonstrates how this might be achieved.

The remainder of the paper proceeds as follows: In Section 2.1 we first recall the usual BQ method, then in Section 2.2 we present and analyse our novel DPMBQ method. Proofs of theoretical results are contained in the electronic supplement. Empirical results are presented in Section 3 and the paper concludes with a discussion in Section 4.

2 Probabilistic Models for Numerical Integration Error

Consider a domain Ω ⊆ R^d, together with a distribution p(dx) on Ω. As in Eqn. 1, p(f ) will be used to denote the integral of the argument f with respect to the distribution p(dx). All integrands are assumed to be (measurable) functions f : Ω → R such that the integral p(f ) is well-defined.
To begin, we recall details for the BQ method when p(dx) is known in closed-form [9, 28]:

2.1 Probabilistic Integration for Tractable Distributions (BQ)

In standard BQ [9, 28], a Gaussian Process (GP) prior f ∼ GP(m, k) is assigned to the integrand f, with mean function m : Ω → R and covariance function k : Ω × Ω → R [see 31, for further details on GPs]. The implied prior over the integral p(f ) is then the push-forward of the GP prior through the projection f ↦ p(f ):

p(f ) ∼ N(p(m), p ⊗ p(k))

where p ⊗ p : Ω × Ω → R is the measure formed by independent products of p(dx) and p(dx′), so that under our notational convention the so-called initial error p ⊗ p(k) is equal to ∫∫ k(x, x′) p(dx) p(dx′). Next, the GP is conditioned on the information in {(x_i, f (x_i))}_{i=1}^n. The conditional GP takes a conjugate form f | X, f (X) ∼ GP(m_n, k_n), where we have written X = (x_1, . . . , x_n), f (X) = (f (x_1), . . . , f (x_n))⊤. Formulae for the mean function m_n : Ω → R and covariance function k_n : Ω × Ω → R are standard and can be found in [31, Eqns. 2.23, 2.24]. The BQ posterior over p(f ) is the push-forward of the GP posterior:

p(f ) | X, f (X) ∼ N(p(m_n), p ⊗ p(k_n))   (2)

Formulae for p(m_n) and p ⊗ p(k_n) were derived in [28]:

p(m_n) = f (X)⊤ k(X, X)^{-1} µ(X)   (3)
p ⊗ p(k_n) = p ⊗ p(k) − µ(X)⊤ k(X, X)^{-1} µ(X)   (4)

where k(X, X) is the n × n matrix with (i, j)th entry k(x_i, x_j) and µ(X) is the n × 1 vector with ith entry µ(x_i), where the function µ is called the kernel mean or kernel embedding [see e.g. 35]:

µ(x) = ∫ k(x, x′) p(dx′)   (5)

Computation of the kernel mean and the initial error each requires that p(dx) is known in general. The posterior in Eqn.
2 was studied in [4], where rates of posterior contraction were established under further assumptions on the smoothness of the covariance function k and the smoothness of the integrand. Note that the matrix inverse of k(X, X) incurs a (naive) computational cost of O(n^3); however this cost is post-hoc and decoupled from (more expensive) computation that involves the computer model. Sparse or approximate GP methods could also be used.

2.2 Probabilistic Integration for Intractable Distributions

The dependence of Eqns. 3 and 4 on both the kernel mean and the initial error means that BQ cannot be used for intractable p(dx) in general. To address this we construct a second non-parametric model for the unknown p(dx), presented next.

Dirichlet Process Mixture Model Consider an infinite mixture model

p(dx) = ∫ ψ(dx; φ) P (dφ),   (6)

where ψ : Ω × Φ → [0, ∞) is such that ψ(·; φ) is a distribution on Ω with parameter φ ∈ Φ and P is a mixing distribution defined on Φ. In this paper, each data point x_i is modelled as an independent draw from p(dx) and is associated with a latent variable φ_i ∈ Φ according to the generative process of Eqn. 6, i.e. x_i ∼ ψ(·; φ_i). To limit scope, the extension to correlated x_i is reserved for future work.

The Dirichlet process (DP) is the natural conjugate prior for non-parametric discrete distributions [12]. Here we endow P (dφ) with a DP prior P ∼ DP(α, Pb), where α > 0 is a concentration parameter and Pb(dφ) is a base distribution over Φ. The base distribution Pb coincides with the prior expectation E[P (dφ)] = Pb(dφ), while α determines the spread of the prior about Pb. The DP is characterised by the property that, for any finite partition Φ = Φ1 ∪ ··· ∪ Φm, it holds that (P (Φ1), . . .
, P (Φm)) ∼ Dir(αPb(Φ1), . . . , αPb(Φm)), where P (S) denotes the measure of the set S ⊆ Φ. For α → 0, the DP is supported on the set of atomic distributions, while for α → ∞, the DP converges to an atom on the base distribution. This overall approach is called a DP mixture (DPM) model [13].

For a random variable Z, the notation [Z] will be used as shorthand to denote the density function of Z. It will be helpful to note that for φ_i ∼ P independent, writing φ_{1:n} = (φ_1, . . . , φ_n), standard conjugate results for DPs lead to the conditional

P | φ_{1:n} ∼ DP( α + n , (α/(α + n)) Pb + (1/(α + n)) Σ_{i=1}^n δ_{φ_i} )

where δ_{φ_i}(dφ) is an atomic distribution centred at the location φ_i of the ith sample in φ_{1:n}. In turn, this induces a conditional [dp | φ_{1:n}] for the unknown distribution p(dx) through Eqn. 6.

Kernel Means via Stick Breaking The stick breaking characterisation can be used to draw from the conditional DP [34]. A generic draw from [P | φ_{1:n}] can be characterised as

P (dφ) = Σ_{j=1}^∞ w_j δ_{ϕ_j}(dφ),   w_j = β_j Π_{j′=1}^{j−1} (1 − β_{j′})   (7)

where randomness enters through the ϕ_j and β_j as follows:

ϕ_j ~iid (α/(α + n)) Pb + (1/(α + n)) Σ_{i=1}^n δ_{φ_i},   β_j ~iid Beta(1, α + n)

In practice the sum in Eqn. 7 may be truncated at a large finite number of terms, N, with negligible truncation error, since the weights w_j vanish at a geometric rate [18]. The truncated DP has been shown to provide accurate approximation of integrals with respect to the original DP [19]. For a realisation P (dφ) from Eqn. 7, observe that the induced distribution p(dx) over Ω is

p(dx) = Σ_{j=1}^∞ w_j ψ(dx; ϕ_j).   (8)

Thus we have an alternative characterisation of [p | φ_{1:n}].

Our key insight is that one can take ψ and k to be a conjugate pair, such that both the kernel mean µ(x) and the initial error p ⊗ p(k) will be available in an explicit form for the distribution in Eqn. 8 [see Table 1 in 4, for a list of conjugate pairs]. For instance, in the one-dimensional case, consider ϕ = (ϕ_1, ϕ_2) and ψ(dx; ϕ) = N(dx; ϕ_1, ϕ_2) for some location and scale parameters ϕ_1 and ϕ_2. Then for the Gaussian kernel k(x, x′) = ζ exp(−(x − x′)^2 / 2λ^2), the kernel mean becomes

µ(x) = Σ_{j=1}^∞ ( ζ λ w_j / (λ^2 + ϕ_{j,2})^{1/2} ) exp( −(x − ϕ_{j,1})^2 / (2(λ^2 + ϕ_{j,2})) )   (9)

and the initial variance can be expressed as

p ⊗ p(k) = Σ_{j=1}^∞ Σ_{j′=1}^∞ ( ζ λ w_j w_{j′} / (λ^2 + ϕ_{j,2} + ϕ_{j′,2})^{1/2} ) exp( −(ϕ_{j,1} − ϕ_{j′,1})^2 / (2(λ^2 + ϕ_{j,2} + ϕ_{j′,2})) ).   (10)

Similar calculations for the multi-dimensional case are straight-forward and provided in the Supplemental Information.

The Proposed Model To put this all together, let θ denote all hyper-parameters that (a) define the GP prior mean and covariance function, denoted m_θ and k_θ below, and (b) define the DP prior, such as α and the base distribution Pb. It is assumed that θ ∈ Θ for some specified set Θ.
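Under the Gaussian conjugate pair above, Eqns. 7, 9 and 10 can be evaluated directly for a truncated draw. The sketch below (our illustration; the function and variable names are ours) generates truncated stick-breaking weights and computes the closed-form kernel mean and initial error for a finite mixture:

```python
import math
import random

def stick_breaking_weights(alpha_plus_n, N, rng):
    """Truncated stick-breaking weights of Eqn. 7:
    w_j = beta_j * prod_{j' < j} (1 - beta_j'), beta_j ~ Beta(1, alpha + n)."""
    weights, remaining = [], 1.0
    for _ in range(N):
        beta = rng.betavariate(1.0, alpha_plus_n)
        weights.append(beta * remaining)
        remaining *= 1.0 - beta
    return weights

def kernel_mean(x, weights, phis, zeta, lam):
    """Eqn. 9: mu(x) for the Gaussian kernel k(x,x') = zeta exp(-(x-x')^2/2 lam^2)
    and mixture components psi(dx; phi) = N(dx; phi_1, phi_2), phi_2 a variance."""
    return sum(
        zeta * lam * w / math.sqrt(lam ** 2 + v)
        * math.exp(-((x - m) ** 2) / (2.0 * (lam ** 2 + v)))
        for w, (m, v) in zip(weights, phis)
    )

def initial_error(weights, phis, zeta, lam):
    """Eqn. 10: the initial variance p (x) p(k) of the integral p(f)."""
    total = 0.0
    for wj, (mj, vj) in zip(weights, phis):
        for wk, (mk, vk) in zip(weights, phis):
            denom = lam ** 2 + vj + vk
            total += (zeta * lam * wj * wk / math.sqrt(denom)
                      * math.exp(-((mj - mk) ** 2) / (2.0 * denom)))
    return total
```

For a single standard-normal component with ζ = λ = 1 these reduce to µ(0) = 1/√2 and p ⊗ p(k) = 1/√3, which provides a quick correctness check of the two formulae.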
The marginal posterior distribution for p(f ) in the DPMBQ model is defined as

[p(f ) | X, f (X)] = ∫∫ [p(f ) | X, f (X), p, θ] [dp | X, θ] [dθ].   (11)

The first term in the integral is BQ for a fixed distribution p(dx). The second term represents the DPM model for the unknown p(dx), while the third term [dθ] represents a hyper-prior distribution over θ ∈ Θ. The DPMBQ distribution in Eqn. 11 does not admit a closed-form expression. However, it is straight-forward to sample from this distribution without recourse to f (x) or p(dx). In particular, the second term can be accessed through the law of total probability:

[dp | X, θ] = ∫ [dp | φ_{1:n}] [φ_{1:n} | X, θ] dφ_{1:n}

where the first term [dp | φ_{1:n}] is the stick-breaking construction and the term [φ_{1:n} | X, θ] can be targeted with a Gibbs sampler. Full details of the procedure we used to sample from Eqn. 11, which is de-coupled from the much larger costs associated with the computer model, are provided in the Supplemental Information.

Theoretical Analysis The analysis reported below restricts attention to a fixed hyper-parameter θ and a one-dimensional state-space Ω = R. The extension of theoretical results to multiple dimensions was beyond the scope of this paper.

Our aim in this section is to establish when DPMBQ is “consistent”. To be precise, a random distribution P_n over an unknown parameter ζ ∈ R, whose true value is ζ_0, is called consistent for ζ_0 at a rate r_n if, for all δ > 0, we have P_n[(−∞, ζ_0 − δ) ∪ (ζ_0 + δ, ∞)] = O_P(r_n). Below we denote with f_0 and p_0 the respective true values of f and p; our aim is to estimate ζ_0 = p_0(f_0).
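For one realisation of p (one draw of the weights and atoms), the first term of Eqn. 11 is the Gaussian of Eqns. 2-4, so sampling from DPMBQ amounts to mixing these Gaussians over draws of p and θ. A sketch of that inner step follows (our code, not the paper's implementation; zero GP prior mean assumed, generic kernel k, and a hand-written linear solve so the snippet needs only the standard library):

```python
import math

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting (small n;
    in practice a Cholesky factorisation of the kernel matrix would be used)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z

def bq_for_fixed_p(xs, fxs, mu_at_xs, init_err, k):
    """One inner step of Eqn. 11: the Gaussian N(p(m_n), p (x) p(k_n)) of
    Eqns. 3-4 for a single realisation of p."""
    K = [[k(a, b) for b in xs] for a in xs]
    for i in range(len(xs)):
        K[i][i] += 1e-9  # jitter for numerical stability
    w = solve(K, list(mu_at_xs))  # K(X, X)^{-1} mu(X)
    mean = sum(wi * fi for wi, fi in zip(w, fxs))                 # Eqn. 3
    var = init_err - sum(wi * mi for wi, mi in zip(w, mu_at_xs))  # Eqn. 4
    return mean, max(var, 0.0)
```

The outer loop, not shown, would draw φ_{1:n} via a Gibbs sampler, draw p via stick breaking, supply µ(X) and p ⊗ p(k) from the closed forms of Eqns. 9-10, and collect one Gaussian sample of p(f ) per draw.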
Denote with H the reproducing kernel Hilbert space whose reproducing kernel is k and assume that the GP prior mean m is an element of H. Our main theoretical result below establishes that the DPMBQ posterior distribution in Eqn. 11, which is a random object due to the n independent draws x_i ∼ p(dx), is consistent:

Theorem. Let P_0 denote the true mixing distribution. Suppose that:

1. f belongs to H and k is bounded on Ω × Ω.
2. ψ(dx; ϕ) = N(dx; ϕ_1, ϕ_2).
3. P_0 has compact support supp(P_0) ⊂ R × (σ, ¯σ) for some fixed σ, ¯σ ∈ (0, ∞).
4. Pb has positive, continuous density on a rectangle R, s.t. supp(Pb) ⊆ R ⊆ R × [σ, ¯σ].
5. Pb({(ϕ_1, ϕ_2) : |ϕ_1| > t}) ≤ c exp(−γ|t|^δ) for some γ, δ > 0 and ∀ t > 0.

Then the posterior P_n = [p(f ) | X, f_0(X)] is consistent for the true value p_0(f_0) of the integral at the rate n^{−1/4+ε}, where the constant ε > 0 can be arbitrarily small.

The proof is provided in the Supplemental Information. Assumption (1) derives from results on consistent BQ [4] and can be relaxed further with the results in [21] (not discussed here), while assumptions (2-5) derive from previous work on consistent estimation with DPM priors [14]. For the case of BQ when p(dx) is known and H a Sobolev space of order s > 1/2 on Ω = [0, 1], the corresponding posterior contraction rate is exp(−C n^{2s−ε}) [4, Thm. 1]. Our work, while providing only an upper bound on the convergence rate, suggests that there is an increase in the fundamental complexity of estimation for p(dx) unknown compared to p(dx) known. Interestingly, the n^{−1/4+ε} rate is slower than the classical Bernstein-von Mises rate n^{−1/2} [36].
However, a direct comparison between these two quantities is not straightforward, as the former involves the interaction of two distinct non-parametric statistical models. It is known that Bernstein-von Mises results can be delicate for non-parametric problems [see, for example, the counter-examples in 10]. Rather, this theoretical analysis guarantees consistent estimation in a regime that is non-standard.

3 Results

The remainder of the paper reports empirical results from application of DPMBQ to simulated data and to computational cardiac models.

3.1 Simulation Experiments

To explore the empirical performance of DPMBQ, a series of detailed simulation experiments were performed. For this purpose, a flexible test bed was constructed wherein the true distribution p_0 was a normal mixture model (able to approximate any continuous density) and the true integrand f_0 was a polynomial (able to approximate any continuous function). In this set-up it is possible to obtain closed-form expressions for all integrals p_0(f_0) and these served as a gold-standard benchmark. To mimic the scenario of interest, a small number n of samples x_i were drawn from p_0(dx) and the integrand values f_0(x_i) were obtained. This information X, f_0(X) was provided to DPMBQ and the output of DPMBQ, a distribution over p(f ), was compared against the actual value p_0(f_0) of the integral. For all experiments in this paper the Gaussian kernel k defined in Sec. 2.2 was used; the integrand f was normalised and the associated amplitude hyper-parameter ζ = 1 fixed, whereas the length-scale hyper-parameter λ was assigned a Gam(2, 1) hyper-prior. For the DPM, the concentration parameter α was assigned an Exp(1) hyper-prior.
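The gold-standard benchmark mentioned above is available because moments of a normal mixture against a polynomial are closed-form. A sketch of that computation (our code; the test bed's actual polynomials and mixtures are in the Supplemental Information):

```python
import math

def gaussian_raw_moment(m, v, r):
    """E[X^r] for X ~ N(m, v): binomial expansion about the mean, using the
    standard-normal moments E[Z^j] = (j - 1)!! for even j, 0 for odd j."""
    total = 0.0
    for j in range(0, r + 1, 2):  # odd central moments vanish
        double_factorial = 1
        for t in range(1, j, 2):
            double_factorial *= t
        total += math.comb(r, j) * m ** (r - j) * double_factorial * v ** (j // 2)
    return total

def mixture_poly_integral(coeffs, weights, means, variances):
    """Closed-form p0(f0) for f0(x) = sum_r coeffs[r] x^r under the normal
    mixture p0 = sum_c weights[c] N(means[c], variances[c])."""
    return sum(
        w * sum(c * gaussian_raw_moment(m, v, r) for r, c in enumerate(coeffs))
        for w, m, v in zip(weights, means, variances)
    )
```

For example, f_0(x) = x^2 under N(0, 1) integrates to 1, and f_0(x) = x under an equal-weight mixture of N(1, 1) and N(3, 2) integrates to 2; both follow directly from the moment formula.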
These choices allowed for adaptation of DPMBQ to the smoothness of both f and p in accordance with the data presented to the method. The base distribution Pb for DPMBQ was taken to be normal inverse-gamma with hyper-parameters µ_0 = 0, λ_0 = α_0 = β_0 = 1, selected to facilitate a simplified Gibbs sampler. Full details of the simulation set-up and Gibbs sampler are reported in the Supplemental Information.

Figure 1: Simulated data results. (a) Comparison of coverage frequencies for the simulation experiments. (b) Convergence assessment: Wasserstein distance (W ) between the posterior in Eqn. 11 and the true value of the integral, is presented as a function of the number n of data points. [Circles represent independent realisations and the linear trend is shown in red.]

For comparison, we considered the default 50% confidence interval description of numerical error

( ¯f − t* s/√n , ¯f + t* s/√n )   (12)

where ¯f = n^{−1} Σ_{i=1}^n f (x_i), s^2 = (n − 1)^{−1} Σ_{i=1}^n (f (x_i) − ¯f )^2 and t* is the 50% level for a Student's t-distribution with n − 1 degrees of freedom. It is well-known that Eqn. 12 is a poor description of numerical error when n is small [c.f. “Monte Carlo is fundamentally unsound” 27]. For example, with n = 2, in the extreme case where, due to chance, f (x_1) ≈ f (x_2), it follows that s ≈ 0 and no numerical error is acknowledged. This fundamental problem is resolved through the use of prior information on the form of both f and p in DPMBQ. The appropriateness of DPMBQ therefore depends crucially on the prior. The proposed method is further distinguished from Eqn. 12 in that the distribution over numerical error is fully non-parametric, not e.g. constrained to be Student-t.

Empirical Results Coverage frequencies are shown in Fig. 1a for a specific integration task (f_0, p_0), that was deliberately selected to be difficult for Eqn. 12 due to the rare event represented by the mass at x = 2. These were compared against central 50% posterior credible intervals produced under DPMBQ. Coverage frequencies are the frequencies with which the confidence/credible intervals contain the true value of the integral, here estimated with 100 independent realisations for DPMBQ and 1000 for the (less computationally intensive) standard method (standard errors are shown for both). Whilst it offers correct coverage in the asymptotic limit, Eqn. 12 can be seen to be over-confident when n is small, with coverage often less than 50%. In contrast, DPMBQ accounts for the fact that p is being estimated and provides conservative estimation of the extent of numerical error when n is small.

To present results that do not depend on a fixed coverage level (e.g. 50%), we next measured convergence in the Wasserstein distance W = ∫ |p(f ) − p_0(f_0)| d[p(f ) | X, f (X)]. In particular we explored whether the theoretical rate of n^{−1/4+ε} was realised. (Note that the theoretical result applied just to fixed hyper-parameters, whereas the experimental results reported involved hyper-parameters that were marginalised, so that this is a non-trivial experiment.) Results in Fig. 1b demonstrated that W scaled with n at a rate which was consistent with the theoretical rate claimed. Full experimental results on our polynomial test bed, reported in detail in the Supplemental Information, revealed that W was larger for higher-degree polynomials (i.e. more complex integrands f), while W was insensitive to the number of mixture components (i.e. to more complex distributions p).
The latter observation may be explained by the fact that the kernel mean µ is a smoothed version of the distribution p and so is not expected to be acutely sensitive to variation in p itself.

3.2 Application to a Computational Cardiac Model

The Model The computational model considered in this paper is due to [24] and describes the mechanics of the left and right ventricles through a heart beat.

Figure 2: Cardiac model results: (a) Computational cardiac model. A) Segmentation of the cardiac MRI. B) Computational model of the left and right ventricles. C) Schematic image showing the features of pressure (left) and volume transient (right). (b) Comparison of coverage frequencies, for each of 10 numerical integration tasks defined by functionals gj of the cardiac model output.

In brief, the model geometry (Fig. 2a, top right) is described by fitting a C1 continuous cubic Hermite finite element mesh to segmented magnetic resonance images (MRI; Fig. 2a, top left). Cardiac electrophysiology is modelled separately by the solution of the mono-domain equations and provides a field of activation times across the heart. The passive material properties and afterload of the heart are described, respectively, by a transversely isotropic material law and a three element Windkessel model. Active contraction is simulated using a phenomenological cellular model, with spatial variation arising from the local electrical activation times. The active contraction model is defined by five input parameters: t_r and t_d are the respective constants for the rise and decay times, T_0 is the reference tension, a_4 and a_6 respectively govern the length dependence of tension rise time and peak tension. These five parameters were concatenated into a vector x ∈ R^5 and constitute the model inputs.
The model is fitted based on training data y that consist of functionals gj : R^5 → R, j = 1, . . . , 10, of the pressure and volume transient morphology during baseline activation and when the heart is paced from two leads implanted in the right ventricle apex and the left ventricle lateral wall. These 10 functionals are defined in the Supplemental Information; a schematic of the model and fitted measurements are shown in Fig. 2a (bottom panel).

Test Functions The distribution p(dx) was taken to be the posterior distribution over model inputs x that results from an improper flat prior on x and a squared-error likelihood function: log p(x) = const. − (1/0.1^2) Σ_{j=1}^{10} (y_j − g_j(x))^2. The training data y = (y_1, . . . , y_10) were obtained from clinical experiment. The task we considered is to compute posterior expectations for functionals f (x) of the model output produced when the model input x is distributed according to p(dx). This represents the situation where a fitted model is used to predict response to a causal intervention, representing a clinical treatment. For assessment of the DPMBQ method, which is our principal aim in this experiment, we simply took the test functions f to be each of the physically relevant model outputs gj in turn (corresponding to no causal intervention). This defined 10 separate numerical integration problems as a test bed. Benchmark values for p_0(gj) were obtained, as described in the Supplemental Information, at a total cost of ≈ 10^5 CPU hours, which would not be routinely practical.

Empirical Results For each of the 10 numerical integration problems in the test bed, we computed coverage probabilities, estimated with 100 independent realisations (standard errors are shown), in line with those discussed for simulation experiments. These are shown in Fig. 2b, where we compared Eqn. 12 with central 50% posterior credible intervals produced under DPMBQ.
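Coverage probabilities of this kind are themselves Monte Carlo estimates over independent realisations. The sketch below is entirely illustrative (the toy Gaussian task stands in for a real integration problem; the t* values are the standard two-sided 50% Student-t critical points) and shows how a coverage frequency and its standard error are computed:

```python
import math
import random
import statistics

# Two-sided 50% Student-t critical values t* for n - 1 degrees of freedom.
T_STAR_50 = {2: 1.000, 5: 0.741, 20: 0.688}

def toy_interval(rng, n):
    """One realisation of a stand-in integration task: the true integral is 0
    and the interval is the Eqn. 12 construction from n draws f(x_i) ~ N(0, 1)."""
    fx = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mean = sum(fx) / n
    half = T_STAR_50[n] * statistics.stdev(fx) / math.sqrt(n)
    return mean - half, mean + half

def coverage_frequency(n, trials, rng):
    """Fraction of trials whose interval covers the truth, with binomial SE."""
    hits = 0
    for _ in range(trials):
        lo, hi = toy_interval(rng, n)
        hits += int(lo <= 0.0 <= hi)
    p_hat = hits / trials
    return p_hat, math.sqrt(p_hat * (1.0 - p_hat) / trials)
```

For this idealised Gaussian task the t interval is exact, so the estimated coverage sits near the nominal 50%; the point of Fig. 2b is that for the integration tasks of interest the coverage of Eqn. 12 can fall below that level.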
It is seen that Eqn. 12 is usually reliable but can sometimes be over-confident, with coverage probabilities less than 50%. This over-confidence can lead to spurious conclusions on the predictive performance of the computational model. In contrast, DPMBQ provides a uniformly conservative quantification of numerical error (cover. prob. ≥ 50%). The DPMBQ method is further distinguished from Eqn. 12 in that it entails a joint distribution for the 10 integrals (the unknown p is shared across integrals – an instance of transfer learning across the 10 integration tasks). Fig. 2b also appears to show a correlation structure in the standard approach (black lines), but this is an artefact of the common sample set {x_i}_{i=1}^n that was used to simultaneously estimate all 10 integrals; Eqn. 12 is still applied independently to each integral.

4 Discussion

Numerical analysis often focuses on the convergence order of numerical methods, but in non-asymptotic regimes the language of probabilities can provide a richer, more intuitive and more useful description of numerical error. This paper cast the computation of integrals p(f ) as an estimation problem amenable to Bayesian methods [20, 9, 5]. The difficulty of this problem depends on our level of prior knowledge (rendering the problem trivial if a closed-form solution is a priori known) and, in the general case, on how much information we are prepared to obtain on the objects f and p through numerical computation [16]. In particular, we distinguish between three states of prior knowledge: (1) f known, p unknown, (2) f unknown, p known, (3) both f and p unknown. Case (1) is the subject of Monte Carlo methods [32] and concerns classical problems in applied probability such as estimating confidence intervals for expectations based on Markov chains.
Notable recent work in this direction is [8], who obtained a point estimate p̂ for p using a kernel smoother and then, in effect, used p̂(f) as an estimate for the integral. The decision-theoretic risk associated with error in p̂ was explored in [6]. Independent of integral estimation, there is a large literature on density estimation [37]. Our probabilistic approach provides a Bayesian solution to this problem, as a special case of our more general framework. Case (2) concerns functional analysis, where [26] provide an extensive overview of theoretical results on approximation of unknown functions in an information-complexity framework. As a rule of thumb, estimation improves when additional smoothness can be a priori assumed on the value of the unknown object [see 4]. The main focus of this paper was Case (3), until now unstudied, and a transparent, general statistical method called DPMBQ was proposed.

The path-finding nature of this work raises several important questions for future theoretical and applied research. First, these methods should be extended to account for the low-rank phenomenon that is often encountered in multi-dimensional integrals [11]. Second, there is no reason, in general, to restrict attention to function values obtained at the locations in X. Indeed, one could first estimate p(dx), then select suitable locations X′ at which to evaluate f(X′) [2]. This touches on aspects of statistical experimental design; the practitioner seeks a set X′ that minimises an appropriate loss functional at the level of p(f); see again [6]. Third, whilst we restricted attention to Gaussians in our experiments, further methodological work will be required to establish guidance for the choice of kernel k in the GP and the choice of base distribution P_b in the DPM [cf. Chapter 4 of 31].

Acknowledgments

CJO and MG were supported by the Lloyd's Register Foundation Programme on Data-Centric Engineering.
SN was supported by an EPSRC Intermediate Career Fellowship. FXB was supported by the EPSRC grant [EP/L016710/1]. MG was supported by the EPSRC grants [EP/K034154/1, EP/R018413/1, EP/P020720/1, EP/L014165/1] and an EPSRC Established Career Fellowship [EP/J016934/1]. This material was based upon work partially supported by the National Science Foundation (NSF) under Grant DMS-1127914 to the Statistical and Applied Mathematical Sciences Institute. Opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.

References

[1] F Bach. On the Equivalence Between Quadrature Rules and Random Features. Journal of Machine Learning Research, 18:1–38, 2017.

[2] F-X Briol, CJ Oates, J Cockayne, WY Chen, and M Girolami. On the Sampling Problem for Kernel Quadrature. In Proceedings of the 34th International Conference on Machine Learning, pages 586–595, 2017.

[3] F-X Briol, CJ Oates, M Girolami, and MA Osborne. Frank–Wolfe Bayesian Quadrature: Probabilistic Integration with Theoretical Guarantees. In Advances in Neural Information Processing Systems, pages 1162–1170, 2015.

[4] F-X Briol, CJ Oates, M Girolami, MA Osborne, and D Sejdinovic. Probabilistic Integration: A Role for Statisticians in Numerical Analysis? arXiv:1512.00933, 2015.

[5] J Cockayne, CJ Oates, T Sullivan, and M Girolami. Bayesian Probabilistic Numerical Methods. arXiv:1702.03673, 2017.

[6] SN Cohen. Data-Driven Nonlinear Expectations for Statistical Uncertainty in Decisions. arXiv:1609.06545, 2016.

[7] PS Craig, M Goldstein, JC Rougier, and AH Seheult. Bayesian Forecasting for Complex Systems Using Computer Simulators. Journal of the American Statistical Association, 96(454):717–729, 2001.

[8] B Delyon and F Portier. Integral Approximation by Kernel Smoothing.
Bernoulli, 22(4):2177–2208, 2016.

[9] P Diaconis. Bayesian Numerical Analysis. Statistical Decision Theory and Related Topics IV, 1:163–175, 1988.

[10] P Diaconis and D Freedman. On the Consistency of Bayes Estimates. Annals of Statistics, 14(1):1–26, 1986.

[11] J Dick, FY Kuo, and IH Sloan. High-Dimensional Integration: The Quasi-Monte Carlo Way. Acta Numerica, 22:133–288, 2013.

[12] TS Ferguson. A Bayesian Analysis of Some Nonparametric Problems. Annals of Statistics, 1(2):209–230, 1973.

[13] TS Ferguson. Bayesian Density Estimation by Mixtures of Normal Distributions. Recent Advances in Statistics, 24(1983):287–302, 1983.

[14] S Ghosal and AW van der Vaart. Entropies and Rates of Convergence for Maximum Likelihood and Bayes Estimation for Mixtures of Normal Densities. Annals of Statistics, 29(5):1233–1263, 2001.

[15] T Gunter, MA Osborne, R Garnett, P Hennig, and SJ Roberts. Sampling for Inference in Probabilistic Models with Fast Bayesian Quadrature. In Advances in Neural Information Processing Systems, pages 2789–2797, 2014.

[16] P Hennig, MA Osborne, and M Girolami. Probabilistic Numerics and Uncertainty in Computations. Proceedings of the Royal Society A, 471(2179):20150142, 2015.

[17] F Huszár and D Duvenaud. Optimally-Weighted Herding is Bayesian Quadrature. In Uncertainty in Artificial Intelligence, volume 28, pages 377–386, 2012.

[18] H Ishwaran and LF James. Gibbs Sampling Methods for Stick-Breaking Priors. Journal of the American Statistical Association, 96(453):161–173, 2001.

[19] H Ishwaran and M Zarepour. Exact and Approximate Sum Representations for the Dirichlet Process. Canadian Journal of Statistics, 30(2):269–283, 2002.

[20] JB Kadane and GW Wasilkowski.
Average Case ε-Complexity in Computer Science: A Bayesian View. Bayesian Statistics 2, Proceedings of the Second Valencia International Meeting, pages 361–374, 1985.

[21] M Kanagawa, BK Sriperumbudur, and K Fukumizu. Convergence Guarantees for Kernel-Based Quadrature Rules in Misspecified Settings. In Advances in Neural Information Processing Systems, 2016.

[22] T Karvonen and S Särkkä. Fully Symmetric Kernel Quadrature. arXiv:1703.06359, 2017.

[23] MC Kennedy and A O'Hagan. Bayesian Calibration of Computer Models. Journal of the Royal Statistical Society: Series B, 63(3):425–464, 2001.

[24] AWC Lee, A Crozier, ER Hyde, P Lamata, M Truong, M Sohal, T Jackson, JM Behar, S Claridge, A Shetty, E Sammut, G Plank, CA Rinaldi, and S Niederer. Biophysical Modeling to Determine the Optimization of Left Ventricular Pacing Site and AV/VV Delays in the Acute and Chronic Phase of Cardiac Resynchronization Therapy. Journal of Cardiovascular Electrophysiology, 28(2):208–215, 2016.

[25] GR Mirams, P Pathmanathan, RA Gray, P Challenor, and RH Clayton. White Paper: Uncertainty and Variability in Computational and Mathematical Models of Cardiac Physiology. The Journal of Physiology, 594(23):6833–6847, 2016.

[26] E Novak and H Woźniakowski. Tractability of Multivariate Problems, Volume II: Standard Information for Functionals. EMS Tracts in Mathematics 12, 2010.

[27] A O'Hagan. Monte Carlo is Fundamentally Unsound. Journal of the Royal Statistical Society, Series D, 36(2/3):247–249, 1987.

[28] A O'Hagan. Bayes–Hermite Quadrature. Journal of Statistical Planning and Inference, 29(3):245–260, 1991.

[29] M Osborne, R Garnett, S Roberts, C Hart, S Aigrain, and N Gibson. Bayesian Quadrature for Ratios.
In Artificial Intelligence and Statistics, pages 832–840, 2012.

[30] MA Osborne, DK Duvenaud, R Garnett, CE Rasmussen, SJ Roberts, and Z Ghahramani. Active Learning of Model Evidence Using Bayesian Quadrature. In Advances in Neural Information Processing Systems, 2012.

[31] CE Rasmussen and CKI Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[32] C Robert and G Casella. Monte Carlo Statistical Methods. Springer Science & Business Media, 2013.

[33] S Särkkä, J Hartikainen, L Svensson, and F Sandblom. On the Relation Between Gaussian Process Quadratures and Sigma-Point Methods. Journal of Advances in Information Fusion, 11(1):31–46, 2016.

[34] J Sethuraman. A Constructive Definition of Dirichlet Priors. Statistica Sinica, 4(2):639–650, 1994.

[35] A Smola, A Gretton, L Song, and B Schölkopf. A Hilbert Space Embedding for Distributions. Algorithmic Learning Theory, Lecture Notes in Computer Science, 4754:13–31, 2007.

[36] R von Mises. Mathematical Theory of Probability and Statistics. Academic Press, London, 1974.

[37] MP Wand and MC Jones. Kernel Smoothing. CRC Press, 1994.