{"title": "Variational Bayesian Optimal Experimental Design", "book": "Advances in Neural Information Processing Systems", "page_first": 14036, "page_last": 14047, "abstract": "Bayesian optimal experimental design (BOED) is a principled framework for making efficient use of limited experimental resources. Unfortunately, its applicability is hampered by the difficulty of obtaining accurate estimates of the expected information gain (EIG) of an experiment. To address this, we introduce several classes of fast EIG estimators by building on ideas from amortized variational inference. We show theoretically and empirically that these estimators can provide significant gains in speed and accuracy over previous approaches. We further demonstrate the practicality of our approach on a number of end-to-end experiments.", "full_text": "Variational Bayesian Optimal Experimental Design\n\nAdam Foster\u2020\u2217 Martin Jankowiak\u2021 Eli Bingham\u2021 Paul Horsfall\u2021\n\nYee Whye Teh\u2020 Tom Rainforth\u2020 Noah Goodman\u2021\u00a7\n\u2020Department of Statistics, University of Oxford, Oxford, UK\n\n\u2021Uber AI Labs, Uber Technologies Inc., San Francisco, CA, USA\n\n\u00a7Stanford University, Stanford, CA, USA\n\nadam.foster@stats.ox.ac.uk\n\nAbstract\n\nBayesian optimal experimental design (BOED) is a principled framework for mak-\ning ef\ufb01cient use of limited experimental resources. Unfortunately, its applicability\nis hampered by the dif\ufb01culty of obtaining accurate estimates of the expected infor-\nmation gain (EIG) of an experiment. To address this, we introduce several classes\nof fast EIG estimators by building on ideas from amortized variational inference.\nWe show theoretically and empirically that these estimators can provide signi\ufb01cant\ngains in speed and accuracy over previous approaches. 
We further demonstrate the practicality of our approach on a number of end-to-end experiments.

1 Introduction

Tasks as seemingly diverse as designing a study to elucidate human cognition, selecting the next query point in an active learning loop, and designing online feedback surveys all constitute the same underlying problem: designing an experiment to maximize the information gathered. Bayesian optimal experimental design (BOED) forms a powerful mathematical abstraction for tackling such problems [8, 23, 37, 43] and has been successfully applied in numerous settings, including psychology [30], Bayesian optimization [16], active learning [15], bioinformatics [42], and neuroscience [38].
In the BOED framework, we construct a predictive model p(y|θ, d) for possible experimental outcomes y, given a design d and a particular value of the parameters of interest θ. We then choose the design that optimizes the expected information gain (EIG) in θ from running the experiment,

EIG(d) ≜ E_{p(y|d)}[ H[p(θ)] − H[p(θ|y, d)] ],   (1)

where H[·] represents the entropy and p(θ|y, d) ∝ p(θ)p(y|θ, d) is the posterior resulting from running the experiment with design d and observing outcome y. In other words, we seek the design that, in expectation over possible experimental outcomes, most reduces the entropy of the posterior over our target latent variables. If the predictive model is correct, this forms a design strategy that is (one-step) optimal from an information-theoretic viewpoint [24, 37].
The BOED framework is particularly powerful in sequential contexts, where it allows the results of previous experiments to be used in guiding the designs for future experiments.
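To make (1) concrete, here is a tiny worked example of our own (it is not one of the paper's benchmarks): the parameter of interest is a success probability with a uniform prior, and the design d is simply how many binary trials to run. Discretizing θ on a grid makes both entropies in (1) exactly computable:

```python
# Exact EIG via eq. (1) for a toy Bernoulli-trials design problem
# (illustrative example, assumed for this sketch; not from the paper).
import numpy as np
from math import comb

def eig(d, grid=200):
    theta = (np.arange(grid) + 0.5) / grid          # discretized uniform prior on (0, 1)
    prior = np.full(grid, 1.0 / grid)
    y = np.arange(d + 1)                            # outcome: number of successes in d trials
    binom = np.array([comb(d, k) for k in y])
    lik = binom * theta[:, None] ** y * (1.0 - theta[:, None]) ** (d - y)
    joint = prior[:, None] * lik                    # p(theta, y | d)
    marg = joint.sum(axis=0)                        # p(y | d)
    post = joint / marg                             # p(theta | y, d), one column per y
    H = lambda p: -(p * np.log(p)).sum(axis=0)      # (discrete) entropy
    # EIG(d) = H[p(theta)] - E_{p(y|d)} H[p(theta|y,d)], exactly eq. (1)
    return float(H(prior) - (marg * H(post)).sum())

gains = [eig(d) for d in (1, 2, 5, 10)]             # more trials, more information
print(gains)
```

Here conjugacy makes everything tractable; the estimators introduced in Section 3 target the common case where neither p(θ|y, d) nor p(y|d) is available in closed form.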
For example, as we ask a participant a series of questions in a psychology trial, we can use the information gathered from previous responses to ask more pertinent questions in the future, which will, in turn, return more information. This ability to design experiments that are self-adaptive can substantially increase their efficiency: fewer iterations are required to uncover the same level of information.
In practice, however, the BOED approach is often hampered by the difficulty of obtaining fast and high-quality estimates of the EIG: due to the intractability of the posterior p(θ|y, d), it constitutes a nested expectation problem and so conventional Monte Carlo (MC) estimation methods cannot be applied [33]. Moreover, existing methods for tackling nested expectations have, in general, far slower convergence rates than those for conventional expectations [22, 30, 32]. For example, nested MC (NMC) can only achieve, at best, a rate of O(T^{-1/3}) in the total computational cost T [33], compared with O(T^{-1/2}) for conventional MC.
To address this, we propose a variational BOED approach that sidesteps the double intractability of the EIG in a principled manner and yields estimators with convergence rates in line with those for conventional estimation problems. To this end, we introduce four efficient and widely applicable variational estimators for the EIG. The different methods each present distinct advantages.

* Part of this work was completed by AF during an internship with Uber AI Labs.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
For\nexample, two allow training with implicit likelihood models, while one allows for asymptotic\nconsistency even when the variational family does not contain the target distribution.\nWe theoretically con\ufb01rm the advantages of our estimators, showing that they all have a convergence\nrate of O(T \u22121/2) when the variational family contains the target distribution. We further verify their\npractical utility using a number of experiment design problems inspired by applications from science\nand industry, showing that they provide signi\ufb01cant empirical gains in EIG estimation over previous\nmethods and that these gains lead, in turn, to improved end-to-end performance.\nTo maximize the space of potential applications and users for our estimators, we provide2 a general-\npurpose implementation of them in the probabilistic programming system Pyro [5], exploiting Pyro\u2019s\n\ufb01rst-class support for neural networks and variational methods.\n\n2 Background\n\nThe BOED framework is a model-based approach for choosing an experiment design d in a manner\nthat optimizes the information gained about some parameters of interest \u03b8 from the outcome y of the\nexperiment. For instance, we may wish to choose the question d in a psychology trial to maximize\nthe information gained about an underlying psychological property of the participant \u03b8 from their\nanswer y to the question. In general, we adopt a Bayesian modelling framework with a prior p(\u03b8)\nand a predictive model p(y|\u03b8, d). The information gained about \u03b8 from running experiment d and\nobserving y is the reduction in entropy from the prior to the posterior:\nIG(y, d) = H[p(\u03b8)] \u2212 H[p(\u03b8|y, d)] .\n\n(2)\nAt the point of choosing d, however, we are uncertain about the outcome. 
Thus, in order to define a metric to assess the utility of the design d, we take the expectation of IG(y, d) under the marginal distribution over outcomes p(y|d) = E_{p(θ)}[p(y|θ, d)] as per (1). We can further rearrange this as

EIG(d) = E_{p(y,θ|d)}[ log ( p(θ|y, d) / p(θ) ) ] = E_{p(y,θ|d)}[ log ( p(y, θ|d) / (p(θ)p(y|d)) ) ] = E_{p(y,θ|d)}[ log ( p(y|θ, d) / p(y|d) ) ],   (3)

with the result that the EIG can also be interpreted as the mutual information between θ and y given d, or the epistemic uncertainty in y averaged over the prior p(θ). The Bayesian optimal design is defined as d* ≜ arg max_{d∈D} EIG(d), where D is the set of permissible designs.
Computing the EIG is challenging since neither p(θ|y, d) nor p(y|d) can, in general, be found in closed form. Consequently, the integrand is intractable and conventional MC methods are not applicable. One common way of getting around this is to employ a nested MC (NMC) estimator [30, 43]

μ̂_NMC(d) ≜ (1/N) Σ_{n=1}^{N} log [ p(y_n|θ_{n,0}, d) / ( (1/M) Σ_{m=1}^{M} p(y_n|θ_{n,m}, d) ) ],  where θ_{n,m} ~ i.i.d. p(θ),  y_n ~ p(y|θ = θ_{n,0}, d).   (4)

Rainforth et al. [33] showed that this estimator, which has a total computational cost T = O(NM), is consistent in the limit N, M → ∞ with RMSE convergence rate O(N^{-1/2} + M^{-1}), and that it is asymptotically optimal to set M ∝ √N, yielding an overall rate of O(T^{-1/3}).
Given a base EIG estimator, a variety of different methods can be used for the subsequent optimization over designs, including some specifically developed for BOED [1, 29, 32].
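As a sketch of how (4) behaves, the following applies NMC to a linear-Gaussian model (our toy example, assumed for illustration and chosen because EIG(d) = 0.5 log(1 + d^2) is available in closed form):

```python
# Nested Monte Carlo EIG estimate, eq. (4), on an illustrative toy model:
#   theta ~ N(0, 1),  y | theta, d ~ N(d * theta, 1),
# for which EIG(d) = 0.5 * log(1 + d^2) analytically.
import numpy as np

def log_lik(y, theta, d):
    # log N(y; d * theta, 1)
    return -0.5 * np.log(2 * np.pi) - 0.5 * (y - d * theta) ** 2

def eig_nmc(d, N=4000, M=1000, seed=0):
    rng = np.random.default_rng(seed)
    theta0 = rng.standard_normal(N)              # outer samples theta_{n,0} ~ p(theta)
    y = d * theta0 + rng.standard_normal(N)      # y_n ~ p(y | theta_{n,0}, d)
    inner = rng.standard_normal((N, M))          # inner samples theta_{n,m} ~ p(theta)
    # log of the inner MC estimate of p(y_n | d)
    log_marg = np.log(np.exp(log_lik(y[:, None], inner, d)).mean(axis=1))
    return np.mean(log_lik(y, theta0, d) - log_marg)

print(eig_nmc(1.0))   # analytic EIG(1) = 0.5 * log(2) ~ 0.347
```

Even in this easy setting, the inner sum costs M likelihood evaluations per outer sample, giving the O(NM) total cost discussed above.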
In our experiments, we\n\n2Implementations of our methods are available at http://docs.pyro.ai/en/stable/contrib.oed.html.\nTo reproduce the results in this paper, see https://github.com/ae-foster/pyro/tree/vboed-reproduce.\n\n2\n\n\fwill adopt Bayesian optimization [39], due to its sample ef\ufb01ciency, robustness to multi-modality, and\nability to deal naturally with noisy objective evaluations. However, we emphasize that our focus is on\nthe base EIG estimator and that our estimators can be used more generally with different optimizers.\nThe static design setting we have implicitly assumed thus far in our discussion can be generalized\nto sequential contexts, in which we design T experiments d1, ..., dT with outcomes y1, ..., yT . We\nassume experiment outcomes are conditionally independent given the latent variables and designs, i.e.\n\np(y1:T , \u03b8|d1:T ) = p(\u03b8)\n\np(yt|\u03b8, dt).\n\n(5)\nHaving conducted experiments 1, ..., t \u2212 1, we can design dt by incorporating data in the standard\nBayesian fashion: at experiment iteration t, we replace the prior p (\u03b8) in (3) with p (\u03b8|d1:t\u22121, y1:t\u22121),\nthe posterior conditional on the \ufb01rst t \u2212 1 designs and outcomes. We can thus conduct an adaptive\nsequential experiment in which we optimize the choice of the design dt at each iteration.\n\nt=1\n\nT(cid:89)\n\n3 Variational Estimators\n\nThough consistent, the convergence rate of the NMC estimator is prohibitively slow for many practical\nproblems. As such, EIG estimation often becomes the bottleneck for BOED, particularly in sequential\nexperiments where the BOED calculations must be fast enough to operate in real-time.\nIn this section we show how ideas from amortized variational inference [10, 17, 34, 40] can be used\nto sidestep the double intractability of the EIG, yielding estimators with much faster convergence\nrates thereby alleviating the EIG bottleneck. 
A key insight for realizing why such fundamental gains\ncan be made is that the NMC estimator is inef\ufb01cient because a separate estimate of the integrand\nin (3) is made for each yn. The variational approaches we introduce instead look to directly learn a\nfunctional approximation\u2014for example, an approximation of y (cid:55)\u2192 p(y|d)\u2014and then evaluate this\napproximation at multiple points to estimate the integral, thereby allowing information to be shared\nacross different values of y. If M evaluations are made in learning the approximation, the total\ncomputational cost is now T = O(N + M ), yielding substantially improved convergence rates.\n\nVariational posterior \u02c6\u00b5post Our \ufb01rst approach, which we refer to as the variational posterior\nestimator \u02c6\u00b5post, is based on learning an amortized approximation qp(\u03b8|y, d) to the posterior p(\u03b8|y, d)\nand then using this to estimate the EIG:\n\nEIG(d) \u2248 Lpost(d) (cid:44) Ep(y,\u03b8|d)\n\nlog\n\nqp(\u03b8|y, d)\n\np(\u03b8)\n\n\u2248 \u02c6\u00b5post(d) (cid:44) 1\nN\n\nqp(\u03b8n|yn, d)\n\np(\u03b8n)\n\nlog\n\n,\n\n(6)\n\nN(cid:88)\n\nn=1\n\ni.i.d.\u223c p(y, \u03b8|d) and \u02c6\u00b5post(d) is a MC estimator of Lpost(d). We draw samples of p(y, \u03b8|d)\nwhere yn, \u03b8n\nby sampling \u03b8 \u223c p(\u03b8) and then y|\u03b8 \u223c p(y|\u03b8, d). We can think of this approach as amortizing the\ncost of the inner expectation, instead of running inference separately for each y.\nTo learn a suitable qp(\u03b8|y, d), we show in Appendix A that Lpost(d) forms a variational lower bound\nEIG(d) \u2265 Lpost(d) that is tight if and only if qp(\u03b8|y, d) = p(\u03b8|y, d). 
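A minimal sketch of maximizing this bound on a linear-Gaussian toy model (assumed for illustration; the paper's implementation uses Pyro's autodiff and variational machinery, whereas here the gradients of a Gaussian variational family are written out by hand):

```python
# Variational posterior (lower bound) EIG estimation on a toy model:
#   theta ~ N(0, 1), y | theta ~ N(d * theta, 1); analytic EIG(1) = 0.5 * log(2).
# Variational family (an assumption of this sketch): q(theta | y) = N(a * y + b, s^2).
import numpy as np

rng = np.random.default_rng(1)
d = 1.0
a, b, log_s = 0.0, 0.0, 0.0                  # variational parameters phi

lr = 0.02
for _ in range(2000):
    theta = rng.standard_normal(64)          # theta ~ p(theta)
    y = d * theta + rng.standard_normal(64)  # y ~ p(y | theta, d)
    s2 = np.exp(2 * log_s)
    r = theta - (a * y + b)                  # residual under the variational mean
    # stochastic gradient ascent on E[log q(theta|y,d,phi)];
    # the E[log p(theta)] term does not depend on phi
    a += lr * np.mean(r * y) / s2
    b += lr * np.mean(r) / s2
    log_s += lr * np.mean(r ** 2 / s2 - 1.0)

# Monte Carlo estimate of the bound L_post(d) with fresh samples
theta = rng.standard_normal(20000)
y = d * theta + rng.standard_normal(20000)
s2 = np.exp(2 * log_s)
log_q = -0.5 * np.log(2 * np.pi * s2) - (theta - (a * y + b)) ** 2 / (2 * s2)
log_p = -0.5 * np.log(2 * np.pi) - theta ** 2 / 2
est = np.mean(log_q - log_p)
print(est)
```

Because the Gaussian family contains the true posterior here, the maximized bound recovers EIG(1) = 0.5 log 2; with a misspecified family it would remain a strict lower bound.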
Barber and Agakov [3] used\nthis bound to estimate mutual information in the context of transmission over noisy channels, but the\nconnection to experiment design has not previously been made.\nThis result means we can learn qp(\u03b8|y, d) by introducing a family of variational distributions\nqp(\u03b8|y, d, \u03c6) parameterized by \u03c6 and then maximizing the bound with respect to \u03c6:\n\n\u03c6\u2217 = arg max\n\n\u03c6\n\nEp(y,\u03b8|d)\n\nlog\n\nqp(\u03b8|y, d, \u03c6)\n\np(\u03b8)\n\n,\n\nEIG(d) \u2248 Lpost(d; \u03c6\u2217).\n\n(7)\n\nProvided that we can generate samples from the model, this maximization can be performed using\nstochastic gradient methods [35] and the unbiased gradient estimator\n\u2207\u03c6 log qp(\u03b8i|yi, d, \u03c6) where\n\n(8)\nand we note that no reparameterization is required as p(y, \u03b8|d) is independent of \u03c6. After K\ngradient steps we obtain variational parameters \u03c6K that approximate \u03c6\u2217, which we use to compute\n\n\u2207\u03c6Lpost(d; \u03c6) \u2248 1\n\ni.i.d.\u223c p(y, \u03b8|d),\n\n(cid:88)S\n\nS\n\ni=1\n\nyi, \u03b8i\n\n3\n\n(cid:20)\n\n(cid:20)\n\n(cid:21)\n\n(cid:21)\n\n\fa corresponding EIG estimator by constructing a MC estimator for Lpost(d; \u03c6) as per (6) with\nqp(\u03b8n|yn, d) = qp(\u03b8n|yn, d, \u03c6K). Interestingly, the tightness of Lpost(d) turns out to be equal to\nthe expected forward KL divergence3 Ep(y|d) [KL (p(\u03b8|y, d)||qp(\u03b8|y, d, \u03c6))] so we can view this\napproach as learning an amortized proposal by minimizing this expected KL divergence.\n\nVariational marginal \u02c6\u00b5marg\nIn some scenarios, \u03b8 may be high-dimensional, making it dif\ufb01cult to\ntrain a good variational posterior approximation. An alternative approach that can be attractive in\nsuch cases is to instead learn an approximation qm(y|d) to the marginal density p(y|d) and substitute\nthis into the \ufb01nal form of the EIG in (3). 
As shown in Appendix A, this yields an upper bound\n\nEIG(d) \u2264 Umarg(d) (cid:44) Ep(y,\u03b8|d)\n\nlog\n\np(y|\u03b8, d)\nqm(y|d)\n\n\u2248 \u02c6\u00b5marg(d) (cid:44) 1\nN\n\np(yn|\u03b8n, d)\nqm(yn|d)\n\nlog\n\n,\n\n(9)\n\n(cid:20)\n\n(cid:21)\n\nN(cid:88)\n\nn=1\n\ni.i.d.\u223c p(y, \u03b8|d) and the bound is tight when qm(y|d) = p(y|d). Analogously to\nwhere again yn, \u03b8n\n\u02c6\u00b5post, we can learn qm(y|d) by introducing a variational family qm(y|d, \u03c6) and then performing\nstochastic gradient descent to minimize Umarg(d, \u03c6). As with \u02c6\u00b5post, this bound was studied in a mutual\ninformation context [31], but it has not been utilized for BOED before.\n\n(cid:96)=1\n\n1\nL\n\n(cid:35)\n\n(cid:34)\n\nEIG(d) \u2264 E\n\n(cid:44) UVNMC(d, L)\n\nlog p(y|\u03b80, d) \u2212 log\n\nVariational NMC \u02c6\u00b5VNMC As we will show in Section 4, \u02c6\u00b5post and \u02c6\u00b5marg can provide substantially\nfaster convergence rates than NMC. However, this comes at the cost of converging towards a biased\nestimate if the variational family does not contain the target distribution. To address this, we propose\nanother EIG estimator, \u02c6\u00b5VNMC, which allows one to trade-off resources between the fast learning of a\nbiased estimator permitted by variational approaches, and the ability of NMC to eliminate this bias.4\nWe can think of the NMC estimator as approximating p(y|d) using M samples from the prior. At a\nhigh-level, \u02c6\u00b5VNMC is based around learning a proposal qv(\u03b8|y, d) and then using samples from this\nproposal to make an importance sampling estimate of p(y|d), potentially requiring far fewer samples\nthan NMC. 
Formally, it is based around a bound that can be arbitrarily tightened, namely\n\nL(cid:88)\nwhere the expectation is taken over y, \u03b80:L \u223c p(y, \u03b80|d)(cid:81)L\n\np(y, \u03b8(cid:96)|d)\n(10)\nqv(\u03b8(cid:96)|y, d)\n(cid:96)=1 qv(\u03b8(cid:96)|y, d), which corresponds to one\nsample y, \u03b80 from the model and L samples from the approximate posterior conditioned on y. To\nthe best of our knowledge, this bound has not previously been studied in the literature. As with \u02c6\u00b5post\nand \u02c6\u00b5marg, we can minimize this bound to train a variational approximation qv(\u03b8|y, d, \u03c6). Important\nfeatures of UVNMC(d, L) are summarized in the following lemma; see Appendix A for the proof.\nLemma 1. For any given model p(\u03b8)p(y|\u03b8, d) and valid qv(\u03b8|y, d),\n1. EIG(d) = limL\u2192\u221e UVNMC(d, L) \u2264 UVNMC(d, L2) \u2264 UVNMC(d, L1) \u2200L2 \u2265 L1 \u2265 1,\n(cid:17)(cid:105)\n2. UVNMC(d, L) = EIG(d) \u2200L \u2265 1 if\n3. UVNMC(d, L)\u2212EIG(d) =Ep(y|d)\nk(cid:54)=(cid:96) qv(\u03b8k|y, d)\nKL\nLike the previous bounds, the VNMC bound is tight when qv(\u03b8|y, d) = p(\u03b8|y, d). Importantly, the\nbound is also tight as L \u2192 \u221e, even for imperfect qv. This means we can obtain asymptotically\nunbiased EIG estimates even when the true posterior is not contained in the variational family.\nSpeci\ufb01cally, we \ufb01rst train \u03c6 using K steps of stochastic gradient on UVNMC(d, L) with some \ufb01xed\nL. To form a \ufb01nal EIG estimator, however, we use a MC estimator of UVNMC(d, M ) where typically\nM (cid:29) L. 
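This two-stage recipe can be sketched on the same kind of linear-Gaussian toy model (illustrative assumptions: here the proposal is fitted by least squares as a cheap stand-in for stochastic-gradient training on the VNMC bound, and the variational family happens to contain the true posterior):

```python
# Variational NMC sketch: learn a proposal q_v(theta | y), then use it as the
# inner importance distribution in an NMC-style estimator, as in eq. (11).
# Toy model (assumed): theta ~ N(0,1), y | theta ~ N(d*theta, 1); EIG(1) = 0.5*log(2).
import numpy as np

rng = np.random.default_rng(2)
d = 1.0
norm_logpdf = lambda x, mu, var: -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

# Stage 1 (stand-in for variational training): fit q_v(theta|y) = N(a*y, s2)
# by least squares on joint samples from the model.
theta = rng.standard_normal(5000)
y = d * theta + rng.standard_normal(5000)
a = np.sum(theta * y) / np.sum(y * y)
s2 = np.mean((theta - a * y) ** 2)

# Stage 2: importance-weighted refinement with a modest number of inner samples M.
N, M = 2000, 25
theta0 = rng.standard_normal(N)
y = d * theta0 + rng.standard_normal(N)
th = a * y[:, None] + np.sqrt(s2) * rng.standard_normal((N, M))  # theta_{n,m} ~ q_v(.|y_n)
log_w = (norm_logpdf(th, 0.0, 1.0)                     # log p(theta_{n,m})
         + norm_logpdf(y[:, None], d * th, 1.0)        # + log p(y_n | theta_{n,m}, d)
         - norm_logpdf(th, a * y[:, None], s2))        # - log q_v(theta_{n,m} | y_n)
est = np.mean(norm_logpdf(y, d * theta0, 1.0) - np.log(np.exp(log_w).mean(axis=1)))
print(est)
```

With the prior as proposal (standard NMC), M = 25 inner samples would leave a noticeable bias; the learned proposal makes the same M nearly unbiased, which is the point of the first stage.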
This \ufb01nal estimator is a NMC estimator that is consistent as N, M \u2192 \u221e with \u03c6K \ufb01xed\n\n(cid:16)(cid:81)L\n(cid:96)=1 qv(\u03b8(cid:96)|y, d)(cid:12)(cid:12)(cid:12)(cid:12) 1\n\n(cid:80)L\n(cid:96)=1 p(\u03b8(cid:96)|y, d)(cid:81)\n\nqv(\u03b8|y, d) = p(\u03b8|y, d) \u2200y, \u03b8,\n\n(cid:104)\n\nL\n\nM(cid:88)\n\np(yn, \u03b8n,m|d)\n\n(cid:33)\n\nn=1\n\n\u02c6\u00b5VNMC(d) (cid:44) 1\nqv(\u03b8n,m|yn, d, \u03c6K)\nN\ni.i.d.\u223c p(\u03b8), yn \u223c p(y|\u03b8 = \u03b8n,0, d) and \u03b8n,m \u223c qv(\u03b8|y = yn, d, \u03c6K).\n\nlog p(yn|\u03b8n,0, d) \u2212 log\n\nwhere \u03b8n,0\nIn practice,\nperformance is greatly enhanced when the proposal qv is a good, if inexact, approximation to the\nposterior. This signi\ufb01cantly improves upon traditional \u02c6\u00b5NMC, which sets qv(\u03b8|y, d) = p(\u03b8) in (11).\n3See Appendix A for a proof. A comparison with the reverse KL divergence can be found in Appendix G.\n4In Appendix F, we describe a method using qm(y|d) as a control variate that can also eliminate this bias\n\n1\nM\n\nand lower the variance of NMC, requiring additional assumptions about the model and variational family.\n\n(11)\n\nm=1\n\n(cid:32)\n\nN(cid:88)\n\n4\n\n\fImplicit likelihood and \u02c6\u00b5m+(cid:96) So far we have assumed that we can evaluate p(y|\u03b8, d) pointwise.\nHowever, many models of interest have implicit likelihoods from which we can draw samples, but\nnot evaluate directly. For example, models with nuisance latent variables \u03c8 (such as a random effect\nmodels) are implicit likelihood models because p(y|\u03b8, d) = Ep(\u03c8|\u03b8) [p(y|\u03b8, \u03c8, d)] is intractable, but\ncan still be straightforwardly sampled from.\nIn this setting, \u02c6\u00b5post is applicable without modi\ufb01cation because it only requires samples from p(y|\u03b8, d)\nand not evaluations of this density. 
Although \u02c6\u00b5marg is not directly applicable in this setting, it can be\nmodi\ufb01ed to accommodate implicit likelihoods. Speci\ufb01cally, we can utilize two approximate densities:\nqm(y|d) for the marginal and q(cid:96)(y|\u03b8, d) for the likelihood. We then form the approximation\nq(cid:96)(yn|\u03b8n, d)\nqm(yn|d)\n\nEIG(d) \u2248 Im+(cid:96)(d) (cid:44) Ep(y,\u03b8|d)\n\n\u2248 \u02c6\u00b5m+(cid:96)(d) (cid:44) 1\nN\n\nq(cid:96)(y|\u03b8, d)\nqm(y|d)\n\nN(cid:88)\n\n(12)\n\n(cid:20)\n\n(cid:21)\n\nlog\n\nlog\n\nn=1\n\n.\n\nUnlike the previous three cases, Im+(cid:96)(d) is not a bound on EIG(d), meaning it is not immediately\nclear how to train qm(y|d) and q(cid:96)(y|\u03b8, d) to achieve an accurate EIG estimator. The following lemma\nshows that we can bound the EIG estimation error of Im+(cid:96). The proof is in Appendix A.\nLemma 2. For any given model p(\u03b8)p(y|\u03b8, d) and valid qm(y|d) and q(cid:96)(y|\u03b8, d), we have\n\n|Im+(cid:96)(d) \u2212 EIG(d)| \u2264 \u2212Ep(y,\u03b8|d)[log qm(y|d) + log q(cid:96)(y|\u03b8, d)] + C,\n\n(13)\nwhere C = \u2212H[p(y|d)] \u2212 Ep(\u03b8) [H(p(y|\u03b8, d)] does not depend on qm or q(cid:96). Further, the RHS of\n(13) is 0 if and only if qm(y|d) = p(y|d) and q(cid:96)(y|\u03b8, d) = p(y|\u03b8, d) for almost all y, \u03b8.\nThis lemma implies that we can learn qm(y|d) and q(cid:96)(y|\u03b8, d) by maximizing Ep(y,\u03b8|d)[log qm(y|d) +\nlog q(cid:96)(y|\u03b8, d)] using stochastic gradient ascent, and substituting these learned approximations into\n(12) for the \ufb01nal EIG estimator. To the best of our knowledge, this approach has not previously been\nconsidered in the literature. We note that, in general, qm and q(cid:96) are learned separately and there need\nnot be any weight sharing between them. 
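A sketch of this estimator on a toy random-effects model (our illustrative assumption: a nuisance variable ψ enters additively, so p(y|θ, d) is treated as implicit and both approximate densities are fitted from joint samples only):

```python
# mu_m+l sketch, eq. (12): fit q_l(y|theta) and q_m(y) from samples, no likelihood
# evaluations. Toy model (assumed): theta ~ N(0,1), psi ~ N(0, 0.25),
# y = theta + psi + N(0, 0.25); analytic EIG for theta is 0.5 * log(3).
import numpy as np

rng = np.random.default_rng(3)
norm_logpdf = lambda x, mu, var: -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

def sample(n):
    theta = rng.standard_normal(n)            # parameter of interest
    psi = 0.5 * rng.standard_normal(n)        # nuisance latent: makes p(y|theta) implicit
    y = theta + psi + 0.5 * rng.standard_normal(n)
    return theta, y

# Fit the two approximate densities (Gaussian families, an assumption of this sketch)
theta, y = sample(20000)
c = np.sum(theta * y) / np.sum(theta * theta)   # q_l(y|theta) = N(c*theta, v), least squares
v = np.mean((y - c * theta) ** 2)
m, t2 = np.mean(y), np.var(y)                   # q_m(y) = N(m, t2), moment matching

# Evaluate mu_m+l on fresh samples
theta, y = sample(20000)
est = np.mean(norm_logpdf(y, c * theta, v) - norm_logpdf(y, m, t2))
print(est)
```

Both fits use only samples from the model, as the implicit-likelihood setting requires; in the spirit of Lemma 2, maximizing E[log q_m(y|d) + log q_l(y|θ, d)] tightens the bound on the EIG estimation error.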
See Appendix A.4 for a discussion of the case when we\ncouple qm and q(cid:96) so that qm(y|d) = Ep(\u03b8)[q(cid:96)(y|\u03b8, d)].\nUsing estimators for sequential BOED In sequential settings, we also need to consider the im-\nplications of replacing p(\u03b8) in the EIG with p(\u03b8|d1:t\u22121, y1:t\u22121). At \ufb01rst sight, it appears that,\nwhile \u02c6\u00b5marg and \u02c6\u00b5m+(cid:96) only require samples from p(\u03b8|d1:t\u22121, y1:t\u22121), \u02c6\u00b5post and \u02c6\u00b5VNMC also re-\nquire its density to be evaluated, a potentially severe limitation. Fortunately, we can, in fact,\navoid evaluating this posterior density. We note that, from (5), we have p(\u03b8|y1:t\u22121, d1:t\u22121) =\n\ni=1 p(yi|\u03b8, di)/p(y1:t\u22121|d1:t\u22121). Substituting this into the integrand of (6) gives\n\nLpost(dt) = Ep(\u03b8|y1:t\u22121,d1:t\u22121)p(yt|\u03b8,dt)\n\nlog\n\n+ log p(y1:t\u22121|d1:t\u22121)\n\n(14)\n\ni=1 p(yi|\u03b8, di) can be evaluated exactly and the additive constant log p(y1:t\u22121|d1:t\u22121)\ndoes not depend on the new design dt, \u03b8, or any of the variational parameters, and so can be safely\nignored. Making the same substitution in (11) shows that we can also estimate UVNMC(dt, L) up\nto a constant, which can then be similarly ignored. As such, any inference scheme for sampling\np(\u03b8|d1:t\u22121, y1:t\u22121), approximate or exact, is compatible with all our approaches.\n\nSelecting an estimator Having proposed\nfour estimators, we brie\ufb02y discuss how to\nchoose between them in practice. For refer-\nence, a summary of our estimators is given\nin Table 1, along with several baseline ap-\nproaches. First, \u02c6\u00b5marg and \u02c6\u00b5m+(cid:96) rely on\napproximating a distribution over y; \u02c6\u00b5post\nand \u02c6\u00b5VNMC approximate distributions over\n\u03b8. 
Table 1: Summary of EIG estimators. Baseline methods are explained in Section 5.

             Implicit   Bound   Consistent   Eq.
Ours
  μ̂post      ✓          Lower   ✗            (6)
  μ̂marg      ✗          Upper   ✗            (9)
  μ̂VNMC      ✗          Upper   ✓            (11)
  μ̂m+ℓ       ✓          n/a     ✗            (12)
Baselines
  μ̂NMC       ✗          Upper   ✓            (4)
  μ̂laplace   ✗          n/a     ✗            (75)
  μ̂LFIRE     ✓          n/a     ✗            (76)
  μ̂DV        ✓          Lower   ✗            (77)

We may prefer the former two estimators if dim(y) ≪ dim(θ), as this leaves us with a simpler density estimation problem, and vice versa. Second, μ̂marg and μ̂VNMC require an
With variational EIG approximation B(d) \u2208 {Lpost(d), Umarg(d), UVNMC(d, L), Im+(cid:96)(d)},\noptimal variational parameters \u03c6\u2217, learned variational parameters \u03c6K after K stochastic gradient\niterations, and MC estimator \u02c6\u00b5(d, \u03c6K) we have, by the triangle inequality,\n(cid:125)\n(cid:107)\u02c6\u00b5(d, \u03c6K)\u2212EIG(d)(cid:107)2 \u2264 (cid:107)\u02c6\u00b5(d, \u03c6K)\u2212B(d, \u03c6K)(cid:107)2\n\n(cid:123)(cid:122)\nwhere we have used the notation (cid:107)X(cid:107)2\nBy the weak law of large numbers, term I scales as N\u22121/2 and can thus be arbitrarily reduced\nby taking more MC samples. Provided that our stochastic gradient scheme converges, term II\ncan be reduced by increasing the number of stochastic gradient steps K. Term III, however, is a\nconstant that can only be reduced by expanding the variational family (or increasing L for \u02c6\u00b5VNMC).\nEach approximation B(d) thus converges to a biased estimate of the EIG(d), namely B(d, \u03c6\u2217). As\nestablished by the following Theorem, if we set N \u221d K, the rate of convergence to this biased\nestimate is O(T \u22121/2), where T represents the total computational cost, with T = O(N + K).\nTheorem 1. Let X be a measurable space and \u03a6 be a convex subset of a \ufb01nite dimensional inner\nproduct space. Let X1, X2, ... be i.i.d. random variables taking values in X and f : X \u00d7 \u03a6 \u2192 R be\na measurable function. Let\n\n(cid:44)(cid:112)E [X 2] to denote the L2 norm of a random variable.\n\n+|B(d, \u03c6\u2217)\u2212EIG(d)|\n\n(cid:123)(cid:122)\n\nIII\n\n(cid:124)\n\nI\n\nII\n\n(cid:125)\n\n(cid:88)N\n\n\u00b5(\u03c6) (cid:44) E[f (X1, \u03c6)] \u2248 \u02c6\u00b5N (\u03c6) (cid:44) 1\nN\n\nf (Xn, \u03c6)\n\nn=1\n\nand suppose that sup\u03c6\u2208\u03a6 (cid:107)f (X1, \u03c6)(cid:107)2 < \u221e. Then sup\u03c6\u2208\u03a6 (cid:107)\u02c6\u00b5N (\u03c6) \u2212 \u00b5(\u03c6)(cid:107)2 = O(N\u22121/2). 
Sup-\npose further that Assumption 1 in Appendix B holds and that \u03c6\u2217 is the unique minimizer of \u00b5. After\nK iterations of the Polyak-Ruppert averaged stochastic gradient descent algorithm of [28] with\ngradient estimator \u2207\u03c6f (Xt, \u03c6), we have (cid:107)\u00b5(\u03c6K) \u2212 \u00b5(\u03c6\u2217)(cid:107)2 = O(K\u22121/2) and, combining with the\n\ufb01rst result,\n\n(cid:107)\u02c6\u00b5N (\u03c6K) \u2212 \u00b5(\u03c6\u2217)(cid:107)2 = O(N\u22121/2 + K\u22121/2) = O(T \u22121/2) if N \u221d K.\n\nThe proof relies on standard results from MC and stochastic optimization theory; see Appendix B.\nWe note that the assumptions required for the latter, though standard in the literature, are strong. In\npractice, \u03c6 can converge to a local optimum \u03c6\u2020, rather than the global optimum \u03c6\u2217, introducing an\n\nadditional asymptotic bias(cid:12)(cid:12)B(d, \u03c6\u2020) \u2212 B(d, \u03c6\u2217)(cid:12)(cid:12) into term III.\n\nTheorem 1 can be applied directly to \u02c6\u00b5marg, \u2212\u02c6\u00b5post, and \u02c6\u00b5VNMC (with \ufb01xed M = L), showing that\nthey converge respectively to Umarg(d, \u03c6\u2217), \u2212Lpost(d, \u03c6\u2217), and UVNMC(d, L, \u03c6\u2217) at a rate = O(T \u22121/2)\nif N \u221d K and the assumptions are satis\ufb01ed. For \u02c6\u00b5m+(cid:96), we combine Theorem 1 and Lemma 2 to\nobtain the same O(T \u22121/2) convergence rates; see the supplementary material for further details.\nThe key property of \u02c6\u00b5VNMC is that we need not set M = L and can remove the asymptotic bias by\nincreasing M with N. We begin by training \u03c6 with a \ufb01xed value of L, decreasing the error term\n(cid:107)UVNMC(d, L, \u03c6K)\u2212UVNMC(d, L, \u03c6\u2217)(cid:107)2 at the fast rate O(K\u22121/2) until |UVNMC(d, L, \u03c6\u2217)\u2212EIG(d)|\nconvergence results discussed in Sec. 2, if we set M \u221d \u221a\nbecomes the dominant error term. At this point, we start to increase N, M. 
Using the NMC\nN, then \u02c6\u00b5VNMC converges to EIG(d) at\na rate O((N M )\u22121/3). Note that the total cost of the \u02c6\u00b5VNMC estimator is T = O(KL + N M ),\nwhere typically M (cid:29) L. The \ufb01rst stage, costing KL, is fast variational training of an amortized\nimportance sampling proposal for p(y|d) = Ep(\u03b8)[p(y|\u03b8, d)]. The second stage, costing N M, is\nslower re\ufb01nement to remove the asymptotic bias using the learned proposal in an NMC estimator.\n\n6\n\n\fTable 2: Bias squared and variance from 5 runs, averaged over designs, of EIG estimators applied to\nfour benchmarks. We use - to denote that a method does not apply and \u2217 when it is superseded by\nother methods. Bold indicates the estimator with the lowest empirical mean squared error.\n\nA/B test\n\nPreference\n\nBias2\n\nVar\n\nBias2\n\nVar\n\nMixed effects\nVar\n\nBias2\n\nExtrapolation\nBias2\nVar\n\n1.33\u00d710\u22122 7.15\u00d710\u22123 4.26\u00d710\u22122 8.53\u00d710\u22123 2.34\u00d710\u22123 2.92\u00d710\u22123 1.24\u00d710\u22124 5.16\u00d710\u22125\n\u02c6\u00b5post\n7.45\u00d710\u22122 6.41\u00d710\u22123 1.10\u00d710\u22123 1.99\u00d710\u22123\n\u02c6\u00b5marg\n\u02c6\u00b5VNMC 3.44\u00d710\u22123 3.38\u00d710\u22123 4.17\u00d710\u22123 9.04\u00d710\u22123\n\u02c6\u00b5m+(cid:96)\n3.47\u00d710\u22121 7.60\u00d710\u22122 8.36\u00d710\u22122\n\u02c6\u00b5NMC\n\u02c6\u00b5laplace 1.92\u00d710\u22124 1.47\u00d710\u22123 8.42\u00d710\u22122 9.70\u00d710\u22122\n\u02c6\u00b5LFIRE\n\u02c6\u00b5DV\n\n6.20\u00d710\u22121 1.30\u00d710\u22121 1.41\u00d710\u22122 1.41\u00d710\u22121 6.67\u00d710\u22122\n8.85\u00d710\u22121 9.23\u00d710\u22122 8.07\u00d710\u22123 9.10\u00d710\u22123 5.56\u00d710\u22124 7.84\u00d710\u22126 4.11\u00d710\u22125\n\n3.06\u00d710\u22123 5.94\u00d710\u22125 6.90\u00d710\u22126 
1.84\u00d710\u22125\n\n2.29\u00d7100\n4.34\u00d7100\n\n4.70\u00d7100\n\n-\n-\n\n-\n-\n\n-\n-\n\n-\n-\n-\n\n-\n-\n\n-\n-\n-\n\n-\n-\n\n-\n-\n\n\u2217\n\n\u2217\n\n\u2217\n\n\u2217\n\nOne can think of the standard NMC approach as a special case of \u02c6\u00b5VNMC in which we naively choose\np(\u03b8) as the proposal. That is, standard NMC skips the \ufb01rst stage and hence does not bene\ufb01t from the\nimproved convergence rate of learning an amortized proposal. It typically requires a much higher\ntotal cost to achieve the same accuracy as VNMC.\n\n5 Related work\n\nWe brie\ufb02y discuss alternative approaches to EIG estimation for BOED that will form our baselines for\nempirical comparisons. The Nested Monte Carlo (NMC) baseline was introduced in Sec. 2. Another\nestablished approach is to use a Laplace approximation to the posterior [22, 25]; this approach\nis fast but is limited to continuous variables and can exhibit large bias. Kleinegesse and Gutmann\n[18] recently suggested an implicit likelihood approach based on the Likelihood-Free Inference by\nRatio Estimation (LFIRE) method of Thomas et al. [41]. We also consider a method based on the\nDonsker-Varadhan (DV) representation of the KL divergence [11] as used by Belghazi et al. [4]\nfor mutual information estimation. Though not previously considered in BOED, we include it as\na baseline for illustrative purposes. For a full discussion of the DV bound and a number of other\nvariational bounds used in deep learning, we refer to the recent work of Poole et al. [31]. For further\ndiscussion of related work, see Appendix C.\n\n6 Experiments\n\n6.1 EIG estimation accuracy\n\nWe begin by benchmarking our EIG estimators against the aforementioned baselines. We consider\nfour experiment design scenarios inspired by applications of Bayesian data analysis in science and\nindustry. 
First, A/B testing is used across marketing and design [6, 19] to study population traits. Here, the design is the choice of the A and B group sizes, and the Bayesian model is a Gaussian linear model. Second, revealed preference [36] is used in economics to understand consumer behaviour. We consider an experiment design setting in which we aim to learn the underlying utility function of an economic agent by presenting them with a proposal (such as offering them a price for a commodity) and observing their revealed preference. Third, fixed effects and random effects (nuisance variables) are combined in mixed effects models [14, 20]. We consider an example inspired by item-response theory [13] in psychology. We seek information only about the fixed effects, making this an implicit likelihood problem. Finally, we consider an experiment where labelled data from one region of design space must be used to predict labels in a target region by extrapolation [27]. In summary, we have two models with explicit likelihoods (A/B testing, preference) and two that are implicit (mixed effects, extrapolation). Full details of each model are presented in Appendix D.

For each scenario, we estimated the EIG across a grid of designs with a fixed computational budget for each estimator, and calculated the true EIG analytically or with brute force computation as appropriate; see Table 2 for the results. Whilst the Laplace method, unsurprisingly, performed best for the Gaussian linear model, where its approximation is exact, our methods are otherwise more accurate. All our methods outperformed NMC.

Figure 1: Convergence of RMSE for ˆµpost and ˆµmarg. (a) Convergence in number of MC samples N with a fixed number K of gradient updates of the variational parameters.
(b) Convergence in time when increasing K with N fixed. (c) Convergence in time when setting N = K and increasing both (dashed lines represent theoretical rates). (d) Final RMSE with N + K = 5000 fixed, for different K. Each graph shows the mean, with shading representing ±1 std. err. from 100 trials.

6.2 Convergence rates

We now investigate the empirical convergence characteristics of our estimators. Throughout, we consider a single design point from the A/B test example. We start by examining the convergence of ˆµpost and ˆµmarg as we allocate the computational budget in different ways.

We first consider the convergence in N after a fixed number K of updates to the variational parameters. As shown in Figure 1a, the RMSE initially decreases as we increase N, before plateauing due to the bias in the estimator. We also see that ˆµpost substantially outperforms ˆµmarg. We next consider the convergence as a function of wall-clock time when N is held fixed and we increase K. We see in Figure 1b that, as expected, the errors decrease with time, and that when a small value of N = 5 is taken, we again see a plateauing effect, with the variance of the final MC estimator now becoming the limiting factor. In Figure 1c we take N = K and increase both, obtaining the predicted convergence rate O(T^{-1/2}) (shown by the dashed lines). We conjecture that the better performance of ˆµpost is likely due to θ being lower dimensional (dim = 2) than y (dim = 10). In Figure 1d, we instead fix T = N + K to investigate the optimal trade-off between optimization and MC error: values of K/T between 0.5 and 0.9 appear to give the lowest RMSE.

Finally, we show how ˆµVNMC can improve over NMC by using an improved variational proposal for estimating p(y|d).
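To make the contrast between the two estimators concrete, the following is a minimal NumPy sketch of plain NMC and VNMC on a toy conjugate Gaussian model (our assumption, chosen so the true EIG is available in closed form). For illustration, the VNMC proposal is taken to be the exact posterior, standing in for the learned qv(θ|y, d); this is not the paper's Pyro implementation, just a sketch of the estimators themselves.

```python
import numpy as np

rng = np.random.default_rng(0)

def logsumexp_mean(a):
    """log of the row-wise mean of exp(a), computed stably; a has shape (N, M)."""
    m = a.max(axis=1, keepdims=True)
    return (m + np.log(np.mean(np.exp(a - m), axis=1, keepdims=True)))[:, 0]

def log_lik(y, theta, d):
    # Toy model (an assumption for illustration): y | theta, d ~ N(d * theta, 1)
    return -0.5 * (y - d * theta) ** 2 - 0.5 * np.log(2 * np.pi)

def log_prior(theta):
    # theta ~ N(0, 1)
    return -0.5 * theta ** 2 - 0.5 * np.log(2 * np.pi)

def eig_nmc(d, N, M):
    """Plain NMC: inner samples for the marginal p(y|d) come from the prior."""
    theta0 = rng.standard_normal(N)
    y = d * theta0 + rng.standard_normal(N)
    inner = rng.standard_normal((N, M))                  # theta_{n,m} ~ p(theta)
    log_marg = logsumexp_mean(log_lik(y[:, None], inner, d))
    return np.mean(log_lik(y, theta0, d) - log_marg)

def eig_vnmc(d, N, M):
    """VNMC: inner samples come from a proposal q(theta | y, d). Here we use the
    exact Gaussian posterior as a stand-in for a learned amortized proposal."""
    theta0 = rng.standard_normal(N)
    y = d * theta0 + rng.standard_normal(N)
    post_var = 1.0 / (1.0 + d ** 2)                     # conjugate posterior
    post_mean = post_var * d * y
    inner = post_mean[:, None] + np.sqrt(post_var) * rng.standard_normal((N, M))
    log_q = (-0.5 * (inner - post_mean[:, None]) ** 2 / post_var
             - 0.5 * np.log(2 * np.pi * post_var))
    log_w = log_prior(inner) + log_lik(y[:, None], inner, d) - log_q
    return np.mean(log_lik(y, theta0, d) - logsumexp_mean(log_w))

d_star = 2.0
analytic = 0.5 * np.log(1 + d_star ** 2)  # closed-form EIG for this model
print(f"analytic {analytic:.3f}  "
      f"nmc {eig_nmc(d_star, 1000, 100):.3f}  "
      f"vnmc {eig_vnmc(d_star, 1000, 5):.3f}")
```

With the prior as proposal (plain NMC), the inner marginal estimate is noisy and biased for small M; a good proposal drives the importance weights toward the constant p(y|d), so far fewer inner samples are needed.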
In Figure 2, we plot the EIG estimates obtained by first running K steps of stochastic gradient with L = 1 to learn qv(θ|y, d), before increasing M and N. We see that spending some of our time budget training qv(θ|y, d) leads to noticeable improvements in the estimation, but also that it is important to increase N and M: rather than plateauing like ˆµpost and ˆµmarg, ˆµVNMC continues to improve after the initial training period, albeit at a slower O(T^{-1/3}) rate.

Figure 2: Convergence of ˆµVNMC taking M = √N. 'Steps' refers to pre-training of the variational posterior (i.e. K), with 0 steps corresponding to ˆµNMC. Means and confidence intervals as per Fig. 1.

6.3 End-to-end sequential experiments

We now demonstrate the utility of our methods for designing sequential experiments. First, we demonstrate that our variational estimators are sufficiently robust and fast to be used for adaptive experiments with a class of models that are of practical importance in many scientific disciplines. To this end, we run an adaptive psychology experiment with human participants recruited from Amazon Mechanical Turk to study how humans respond to features of stylized faces. To account for fixed effects (those common across the population) as well as individual variations that we treat as nuisance variables, we use the mixed effects regression model introduced in Sec. 6.1. See Appendix D for full details of the experiment.

To estimate the EIG for different designs, we use ˆµm+ℓ, since it yields the best performance on our mixed effects model benchmark (see Table 2). Our EIG estimator is integrated into a system that presents participants with a stimulus, receives their response, learns an updated model, and designs the next stimulus, all online. Despite the relative simplicity of the design problem (with 36 possible designs), using BOED with ˆµm+ℓ leads to a more certain (i.e. lower entropy) posterior than random design; see Figure 3.

Figure 4: Evolution of the posterior in the sequential CES experiment. (a) Total entropy of a mean-field variational approximation of the posterior. (b)(c)(d) The RMSE of the posterior approximations of ρ, α and u as compared to the true values used to simulate agent responses. Note that the scale of the vertical axis is logarithmic. All plots show the mean and ±1 std. err. from 10 independent runs.

Second, we consider a more challenging scenario in which a random design strategy gleans very little. We compare random design against two BOED strategies: ˆµmarg and ˆµNMC. Building on the revealed preference example in Sec. 6.1, we consider an experiment to infer an agent's utility function, which we model using the Constant Elasticity of Substitution (CES) model [2] with latent variables ρ, α, u. We seek designs for which the agent's response will be informative about θ = (ρ, α, u). See Appendix D for full details. We estimate the EIG using ˆµmarg because the dimension of y is smaller than that of θ, and select designs d ∈ [0, 100]^6 using Bayesian optimization. To investigate parameter recovery, we simulate agent responses from the model with fixed values of ρ, α, u. Figure 4 shows that using BOED with our marginal estimator reduces posterior entropy and concentrates more quickly on the true parameter values than both baselines. Random design makes no inroads into the learning problem, while BOED based on NMC particularly struggles at the outset, when p(θ|d_{1:t-1}, y_{1:t-1}), the prior at iteration t, is high variance.
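The greedy sequential loop underlying both experiments (estimate the EIG of each candidate design, run the most informative experiment, update the posterior, repeat) can be sketched end-to-end. The following toy uses a hypothetical two-parameter conjugate Gaussian model of our own construction, so the EIG at each step is available in closed form; in the paper's setting that line would instead call a variational estimator such as ˆµm+ℓ or ˆµmarg.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-parameter model for illustration: the design d in {0, 1}
# chooses which latent component theta[d] to probe, and we observe
# y | theta, d ~ N(theta[d], sigma2). Conjugacy makes every step closed-form.
sigma2 = 1.0
theta_true = np.array([1.5, -0.7])   # ground truth used to simulate responses

mean = np.zeros(2)                   # current posterior mean (starts at prior)
var = np.array([4.0, 0.25])          # current posterior variances

designs = []
for t in range(6):
    # EIG(d) = 0.5 * log(1 + var[d] / sigma2) in this conjugate setting;
    # in general this line is replaced by an EIG estimator over candidates.
    eig = 0.5 * np.log(1.0 + var / sigma2)
    d = int(np.argmax(eig))                                      # greedy design
    designs.append(d)
    y = theta_true[d] + np.sqrt(sigma2) * rng.standard_normal()  # run experiment
    precision = 1.0 / var[d] + 1.0 / sigma2                      # conjugate update
    mean[d] = (mean[d] / var[d] + y / sigma2) / precision
    var[d] = 1.0 / precision

print("designs chosen:", designs)
print("posterior mean:", np.round(mean, 2), "variances:", np.round(var, 3))
```

The loop first probes the component with the most uncertain posterior and switches to the other once its variance has been driven down, mirroring how informative designs target the remaining posterior uncertainty.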
Our method selects informative designs throughout.

Figure 3: Evolution of the posterior entropy of the fixed effects in the Mechanical Turk experiment in Sec. 6.3. We depict the mean and ±1 std. err. from 10 experimental trials.

7 Discussion

We have developed efficient EIG estimators that are applicable to a wide range of experimental design problems. By tackling the double intractability of the EIG in a principled manner, they provide substantially improved convergence rates relative to previous approaches, and our experiments show that these theoretical advantages translate into significant practical gains. Our estimators are well-suited to modern deep probabilistic programming languages, and we have provided an implementation in Pyro. We note that the interplay between variational and MC methods in EIG estimation is not directly analogous to that in standard inference settings, because the NMC EIG estimator is itself inherently biased. Our ˆµVNMC estimator allows one to play off the advantages of the two approaches, namely the fast learning of variational methods and the asymptotic consistency of NMC.

Acknowledgements

We gratefully acknowledge research funding from Uber AI Labs. MJ would like to thank Paul Szerlip for help generating the sprites used in the Mechanical Turk experiment. AF would like to thank Patrick Rebeschini, Dominic Richards and Emile Mathieu for their help and support. AF gratefully acknowledges funding from EPSRC grant no. EP/N509711/1. YWT's and TR's research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) ERC grant agreement no. 617071.

References

[1] Billy Amzal, Frédéric Y Bois, Eric Parent, and Christian P Robert. Bayesian-optimal design via interacting particle systems.
Journal of the American Statistical Association, 101(474):773-785, 2006.

[2] Kenneth J Arrow, Hollis B Chenery, Bagicha S Minhas, and Robert M Solow. Capital-labor substitution and economic efficiency. The Review of Economics and Statistics, pages 225-250, 1961.

[3] David Barber and Felix Agakov. The IM algorithm: a variational approach to information maximization. Advances in Neural Information Processing Systems, 16:201-208, 2003.

[4] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Devon Hjelm, and Aaron Courville. Mutual information neural estimation. In International Conference on Machine Learning, pages 530-539, 2018.

[5] Eli Bingham, Jonathan P Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D Goodman. Pyro: Deep universal probabilistic programming. The Journal of Machine Learning Research, 20(1):973-978, 2019.

[6] George EP Box, J Stuart Hunter, and William G Hunter. Statistics for experimenters. Wiley Series in Probability and Statistics. Wiley, Hoboken, NJ, 2005.

[7] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In 4th International Conference on Learning Representations, ICLR, 2016.

[8] Kathryn Chaloner and Isabella Verdinelli. Bayesian experimental design: A review. Statistical Science, pages 273-304, 1995.

[9] Alex R Cook, Gavin J Gibson, and Christopher A Gilligan. Optimal observation times in experimental epidemic processes. Biometrics, 64(3):860-868, 2008.

[10] Peter Dayan, Geoffrey E Hinton, Radford M Neal, and Richard S Zemel. The Helmholtz machine. Neural Computation, 7(5):889-904, 1995.

[11] Monroe D Donsker and SR Srinivasa Varadhan. Asymptotic evaluation of certain Markov process expectations for large time.
Communications on Pure and Applied Mathematics, 28(1):1-47, 1975.

[12] Sylvain Ehrenfeld. Some experimental design problems in attribute life testing. Journal of the American Statistical Association, 57(299):668-679, 1962.

[13] Susan E Embretson and Steven P Reise. Item response theory. Psychology Press, 2013.

[14] Andrew Gelman, Hal S Stern, John B Carlin, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian data analysis. Chapman and Hall/CRC, 2013.

[15] Daniel Golovin, Andreas Krause, and Debajyoti Ray. Near-optimal Bayesian active learning with noisy observations. In Advances in Neural Information Processing Systems, pages 766-774, 2010.

[16] José Miguel Hernández-Lobato, Matthew W Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems, pages 918-926, 2014.

[17] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.

[18] Steven Kleinegesse and Michael U Gutmann. Efficient Bayesian experimental design for implicit models. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 476-485, 2019.

[19] Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M Henne. Controlled experiments on the web: survey and practical guide. Data Mining and Knowledge Discovery, 18(1):140-181, 2009.

[20] John Kruschke. Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan. Academic Press, 2014.

[21] Tuan Anh Le, Maximilian Igl, Tom Rainforth, Tom Jin, and Frank Wood. Auto-Encoding Sequential Monte Carlo. In International Conference on Learning Representations (ICLR), 2018.

[22] Jeremy Lewi, Robert Butera, and Liam Paninski. Sequential optimal design of neurophysiology experiments. Neural Computation, 21(3):619-687, 2009.

[23] Dennis V Lindley.
On a measure of the information provided by an experiment. The Annals of Mathematical Statistics, pages 986-1005, 1956.

[24] Dennis V Lindley. Bayesian statistics, a review, volume 2. SIAM, 1972.

[25] Quan Long, Marco Scavino, Raúl Tempone, and Suojin Wang. Fast estimation of expected information gains for Bayesian experimental designs based on Laplace approximations. Computer Methods in Applied Mechanics and Engineering, 259:24-39, 2013.

[26] Chao Ma, Sebastian Tschiatschek, Konstantina Palla, José Miguel Hernández-Lobato, Sebastian Nowozin, and Cheng Zhang. EDDI: Efficient dynamic discovery of high-value information with partial VAE. arXiv preprint arXiv:1809.11142, 2018.

[27] David JC MacKay. Information-based objective functions for active data selection. Neural Computation, 4(4):590-604, 1992.

[28] Eric Moulines and Francis R Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451-459, 2011.

[29] Peter Müller. Simulation based optimal design. Handbook of Statistics, 25:509-518, 2005.

[30] Jay I Myung, Daniel R Cavagnaro, and Mark A Pitt. A tutorial on adaptive design optimization. Journal of Mathematical Psychology, 57(3-4):53-67, 2013.

[31] Ben Poole, Sherjil Ozair, Aäron van den Oord, Alexander A Alemi, and George Tucker. On variational lower bounds of mutual information. NeurIPS Workshop on Bayesian Deep Learning, 2018.

[32] Tom Rainforth. Automating Inference, Learning, and Design using Probabilistic Programming. PhD thesis, University of Oxford, 2017.

[33] Tom Rainforth, Robert Cornish, Hongseok Yang, Andrew Warrington, and Frank Wood. On nesting Monte Carlo estimators. In International Conference on Machine Learning, pages 4264-4273, 2018.

[34] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra.
Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, volume 32, pages 1278-1286, 2014.

[35] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400-407, 1951.

[36] Paul A Samuelson. Consumption theory in terms of revealed preference. Economica, 15(60):243-253, 1948.

[37] Paola Sebastiani and Henry P Wynn. Maximum entropy sampling and optimal Bayesian experimental design. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(1), 2000.

[38] Ben Shababo, Brooks Paige, Ari Pakman, and Liam Paninski. Bayesian inference and online experimental design for mapping neural microcircuits. In Advances in Neural Information Processing Systems, pages 1304-1312, 2013.

[39] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951-2959, 2012.

[40] Andreas Stuhlmüller, Jacob Taylor, and Noah Goodman. Learning stochastic inverses. In Advances in Neural Information Processing Systems, pages 3048-3056, 2013.

[41] Owen Thomas, Ritabrata Dutta, Jukka Corander, Samuel Kaski, and Michael U Gutmann. Likelihood-free inference by ratio estimation. arXiv preprint arXiv:1611.10242, 2016.

[42] Joep Vanlier, Christian A Tiemann, Peter AJ Hilbers, and Natal AW van Riel. A Bayesian approach to targeted experiment design. Bioinformatics, 28(8):1136-1142, 2012.

[43] Benjamin T Vincent and Tom Rainforth. The DARC toolbox: automated, flexible, and efficient delayed and risky choice experiments using Bayesian adaptive design.
2017.