{"title": "Observational-Interventional Priors for Dose-Response Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1561, "page_last": 1569, "abstract": "Controlled interventions provide the most direct source of information for learning causal effects. In particular, a dose-response curve can be learned by varying the treatment level and observing the corresponding outcomes. However, interventions can be expensive and time-consuming. Observational data, where the treatment is not controlled by a known mechanism, is sometimes available. Under some strong assumptions, observational data allows for the estimation of dose-response curves. Estimating such curves nonparametrically is hard: sample sizes for controlled interventions may be small, while in the observational case a large number of measured confounders may need to be marginalized. In this paper, we introduce a hierarchical Gaussian process prior that constructs a distribution over the dose-response curve by learning from observational data, and reshapes the distribution with a nonparametric affine transform learned from controlled interventions. This function composition from different sources is shown to speed-up learning, which we demonstrate with a thorough sensitivity analysis and an application to modeling the effect of therapy on cognitive skills of premature infants.", "full_text": "Observational-Interventional Priors for\n\nDose-Response Learning\n\nDepartment of Statistical Science and Centre for Computational Statistics and Machine Learning\n\nRicardo Silva\n\nUniversity College London\nricardo@stats.ucl.ac.uk\n\nAbstract\n\nControlled interventions provide the most direct source of information for learning\ncausal effects. In particular, a dose-response curve can be learned by varying the\ntreatment level and observing the corresponding outcomes. However, interventions\ncan be expensive and time-consuming. 
Observational data, where the treatment is not controlled by a known mechanism, is sometimes available. Under some strong assumptions, observational data allows for the estimation of dose-response curves. Estimating such curves nonparametrically is hard: sample sizes for controlled interventions may be small, while in the observational case a large number of measured confounders may need to be marginalized. In this paper, we introduce a hierarchical Gaussian process prior that constructs a distribution over the dose-response curve by learning from observational data, and reshapes the distribution with a nonparametric affine transform learned from controlled interventions. This function composition from different sources is shown to speed up learning, which we demonstrate with a thorough sensitivity analysis and an application to modeling the effect of therapy on cognitive skills of premature infants.\n\n1 Contribution\n\nWe introduce a new solution to the problem of learning how an outcome variable Y varies under different levels of a control variable X that is manipulated. This is done by coupling different Gaussian process priors that combine observational and interventional data. The method outperforms estimates given by using only observational or only interventional data in a variety of scenarios and provides an alternative way of interpreting related methods in the design of computer experiments.\nMany problems in causal inference [14] consist of having a treatment variable X and an outcome Y , and estimating how Y varies as we control X at different levels. If we have data from a randomized controlled trial, where X and Y are not confounded, many standard modeling approaches can be used to learn the relationship between X and Y . 
If X and Y are measured in an observational study, the corresponding data can be used to estimate the association between X and Y , but this may not be the same as the causal relationship of these two variables because of possible confounders.\nTo distinguish between the observational regime (where X is not controlled) and the interventional regime (where X is controlled), we adopt the causal graphical framework of [16] and [19]. In Figure 1 we illustrate the different regimes using causal graphical models. We will use p(\u00b7 | \u00b7) to denote (conditional) density or probability mass functions. In Figure 1(a) we have the observational, or \u201cnatural,\u201d regime where common causes Z generate both treatment variable X and outcome variable Y . While the conditional distribution p(Y = y | X = x) can be learned from this data, this quantity is not the same as p(Y = y | do(X = x)): the latter notation, due to Pearl [16], denotes a regime where X is not random, but a quantity set by an intervention performed by an external agent. The relation between these regimes comes from fundamental invariance assumptions: when X is intervened upon,\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n(a) (b) (c)\n\nFigure 1: Graphs representing causal graphical models. Circles represent random variables, squares represent fixed constants. (a) A system where Z is a set of common causes (confounders), common parents of X and Y here represented as a single vertex. (b) An intervention overrides the value of X, setting it to some constant. The rest of the system remains invariant. (c) ZO is not a common cause of X and Y , but blocks the influence of confounder ZH.\n\n\u201call other things are equal,\u201d and this invariance is reflected by the fact that the model in Figure 1(a) and Figure 1(b) share the same conditional distribution p(Y = y | X = x, Z = z) and marginal distribution p(Z = z). 
If we observe Z, p(Y = y | do(X = x)) can be learned from observational data, as we explain in the next section.\nOur goal is to learn the relationship\n\nf(x) \u2261 E[Y | do(X = x)], x \u2208 X , (1)\n\nwhere X \u2261 {x1, x2, . . . , xT } is a pre-defined set of treatment levels. We call the vector f(X ) \u2261 [f(x1); . . . ; f(xT )]\u22a4 the response curve for the \u201cdoses\u201d X . Although the term \u201cdose\u201d is typically associated with the medical domain, we adopt here the term dose-response learning in its more general setup: estimating the causal effect of a treatment on an outcome across different (quantitative) levels of treatment. We assume that the causal structure information is known, complementing approaches for structure learning [19, 9] by tackling the quantitative side of causal prediction.\nIn Section 2, we provide the basic notation of our setup. Section 3 describes our model family. Section 4 provides a thorough set of experiments assessing our approach, including sensitivity to model misspecification. We provide final conclusions in Section 5.\n\n2 Background\n\nThe target estimand p(Y = y | do(X = x)) can be derived from the structural assumptions of Figure 1(b) by standard conditioning and marginalization operations:\n\np(Y = y | do(X = x)) = \u222b p(Y = y | X = x, Z = z) p(Z = z) dz. (2)\n\nNotice the important difference between the above and p(Y = y | X = x), which can be derived from the assumptions in Figure 1(a) by marginalizing over p(Z = z | X = x) instead. The observational and interventional distributions can be very different. The above formula is sometimes known as the back-door adjustment [16] and it does not require measuring all common causes of treatment and outcome. It suffices that we measure variables Z that block all \u201cback-door paths\u201d between X and Y , a role played by ZO in Figure 1(c). 
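As a concrete illustration of the gap between the two regimes, the following sketch simulates a hypothetical linear-Gaussian instance of Figure 1(a) and its intervened version in Figure 1(b); all coefficients are illustrative assumptions, chosen only so that the observational and interventional answers differ visibly.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# Hypothetical linear-Gaussian system: Z confounds X and Y (Figure 1(a)).
Z = rng.normal(size=N)
X = 1.5 * Z + rng.normal(size=N)               # observational regime: X responds to Z
Y = -1.0 * X + 2.0 * Z + rng.normal(size=N)    # true causal coefficient of X is -1

# Observational conditional E[Y | X near 1]: biased by the path X <- Z -> Y.
obs_mean = Y[np.abs(X - 1.0) < 0.05].mean()

# Interventional regime (Figure 1(b)): do(X = 1) overrides X,
# while p(Z) and p(Y | X, Z) remain invariant.
Y_do = -1.0 * np.ones(N) + 2.0 * Z + rng.normal(size=N)
int_mean = Y_do.mean()
```

In this instance the observational average sits near 2 E[Z | X = 1] − 1 ≈ −0.08, while the interventional average approaches E[Y | do(X = 1)] = −1: conditioning and intervening answer different questions.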
A formal description of which variables Z will validate (2) is given by [20, 16, 19]. We will assume that the selection of which variables Z to adjust for has been decided prior to our analysis, although in our experiments in Section 4 we will assess the behavior of our method under model misspecification. Our task is to estimate (1) nonparametrically given observational and experimental data, assuming that Z satisfies the back-door criteria.\nOne possibility for estimating (1) from observational data Dobs \u2261 {(Y (i), X (i), Z (i))}, 1 \u2264 i \u2264 N, is by first estimating g(x, z) \u2261 E[Y | X = x, Z = z]. The resulting estimator,\n\n\u02c6f(x) \u2261 (1/N) \u2211_{i=1}^{N} \u02c6g(x, z(i)), (3)\n\nis consistent under some general assumptions on f(\u00b7) and g(\u00b7, \u00b7). Estimating g(\u00b7, \u00b7) nonparametrically seems daunting, since Z can in principle be high-dimensional. However, as shown by [5], under some conditions the problem of estimating \u02c6f(\u00b7) nonparametrically via (3) is no harder than a one-dimensional nonparametric regression problem. There is however one main catch: while observational data can be used to choose the level of regularization for \u02c6g(\u00b7), this is not likely to be an optimal choice for \u02c6f(\u00b7) itself. Nevertheless, even if suboptimal smoothing is done, the use of nonparametric methods for estimating causal effects by back-door adjustment has been successful. For instance, [7] uses Bayesian classification and regression trees for this task.\nAlthough of practical use, there are shortcomings in this idea even under the assumption that Z provides a correct back-door adjustment. 
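A minimal sketch of the plug-in estimator (3), on a hypothetical confounded linear-Gaussian model of the Figure 1(a) type; ordinary least squares stands in here for the nonparametric estimate of g discussed in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50_000

# Hypothetical confounded data: Z drives both X and Y.
Z = rng.normal(size=N)
X = 1.5 * Z + rng.normal(size=N)
Y = -1.0 * X + 2.0 * Z + rng.normal(size=N)   # causal coefficient of X is -1

# Step 1: estimate g(x, z) = E[Y | X = x, Z = z]; least squares stands in
# for the nonparametric regression of the text.
design = np.column_stack([np.ones(N), X, Z])
beta, *_ = np.linalg.lstsq(design, Y, rcond=None)

def g_hat(x, z):
    return beta[0] + beta[1] * x + beta[2] * z

# Step 2: back-door average (3): mean of g_hat(x, z^(i)) over the empirical z's.
def f_hat(x):
    return g_hat(x, Z).mean()
```

The slope f_hat(1) − f_hat(0) recovers the causal coefficient −1, whereas regressing Y on X alone would recover the biased observational slope.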
In particular, Bayesian measures of uncertainty should be interpreted with care: a fully Bayesian equivalent to (3) would require integrating over a model for p(Z) instead of the empirical distribution for Z in Dobs; evaluating a dose x might require combining many g(x, z(i)) where the corresponding training measurements x(i) are far from x, resulting in possibly unreliable extrapolations with poorly calibrated credible intervals. While there are well-established approaches to deal with this \u201clack of overlap\u201d problem in binary treatments or linear responses [18, 8], it is less clear what to do in the continuous case with nonlinear responses.\nIn this paper, we focus on a setup where it is possible to collect interventional data such that treatments are controlled, but where sample sizes might be limited due to financial and time costs. This is related to the design of computer experiments, where (cheap, but biased) computer simulations are combined with field experiments [2, 6]. The key idea of combining two sources of data is very generic, with the value of new methods lying in the design of adequate prior families. For instance, if computer simulations are noisy, it may not be clear how uncertainty at that level should be modeled. We leverage knowledge of adjustment techniques for causal inference, which provides a partially automated recipe to transform observational data into informed priors. We leverage knowledge of the practical shortcomings of the nonparametric adjustment (3) so that, unlike the biased but low-variance setup of computer experiments, we try to improve the (theoretically) unbiased but possibly oversmooth structure of such estimators by introducing a layer of pointwise affine transformations.\nHeterogeneous effects and stratification. One might ask why marginalize Z in (2), as it might be of greater interest to understand effects at the finer subpopulation levels conditioned on Z. 
In fact, (2) should be seen as the most general case, where conditioning on a subset of covariates (for instance, gender) will provide the possibly different average causal effect for each given stratum (different levels of gender) marginalized over the remaining covariates. Randomized fine-grained effects might be hard to estimate and require stronger smoothing and extrapolation assumptions, but in principle they could be integrated with the approaches discussed here. In practice, in causal inference we are generally interested in marginal effects for some subpopulations where many covariates might not be practically measurable at decision time, and for the scientific purposes of understanding total effects [5] at different levels of granularity with weaker assumptions.\n\n3 Hierarchical Priors via Inherited Smoothing and Local Affine Changes\n\nThe main idea is to first learn from observational data a Gaussian process over dose-response curves, then compose it with a nonlinear transformation biased toward the identity function. The fundamental innovation is the construction of a nonstationary covariance function from observational data.\n\n3.1 Two-layered Priors for Dose-responses\n\nGiven an observational dataset Dobs of size N, we fit a Gaussian process to learn a regression model of outcome Y on (uncontrolled) treatment X and covariates Z. A Gaussian likelihood for Y given X and Z is adopted, with conditional mean g(x, z) and variance \u03c3\u00b2g. A Mat\u00e9rn 3/2 covariance function with automatic relevance determination priors is given to g(\u00b7, \u00b7), followed by marginal maximum likelihood to estimate \u03c3\u00b2g and the covariance hyperparameters [12, 17]. This provides a posterior distribution over functions g(\u00b7, \u00b7) in the input space of X and Z. We then define fobs(x), x \u2208 X , as\n\nfobs(x) \u2261 (1/N) \u2211_{i=1}^{N} g(x, z(i)), (4)\n\nwhere the set {g(x, z(i))} is unknown. 
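Because (4) is a linear functional of the Gaussian process posterior over g, the implied moments of fobs(X ) follow from averaging blocks of the predictive mean and covariance of g. A sketch, assuming mu_g and K_g are the GP predictive mean and covariance at all T\u00b7N pairs (x_t, z^(i)) (the GP fit itself is not shown):

```python
import numpy as np

def fobs_moments(mu_g, K_g, T, N):
    """Moments of f_obs(X) = (1/N) sum_i g(x_t, z^(i)) for t = 1..T.

    mu_g: (T*N,) predictive means, ordered by treatment level then data point.
    K_g:  (T*N, T*N) predictive covariance of the same evaluations.
    """
    # A averages each block of N evaluations sharing a treatment level.
    A = np.kron(np.eye(T), np.full((1, N), 1.0 / N))   # shape (T, T*N)
    mu_obs = A @ mu_g
    K_obs = A @ K_g @ A.T   # generally a nonstationary covariance
    return mu_obs, K_obs
```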
Uncertainty about fobs(\u00b7) comes from the joint predictive distribution of {g(x, z(i))} learned from Dobs, itself a Gaussian distribution with a T N \u00d7 1 mean vector \u00b5\u22c6g and a T N \u00d7 T N covariance matrix, T \u2261 |X |. Since (4) is a linear function of {g(x, z(i))}, this implies fobs(X ) is also a (nonstationary) Gaussian process with mean \u00b5obs(x) = (1/N) \u2211_{i=1}^{N} \u00b5\u22c6g(x, z(i)) for each x \u2208 X . The motivation for (4) is that \u00b5obs is an estimator of the type (3), inheriting its desirable properties and caveats.\nThe cost of computing the covariance matrix Kobs of fobs(X ) is O(T\u00b2N\u00b2), potentially expensive. In many practical applications, however, the size of X is not particularly large, as it is a set of intervention points to be decided according to practical real-world constraints. In our simulations in Section 4, we chose T = |X | = 20. Approximating such a covariance matrix, if necessary, is a future research topic.\nAssume interventional data Dint \u2261 {(Y (i)int, x(i)int)}, 1 \u2264 i \u2264 M, is provided (with assignments x(i)int chosen by some pre-defined design in X ). We assign a prior to f(\u00b7) according to the model\n\nfobs(X ) \u223c N(\u00b5obs, Kobs)\na(X ) \u223c N(1, Ka)\nb(X ) \u223c N(0, Kb)\nf(X ) = a(X ) \u2299 fobs(X ) + b(X )\nY (i)int \u223c N(f(x(i)int), \u03c3\u00b2int), 1 \u2264 i \u2264 M, (5)\n\nwhere N(m, V) is the multivariate normal distribution with mean m and covariance matrix V, \u2299 is the elementwise product, a(\u00b7) is a vector which we call the distortion function, and b(\u00b7) the translation function. The role of the \u201celementwise affine\u201d transform a \u2299 fobs + b is to bias f toward fobs with uncertainty that varies depending on our uncertainty about fobs. The multiplicative component a \u2299 fobs also induces a heavy-tail prior on f. 
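Drawing from the prior (5) amounts to three Gaussian draws and an elementwise combination. A sketch with placeholder moments: the real \u00b5obs and Kobs come from the observational fit, and Ka, Kb from the construction of Section 3.2; here plain squared exponentials stand in for all of them.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 20
grid = np.linspace(-3.0, 3.0, T)

def sq_exp(scale, amp):
    # Generic squared exponential covariance on the treatment grid, with jitter.
    d2 = np.subtract.outer(grid, grid) ** 2
    return amp * np.exp(-0.5 * d2 / scale ** 2) + 1e-8 * np.eye(T)

# Placeholder moments standing in for the quantities learned from data.
mu_obs = np.sin(grid)
K_obs = sq_exp(1.5, 0.1)
K_a, K_b = sq_exp(2.0, 0.05), sq_exp(2.0, 0.05)

f_obs = rng.multivariate_normal(mu_obs, K_obs)
a = rng.multivariate_normal(np.ones(T), K_a)    # distortion, centered at 1
b = rng.multivariate_normal(np.zeros(T), K_b)   # translation, centered at 0
f = a * f_obs + b                               # elementwise affine transform
y_int = f + 0.1 * rng.normal(size=T)            # interventional outcomes
```

Centering a at 1 and b at 0 is what biases the composed curve f toward fobs, as described in the text.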
In the Supplementary Material, we discuss briefly the alternative of using the deep Gaussian process of [4] in our observational-interventional setup.\n\n3.2 Hyperpriors\n\nWe parameterize Ka as follows. Every entry ka(x, x\u2032) of Ka, (x, x\u2032) \u2208 X \u00d7 X , assumes the shape of a squared exponential kernel modified according to the smoothness and scale information obtained from Dobs. First, define ka(x, x\u2032) as\n\nka(x, x\u2032) \u2261 \u03bba \u00d7 vx \u00d7 vx\u2032 \u00d7 exp( \u2212[(\u02c6x \u2212 \u02c6x\u2032)\u00b2 + (\u02c6yx \u2212 \u02c6yx\u2032)\u00b2] / (2\u03c3\u00b2a) ) + \u03b4(x \u2212 x\u2032)10\u207b\u2075, (6)\n\nwhere (\u03bba, \u03c3a) are hyperparameters, \u03b4(\u00b7) is the delta function, vx is a rescaling of Kobs(x, x)^{1/2}, \u02c6x is a rescaling of X to the [0, 1] interval, and \u02c6yx is a rescaling of \u00b5obs(x) to the [0, 1] interval. More precisely,\n\n\u02c6x \u2261 (x \u2212 min(X )) / (max(X ) \u2212 min(X )), \u02c6yx \u2261 (\u00b5obs(x) \u2212 min(\u00b5obs(X ))) / (max(\u00b5obs(X )) \u2212 min(\u00b5obs(X ))), vx = \u221a(Kobs(x, x) / max_x\u2032 Kobs(x\u2032, x\u2032)). (7)\n\nEquation (6) is designed to borrow information from the (estimated) smoothness of f(X ), by decreasing the correlation of the distortion factors a(x) and a(x\u2032) as a function of the Euclidean distance between the 2D points (x, \u00b5obs(x)) and (x\u2032, \u00b5obs(x\u2032)), properly scaled. Hyperparameter \u03c3a controls how this distance is weighted. Equation (6) also captures information about the amplitude of the distortion signal, making it proportional to the ratios of the diagonal entries of Kobs(X ). Hyperparameter \u03bba controls how this amplitude is globally adjusted. The nugget 10\u207b\u2075 brings stability to the sampling of a(X ) within Markov chain Monte Carlo (MCMC) inference. 
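The covariance of (6)-(7) can be assembled directly from the observational moments. A sketch, with hypothetical names lam_a and sigma_a for the hyperparameters \u03bba and \u03c3a; the 2\u03c3\u00b2a scaling in the exponent is an assumption of this sketch, matching a standard squared exponential form.

```python
import numpy as np

def make_K_a(X, mu_obs, K_obs, lam_a=1.0, sigma_a=1.0):
    """Distortion covariance built from observational smoothness and scale.

    X: (T,) treatment levels; mu_obs, K_obs: moments of f_obs(X);
    lam_a, sigma_a: hyperparameters (lambda_a, sigma_a in the text).
    """
    x_hat = (X - X.min()) / (X.max() - X.min())
    y_hat = (mu_obs - mu_obs.min()) / (mu_obs.max() - mu_obs.min())
    v = np.sqrt(np.diag(K_obs) / np.diag(K_obs).max())
    # Squared Euclidean distance between the scaled 2D points (x_hat, y_hat).
    d2 = np.subtract.outer(x_hat, x_hat) ** 2 + np.subtract.outer(y_hat, y_hat) ** 2
    return (lam_a * np.outer(v, v) * np.exp(-0.5 * d2 / sigma_a ** 2)
            + 1e-5 * np.eye(len(X)))
```

The outer product of the v terms scales the distortion amplitude by the (relative) observational uncertainty at each dose, and the nugget term keeps the matrix well conditioned for MCMC.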
Hyper-hyperpriors on \u03bba and \u03c3a are set as\n\nlog(\u03bba) \u223c N(0, 0.5), log(\u03c3a) \u223c N(0, 0.1). (8)\n\nThat is, \u03bba follows a log-Normal distribution with median 1, approximately 90% of the mass below 2.5, and a long tail to the right. The implied distribution for a(x) where vx = 1 will have most of its mass within a factor of 10 from its median. The prior on \u03c3a follows a similar shape, but with a narrower allocation of mass. Covariance matrix Kb is defined in the same way, with its own hyperparameters \u03bbb and \u03c3b. Finally, the usual Jeffreys prior for error variances is given to \u03c3\u00b2int.\nFigure 2 shows an example of inference obtained from synthetic data, generated according to the protocol of Section 4. In this example, the observational relationship between X and Y has the opposite association of the true causal one, but after adjusting for 15 of the 25 confounders that\n\nFigure 2: An example with synthetic data (|Z| = 25), from priors to posteriors. Figure best seen in color. Top row: scatterplot of observational data, with true dose-response function in solid green, adjusted \u00b5obs in dashed red, and the unadjusted Gaussian process regression of Y on X in dashed-and-circle magenta (which is a very badly biased estimate in this example); scatterplot in the middle shows interventional data, 20 dose levels uniformly spread in the support of the observational data and 10 outputs per level \u2212 notice that the sign of the association is the opposite of the observational regime; matrix Kobs is depicted at the end, where the nonstationarity of the process is evident. Middle row: priors constructed on fobs(X ) and a(X ) with respective means; plot at the end corresponds to the implied prior on a \u2299 fobs + b. 
Bottom row: the respective posteriors obtained by Gibbs sampling.\n\ngenerated the data (10 confounders are randomly ignored to mimic imperfect prior knowledge), a reasonable initial estimate for f(X ) is obtained. The combination with interventional data results in a much better fit, but imperfections still exist at the strongest levels of treatment: the green curve drops at x > 2 more strongly than the expected posterior mean. This is due to having both a prior derived from observational data that got the wrong direction of the dose-response curve at x > 1.5, and being unlucky at drawing several higher-than-expected values in the interventional regime for x = 3. The model thus shows its strength in capturing much of the structure of the true dose-response curve even under misspecified adjustments, but the example provides a warning that only so much can be done given unlucky draws from a small interventional dataset.\n\n3.3 Inference, Stratified Learning and Active Learning\n\nIn our experiments, we infer posterior distributions by Gibbs sampling, alternating the sampling of latent variables f(X ), a(X ), b(X ) and hyperparameters \u03bba, \u03c3a, \u03bbb, \u03c3b, \u03c3\u00b2int, using slice sampling [15] for the hyperparameters. The meaning of the individual posterior distribution over fobs(X ) might also be of interest. This quantity is potentially identifiable by considering a joint model for (Dobs, Dint): in this case, fobs(X ) learns the observational adjustment \u222b g(x, z)p(z) dz. This suggests that the posterior distribution for fobs(X ) will change little according to model (5), which is indeed observed in practice and illustrated by Figure 2. Learning the hyperparameters for Kobs could be done jointly with the remaining hyperparameters, but the cost per iteration would be high due to the update of Kobs. The MCMC procedure for (5) is relatively inexpensive assuming that |X | is small. Learning the hyperparameters of Kobs separately is a type of \u201cmodularization\u201d of Bayesian inference [10].\nAs we mentioned in Section 2, it is sometimes desirable to learn dose-response curves conditioned on a few covariates S \u2282 Z of interest. In particular, in this paper we will consider the case of straightforward stratification: given a set S of discrete covariates assuming instantiations s, we have functions f s(X ) to be learned. Different estimation techniques can be used to borrow statistical strength across levels of S, both for f s(X ) and f s_obs(X ). 
However, in our implementation, where we assume |S| is very small (a realistic case for many experimental designs), we construct independent priors for the different f s_obs(X ) with independent affine transformations.\nFinally, in the Supplementary Material we also consider simple active learning schemes [11], as suggested by the fact that prior information already provides different estimates of uncertainty across X (Figure 2), which is sometimes dramatically nonstationary.\n\n4 Experiments\n\nAssessing causal inference algorithms requires fitting and predicting data generated by expensive randomized trials. Since this is typically unavailable, we will use simulated data where the truth is known. We divide our experiments into two types: first, one where we generate random dose-response functions, which allows us to control the difficulty of the problem in different directions; second, one where we start from a real-world dataset and generate \u201crealistic\u201d dose-response curves from which simulated data can be given as input to the method.\n\n4.1 Synthetic Data Studies\n\nWe generate studies where the observational sample has N = 1000 data points and |Z| = 25 confounders. Interventional data is generated at three different levels of sample size, M = 40, 100 and 200, where the intervention space X is evenly distributed within the range shown by the observational data, with |X | = 20. Covariates Z are generated from a zero-mean, unit-variance Gaussian with correlation of 0.5 for all pairs. Treatment X is generated by first sampling a function fi(zi) for every covariate from a Gaussian process, summing over 1 \u2264 i \u2264 25 and adding Gaussian noise. 
Outcome Y is generated by first sampling linear coefficients and one intercept to weight the contribution of confounders Z, and then passing the linear combination through a quadratic function. The dose-response function of X on Y is generated as a polynomial, which is added to the contribution of Z and a Gaussian error. In this way, it is easy to obtain the dose-response function analytically.\nBesides varying M, we vary the setup in three other aspects: first, the dose-response is either a quadratic or cubic polynomial; second, the contribution of X is scaled to have its minimum and maximum value span either 50% or 80% of the range of all other causes of Y , including the Gaussian noise (a span of 50% already generates functions of modest impact on the total variability of Y ); third, the actual data given to the algorithm contains only 15 of the 25 confounders. We either discard 10 confounders uniformly at random (the RANDOM setup), or remove the \u201ctop 10 strongest\u201d confounders, as measured by how little confounding remains after adjusting for that single covariate alone (the ADVERSARIAL setup). In the interest of space, we provide a fully detailed description of the experimental setup in the Supplementary Material. Code is also provided to regenerate our data and re-run all of these experiments.\nEvaluation is done in two ways. First, by the normalized absolute difference between an estimate \u02c6f(x) and the true f(x), averaged over X . The normalization is done by dividing the difference by the gap between the maximum and minimum true values of f(X ) within each simulated problem1. The second measure is the log density of each true f(x), averaged over x \u2208 X , according to the inferred posterior distribution approximated as a Gaussian distribution, with mean and variance estimated by MCMC. We compare our method against: I. 
a variation of it where a and b are fixed at 1 and 0, so the only randomness is in fobs; II. instead of an affine transformation, we set f(X ) = fobs(X ) + r(X ), where r is given a generic squared exponential Gaussian process prior, which is fit by marginal maximum likelihood; III. Gaussian process regression with a squared exponential kernel applied to the interventional data only and hyperparameters fitted by marginal likelihood. The idea is that competitors I and II provide sensitivity analysis of whether our more specialized prior is adding value. In particular, competitor II would be closer to the traditional priors used in computer-aided experimental design [2] (but for our specialized Kobs).\n\n1 Data is also normalized to a zero mean, unit variance according to the empirical mean and variance of the observational data, in order to reduce variability across studies.\n\nTable 1: For each experiment, we have either quadratic (Q) or cubic (C) ground truth, with a signal range of 50% or 80%, and an interventional sample size of M = 40, 100 and 200. Ei denotes the difference between competitor i and our method regarding mean error; see text for a description of competitors. The mean absolute error for our method is approximately 0.20 for M = 40 and 0.10 for M = 200 across scenarios. Li denotes the difference between our method and competitor i regarding log-likelihood (differences > 10 are ignored, see text). That is, positive values indicate our method is better according to the corresponding criterion. All results are averages over 50 independent simulations; italics indicate statistically significant differences by a two-tailed t-test at level \u03b1 = 0.05.\n\n[Table 1: EI, EII, EIII and LI, LII, LIII for each of the eight scenarios (Q/C \u00d7 50%/80% \u00d7 RANDOM/ADV) at M = 40, 100, 200; the alignment of the numeric cells was not recoverable from the extraction.]\n\n
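For reference, the first evaluation criterion of Section 4.1 (the normalized absolute difference) is straightforward to compute; a sketch, with function and variable names chosen here for illustration:

```python
import numpy as np

def normalized_abs_error(f_hat, f_true):
    """Mean |f_hat - f_true| over the treatment grid, divided by the gap
    between the maximum and minimum true values of f (Section 4.1)."""
    gap = f_true.max() - f_true.min()
    return np.abs(f_hat - f_true).mean() / gap
```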
Results are shown in Table 1, according to the two assessment criteria, using E for average absolute error and L for average log-likelihood. Our method demonstrated robustness to varying degrees of unmeasured confounding. Compared to Competitor I, the mean obtained without any further affine transformation already provides a competitive estimator of f(X ), but this suffers when unmeasured confounding is stronger (the ADVERSARIAL setup). Moreover, uncertainty estimates given by Competitor I tend to be overconfident. Competitor II does not make use of our special covariance function for the correction, and tends to be particularly weak against our method at lower interventional sample sizes. Along the same lines, our advantage over Competitor III starts stronger at M = 40 and diminishes as expected when M increases. Competitor III is particularly bad at lower signal-to-noise-ratio problems, where sometimes it is overly confident that f(X ) is zero everywhere (hence, we ignore large likelihood discrepancies in our evaluation). This suggests that in order to learn specialized curves for particular subpopulations, where M will invariably be small, an end-to-end model for observational and interventional data might be essential.\n\n4.2 Case Study\n\nWe consider an adaptation of the study analyzed by [7]. Targeted at premature infants with low birth weight, the Infant Health and Development Program (IHDP) was a study of the efficacy of \u201ceducational and family support services and pediatric follow-up offered during the first 3 years of life\u201d [3]. The study originally randomized infants into those that received treatment and those that did not. The outcome variable was an IQ test applied when infants reached 3 years. Among those that received treatment, there was a range in the number of days of treatment. That dose level was not randomized, and again we do not have ground truth for the dose-response curve. 
For our assessment, we fit a dose-response curve using Gaussian processes with a Gaussian likelihood function and the back-door adjustment (3) on available covariates. We then use the model to generate independent synthetic \u201cinterventional data.\u201d Measured covariates include birth weight, sex, whether the mother smoked during pregnancy, among other factors detailed by [7, 3]. The Supplementary Material goes into detail about the preprocessing, including R/MATLAB scripts to generate the data. The observational sample contained 347 individuals (corresponding only to those who were eligible for treatment and had no missing outcome variable) and 21 covariates. This sample included 243 infants whose mothers attended (some) high school but not college, and 104 with at least some college.\nWe generated 100 synthetic interventional datasets stratified by mother\u2019s education, (some) high school vs. (some) college. Nineteen treatment levels were pre-selected (0 to 450 days, in increments of 25 days). All variables were standardized to zero mean and unit standard deviation according to the observational distribution per stratum.\n\n(a) (b) (c)\n\nFigure 3: An illustration of a problem generated from a model fitted to real data. That is, we generated data from \u201cinterventions\u201d simulated from a model that was fitted to an actual study on premature infant development [3], where the dose is the number of days that an infant is assigned to follow a development program and the outcome is an IQ test at age 3. (a) Posterior distribution for the stratum of infants whose mothers had up to some high school education, but no college. The red curve is the posterior mean of our method, and the blue curve the result of a Gaussian process fit with interventional data only. (b) Posterior distributions for the infants whose mothers had (some) college education. (c) The combined strata.\n\n
Two representative simulated studies are shown in Figure 3, depicting dose-response curves that show modest evidence of non-linearity and differ in range per stratum.2 On average, our method improved over a Gaussian process with a squared exponential covariance function fit to interventional data only. According to the average normalized absolute differences, the improvement was 0.06, 0.07 and 0.08 for the high school, college and combined data, respectively (with error reduced in 82%, 89% and 91% of the runs, respectively); in each case, 10 interventional samples were simulated per treatment level per stratum.

5 Conclusion

We introduced a simple, principled way of combining observational and interventional measurements and assessed its accuracy and robustness. In particular, we emphasized robustness to model misspecification and performed a sensitivity analysis to assess the importance of each individual component of our prior, contrasted with off-the-shelf solutions that can be found in related domains [2].
We are aware that many practical problems remain. For instance, we have not discussed at all the important issue of sample selection bias, where volunteers for an interventional study might not come from the same p(Z) distribution as in the observational study. Worse, neither the observational nor the interventional data might come from the population in which we want to enforce a policy learned from the combined data. While these essential issues were set aside here, our method can in principle be combined with ways of assessing and correcting for sample selection bias [1]. Moreover, if unmeasured confounding is too strong, one cannot expect to do well. Methods for sensitivity analysis of confounding assumptions [13] can be integrated with our framework.
A more thorough analysis of active learning using our approach, particularly in light of possible model misspecification, is needed, as our results in the Supplementary Material only superficially cover this aspect.

Acknowledgments

Thanks to Jennifer Hill for clarifications about the IHDP data, and to Robert Gramacy for several useful discussions.

2 We do not claim that these curves represent the true dose-response curves: confounders are very likely to exist, as the dose level was not decided at the beginning of the trial and is likely to have been changed "on the fly" as the infant responded. It is plausible that our covariates cannot reliably account for this feedback effect.

References

[1] E. Bareinboim and J. Pearl. Causal inference from Big Data: Theoretical foundations and the data-fusion problem. Proceedings of the National Academy of Sciences, in press, 2016.

[2] M. J. Bayarri, J. O. Berger, R. Paulo, J. Sacks, J. A. Cafeo, J. Cavendish, C.-H. Lin, and J. Tu. A framework for validation of computer models. Technometrics, 49:138–154, 2007.

[3] J. Brooks-Gunn, F. Liaw, and P. Klebanov. Effects of early intervention on cognitive function of low birth weight preterm infants. Journal of Pediatrics, 120:350–359, 1991.

[4] A. Damianou and N. D. Lawrence. Deep Gaussian processes. Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 207–215, 2013.

[5] J. Ernest and P. Bühlmann. Marginal integration for nonparametric causal inference.
Electronic Journal of Statistics, 9:3155–3194, 2015.

[6] R. Gramacy and H. K. Lee. Bayesian treed Gaussian process models with an application to computer modeling. Journal of the American Statistical Association, 103:1119–1130, 2008.

[7] J. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20:217–240, 2011.

[8] J. Hill and Y.-S. Su. Assessing lack of common support in causal inference using Bayesian nonparametrics: Implications for evaluating the effect of breastfeeding on children's cognitive outcomes. The Annals of Applied Statistics, 7:1386–1420, 2013.

[9] A. Hyttinen, F. Eberhardt, and P. O. Hoyer. Experiment selection for causal discovery. Journal of Machine Learning Research, 14:3041–3071, 2013.

[10] F. Liu, M. J. Bayarri, and J. O. Berger. Modularization in Bayesian analysis, with emphasis on analysis of computer models. Bayesian Analysis, 4:119–150, 2009.

[11] D. J. C. MacKay. Information-based objective functions for active data selection. Neural Computation, 4:590–604, 1992.

[12] D. J. C. MacKay. Bayesian non-linear modelling for the prediction competition. ASHRAE Transactions, 100:1053–1062, 1994.

[13] L. C. McCandless, P. Gustafson, and A. R. Levy. Bayesian sensitivity analysis for unmeasured confounding in observational studies. Statistics in Medicine, 26:2331–2347, 2007.

[14] S. L. Morgan and C. Winship. Counterfactuals and Causal Inference: Methods and Principles for Social Research. Cambridge University Press, 2014.

[15] R. Neal. Slice sampling. The Annals of Statistics, 31:705–767, 2003.

[16] J. Pearl. Causality: Models, Reasoning and Inference. Cambridge University Press, 2000.

[17] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[18] J. Robins, M. Sued, Q. Lei-Gomez, and A. Rotnitzky.
Comment: Performance of double-robust estimators when "inverse probability" weights are highly variable. Statistical Science, 22:544–559, 2007.

[19] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2000.

[20] T. VanderWeele and I. Shpitser. A new criterion for confounder selection. Biometrics, 67:1406–1413, 2011.