{"title": "Removing Hidden Confounding by Experimental Grounding", "book": "Advances in Neural Information Processing Systems", "page_first": 10888, "page_last": 10897, "abstract": "Observational data is increasingly used as a means for making individual-level causal predictions and intervention recommendations. The foremost challenge of causal inference from observational data is hidden confounding, whose presence cannot be tested in data and can invalidate any causal conclusion. Experimental data does not suffer from confounding but is usually limited in both scope and scale. We introduce a novel method of using limited experimental data to correct the hidden confounding in causal effect models trained on larger observational data, even if the observational data does not fully overlap with the experimental data. Our method makes strictly weaker assumptions than existing approaches, and we prove conditions under which it yields a consistent estimator. We demonstrate our method's efficacy using real-world data from a large educational experiment.", "full_text": "Removing Hidden Confounding by\n\nExperimental Grounding\n\nNathan Kallus\n\nCornell University and Cornell Tech\n\nNew York, NY\n\nkallus@cornell.edu\n\nAahlad Manas Puli\nNew York University\n\nNew York, NY\n\napm470@nyu.edu\n\nUri Shalit\nTechnion\n\nHaifa, Israel\n\nurishalit@technion.ac.il\n\nAbstract\n\nObservational data is increasingly used as a means for making individual-level\ncausal predictions and intervention recommendations. The foremost challenge of\ncausal inference from observational data is hidden confounding, whose presence\ncannot be tested in data and can invalidate any causal conclusion. Experimental\ndata does not suffer from confounding but is usually limited in both scope and\nscale. 
We introduce a novel method of using limited experimental data to correct the hidden confounding in causal effect models trained on larger observational data, even if the observational data does not fully overlap with the experimental data. Our method makes strictly weaker assumptions than existing approaches, and we prove conditions under which it yields a consistent estimator. We demonstrate our method's efficacy using real-world data from a large educational experiment.

1 Introduction

In domains such as healthcare, education, and marketing there is growing interest in using observational data to draw causal conclusions about individual-level effects; for example, using electronic healthcare records to determine which patients should get what treatments, using school records to optimize educational policy interventions, or using past advertising campaign data to refine targeting and maximize lift. Observational datasets, due to their often very large number of samples and exhaustive scope (many measured covariates) in comparison to experimental datasets, offer a unique opportunity to uncover fine-grained effects that may apply to many target populations.
However, a significant obstacle when attempting to draw causal conclusions from observational data is the problem of hidden confounders: factors that affect both treatment assignment and outcome, but are unmeasured in the observational data. Example cases where hidden confounders arise include physicians prescribing medication based on indicators not present in the health record, or classes being assigned a teacher's aide because of special efforts by a competent school principal. Hidden confounding can lead to non-vanishing bias in causal estimates even in the limit of infinite samples [Pea09].
In an observational study, one can never prove that there is no hidden confounding [Pea09]. 
However, a possible fix can be found if there exists a Randomized Controlled Trial (RCT) testing the effect of the intervention in question. For example, if a Health Management Organization (HMO) is considering the effect of a medication on its patient population, it might look at an RCT which tested this medication. The problem with using RCTs is that often their participants do not fully reflect the target population. As an example, an HMO in California might have to use an RCT from Switzerland, conducted perhaps several years ago, on a much smaller population. The problem of generalizing conclusions from an RCT to a different target population is known as the problem of external validity [Rot05, AO17], or more specifically, transportability [BP13, PB14].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In this paper, we are interested in the case where fine-grained causal inference is sought, in the form of Conditional Average Treatment Effects (CATE), where we consider a large set of covariates, enough to identify each unit. We aim at using a large observational sample and a possibly much smaller experimental sample. The typical use case we have in mind is of a user who wishes to estimate CATE and has a relatively large observational sample that covers their population of interest. This observational sample might suffer from hidden confounding, as all observational data will to some extent, but they also have a smaller sample from an experiment, albeit one that might not directly reflect their population of interest. For example, consider the Women's Health Initiative [Ros02], where there was a big previous observational study and a smaller RCT to study hormone replacement therapy. 
The studies ended up with opposite results, and there is intense discussion about confounding and external validity: the RCT was limited due to covering a fundamentally different (healthier and younger) population compared with the observational study [HAL+08, Van09].
Differently from previous work on estimating CATE from observational data, our approach does not assume that all confounders have been measured, and we only assume that the support of the experimental study has some overlap with the support of the observational study. The major assumption we do make is that we can learn the structure of the hidden confounding by comparing the observational and experimental samples. Specifically, rather than assuming that the effects themselves have a parametric structure (a questionable assumption that is bound to lead to dangerous extrapolation from small experiments), we only assume that this hidden confounding function has a parametric structure that we can extrapolate. Thus we limit ourselves to a parametric correction of a possibly complex effect function learned on the observational data. We discuss why this assumption may be reasonable. Specifically, as long as the parametric family includes the zero function, this assumption is strictly weaker than assuming that all confounders in the observational study have been observed. One way to view our approach is that we bring together an unbiased but high-variance estimator from the RCT (possibly infinite-variance when the RCT has zero overlap with the target population) and a biased but low-variance estimator from the observational study. This achieves a consistent (vanishing bias and variance) CATE estimator. Finally, we run experiments on both simulation and real-world data and show our method outperforms the standard approaches to this problem. 
In particular, we use data from a large-scale RCT measuring the effect of small classrooms and teacher's aides [WJB+90, Kru99] to obtain ground-truth estimates of causal effects, which we then try to reproduce from a confounded observational study.

2 Setup

We focus on studying a binary treatment, which we interpret as the presence or absence of an intervention of interest. To study its fine-grained effects on individuals, we consider having treatment-outcome data from two sources: an observational study that may be subject to hidden confounding, and an unconfounded study, typically coming from an experiment. The observational data consists of baseline covariates X^Conf_i ∈ R^d, assigned treatments T^Conf_i ∈ {0, 1}, and observed outcomes Y^Conf_i ∈ R for i = 1, ..., n^Conf. Similarly, the unconfounded data consists of X^Unc_i, T^Unc_i, Y^Unc_i for i = 1, ..., n^Unc.
Conceptually, we focus on the setting where (1) the observational data is of much larger scale, n^Unc ≪ n^Conf, and/or (2) the support of the unconfounded data, Support(X^Unc) = {x : P(‖X^Unc − x‖ ≤ δ) > 0 ∀δ > 0}, does not include the population about which we want to make causal conclusions and targeted interventions. This means that the observational data has both the scale and the scope we want, but the presence of confounding limits the study of causal effects, while the unconfounded experimental data has unconfoundedness but does not have the scale and/or scope necessary to study the individual-level effects of interest.
The unconfounded data usually comes from an RCT that was conducted on a smaller scale on a different population, as presented in the previous section. Alternatively, and equivalently for our formalism, it can arise from recognizing a latent unconfounded sub-experiment within the observational study. 
For example, we may have information from the data-generation process indicating that treatment for certain units was actually assigned purely as a (possibly stochastic) function of the observed covariates x. Two examples of this would be when certain prognoses dictate a strict rule-based treatment assignment, or situations of known equipoise after a certain prognosis, where there is no evidence guiding treatment one way or the other and its assignment is as if at random, based on the individual who ends up administering it. Regardless of whether the unconfounded data came from a secondary RCT (more common) or from within the observational dataset, our mathematical setup remains the same.
Formally, we consider each dataset to consist of iid draws from two different super-populations, indicated by the event E taking either the value E^Conf or E^Unc. The observational data are iid draws from the population given by conditioning on the event E^Conf: X^Conf_i, T^Conf_i, Y^Conf_i ~ (X, T, Y | E^Conf) iid. Similarly, X^Unc_i, T^Unc_i, Y^Unc_i ~ (X, T, Y | E^Unc) iid. Using potential outcome notation, and assuming the standard Stable Unit Treatment Value Assumption (SUTVA), which posits no interference and consistency between observed and potential outcomes, we let Y(0), Y(1) be the potential outcomes of administering each of the two treatments and Y = Y(T) = T·Y(1) + (1 − T)·Y(0). The quantity we are interested in is the Conditional Average Treatment Effect (CATE):
Definition 1 (CATE). Let τ(x) = E[Y(1) − Y(0) | X = x].
The key assumption we make about the unconfounded data is its unconfoundedness:
Assumption 1 (Unconfounded experiment).
(i) Y(0), Y(1) ⊥⊥ T | X, E^Unc;
(ii) Y(0), Y(1) ⊥⊥ E^Unc | X.
This assumption holds if the unconfounded data was generated in a randomized controlled trial. 
More generally, it is functionally equivalent to assuming that the unconfounded data was generated by running a logging policy on a contextual bandit: first, covariates are drawn from the unconfounded population X | E^Unc and revealed; then a treatment T is chosen; the outcomes are drawn based on the covariates, Y(0), Y(1) | X; but only the outcome corresponding to the chosen treatment, Y = Y(T), is revealed. The second part of the assumption means that merely being in the unconfounded study does not affect the potential outcomes conditioned on the covariates X. It implies that the functional relationship between the unobserved confounders and the potential outcomes is the same in both studies. This will fail if, for example, knowing you are part of a study causes you to react differently to the same treatment. We note that this assumption is strictly weaker than the standard ignorability assumption in observational studies. This assumption implies that for covariates within the domain of the experiment, we can identify the value of CATE using regression. Specifically, if x ∈ Support(X | E^Unc), that is, if P(‖X − x‖ ≤ δ | E^Unc) > 0 ∀δ > 0, then τ(x) = E[Y | T = 1, X = x, E^Unc] − E[Y | T = 0, X = x, E^Unc], where E[Y | T = t, X = x, E^Unc] can be identified by regressing observed outcomes on treatment and covariates in the unconfounded data. However, this identification of CATE is (i) limited to the restricted domain of the experiment and (ii) hindered by the limited amount of data available in the unconfounded sample. The hope is to overcome these obstacles using the observational data.
Importantly, however, the unconfoundedness assumption is not assumed to hold for the observational data, which may be subject to unmeasured confounding. 
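The regression-based identification of τ on the experiment's support described above can be sketched as follows; this is a minimal illustration using plain linear least squares as the regression method, with hypothetical variable names not taken from the authors' code:

```python
import numpy as np

def cate_on_support(x_unc, t_unc, y_unc, x_query):
    """Estimate tau(x) = E[Y | T=1, X=x, E_Unc] - E[Y | T=0, X=x, E_Unc]
    by regressing outcomes on covariates separately in each treatment arm.
    Only valid for x_query inside the support of the unconfounded sample."""
    def fit_arm(mask):
        # Linear least squares with an intercept, restricted to one arm.
        A = np.c_[np.ones(mask.sum()), x_unc[mask]]
        coef, *_ = np.linalg.lstsq(A, y_unc[mask], rcond=None)
        return coef
    b1, b0 = fit_arm(t_unc == 1), fit_arm(t_unc == 0)
    Aq = np.c_[np.ones(len(x_query)), x_query]
    return Aq @ (b1 - b0)
```

Any consistent regression learner can replace the least-squares fits; the point is only that, under Assumption 1, the arm-wise regression difference identifies τ on the experiment's support.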
That is, both selection into the observational study and the selection of the treatment may be confounded with the potential outcomes of either treatment. Let us denote the difference in conditional average outcomes in the observational data by
ω(x) = E[Y | T = 1, X = x, E^Conf] − E[Y | T = 0, X = x, E^Conf].
Note that due to confounding factors, ω(x) need not equal τ(x) for any x, whether in the support of the observational study or not. The difference between these two quantities is precisely the confounding effect, which we denote
η(x) = τ(x) − ω(x).
Another way to express this term is:
η(x) = {E[Y(1) | x] − E[Y(1) | x, T = 1]} − {E[Y(0) | x] − E[Y(0) | x, T = 0]}.
Note that if the observational study were unconfounded, then we would have η(x) = 0. Further note that the vast majority of the methodological literature simply assumes η(x) ≡ 0, even though it is widely acknowledged that this assumption is unrealistic and at best an approximation.
Example. In order to better understand the function η(x), consider the following case. Assume there are two equally likely types of patients, "dutiful" and "negligent". Dutiful patients take care of their general health and are more likely to seek treatment, while negligent patients do not. Assume T = 1 is a medical treatment that requires the patient to see a physician, do lab tests, and obtain a prescription if indeed needed, while T = 0 means no treatment. Let Y be some measure of health, say blood pressure. In this scenario, where patients are (to a certain degree) self-selected into treatment, we would expect both potential outcomes to be greater for the treated than for the control: E[Y(1) | T = 1] > E[Y(1) | T = 0] and E[Y(0) | T = 1] > E[Y(0) | T = 0]. 
Since E[Y(1)] = E[Y(1) | T = 1] p(T = 1) + E[Y(1) | T = 0] p(T = 0), we also have that E[Y(1)] < E[Y(1) | T = 1] unless p(T = 1) = 1, and E[Y(0)] > E[Y(0) | T = 0] unless p(T = 0) = 1. Taken together, this shows that in the above scenario we expect η < 0 if we have not measured any X. This logic carries through in the plausible scenario where we have measured some X, but do not have access to all the variables that would allow us to tell apart "dutiful" from "negligent" patients. To sum up, this example shows that in cases where units are selected into treatment such that those more likely to be treated are those whose potential outcomes are higher (resp. lower) anyway, we can expect η to be negative (resp. positive).

3 Method

Given data from both the unconfounded and confounded studies, we propose the following recipe for removing the hidden confounding. First, we learn a function ω̂ over the observational sample {(X^Conf_i, T^Conf_i, Y^Conf_i)}_{i=1}^{n^Conf}. This can be done using any CATE estimation method, such as learning two regression functions for the treated and control and taking their difference, or specially constructed methods such as Causal Forest [WA17]. Since we assume this sample has hidden confounding, ω is not equal to the true CATE, and correspondingly ω̂ does not estimate the true CATE. We then learn a correction term which interpolates between ω̂ evaluated on the RCT samples X^Unc_i and the RCT outcomes Y^Unc_i. This correction term for hidden confounding is our estimate of η. 
The correction term allows us to extrapolate τ over the confounded sample, using the identity τ(X) = ω(X) + η(X).
Note that we could not have gone the other way around: if we were to start by estimating τ over the unconfounded sample and then estimate η using the samples from the confounded study, we would end up constructing an estimate of ω(x), which is not the quantity of interest. Moreover, doing so would be difficult, as the unconfounded sample is not expected to cover the confounded one.
Specifically, the way we use the RCT samples relies on a simple identity. Let e^Unc(x) = P(T = 1 | X = x, E^Unc) be the propensity score on the unconfounded sample. If this sample is an RCT, then typically e^Unc(x) = q for some constant, often q = 0.5. Let
q(X^Unc_i) = T^Unc_i / e^Unc(X^Unc_i) − (1 − T^Unc_i) / (1 − e^Unc(X^Unc_i))
be a signed re-weighting function.
Lemma 1. We have:
E[q(X^Unc_i) Y^Unc_i | X^Unc_i, E^Unc] = τ(X^Unc_i).   (1)
What Lemma 1 shows us is that q(X^Unc_i) Y^Unc_i is an unbiased estimate of τ(X^Unc_i). We now use this fact to learn η as follows. Let
θ̂ = argmin_θ Σ_{i=1}^{n^Unc} ( q(X^Unc_i) Y^Unc_i − ω̂(X^Unc_i) − θ^⊤ X^Unc_i )²,   (2)
τ̂(x) = ω̂(x) + θ̂^⊤ x.   (3)
The method is summarized in Algorithm 1.
Let us contrast our approach with two existing ones. The first is to simply learn the treatment effect function directly from the unconfounded data and extrapolate it to the observational sample. This is guaranteed to be unconfounded, and with a large enough unconfounded sample the CATE function can be learned [CHIM08, Pea15]. This approach is presented, for example, by [BP13] for the ATE, as the transport formula. 
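A minimal sketch of the two-step correction in Eqs. (2)-(3), assuming the propensity score on the unconfounded sample is known and taking an already-fitted callable `omega_hat` as the confounded-data estimate; all names here are illustrative, not from the authors' code:

```python
import numpy as np

def signed_reweight(y, t, e):
    # q(X) * Y with q(X) = T/e(X) - (1-T)/(1-e(X)):
    # an unbiased signal for tau(X) by Lemma 1.
    return (t / e - (1 - t) / (1 - e)) * y

def fit_theta(x_unc, t_unc, y_unc, e_unc, omega_hat):
    # Eq. (2): least squares of q(X)Y - omega_hat(X) against X
    # to estimate the linear confounding coefficients theta.
    target = signed_reweight(y_unc, t_unc, e_unc) - omega_hat(x_unc)
    theta, *_ = np.linalg.lstsq(x_unc, target, rcond=None)
    return theta

def tau_hat(x, omega_hat, theta):
    # Eq. (3): confounded estimate plus extrapolated linear correction.
    return omega_hat(x) + x @ theta
```

Any base learner can supply `omega_hat`; the correction step itself is just a linear least-squares fit on the signed-reweighted RCT outcomes.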
However, extending this approach to CATE in our case is not as straightforward. The reason is that we assume that the confounded study does not fully overlap with the unconfounded study, which requires extrapolating the estimated CATE function into a region of sample space outside the region where it was fit. This requires strong parametric assumptions about the CATE function.

Algorithm 1 Remove hidden confounding with unconfounded sample
1: Input: Unconfounded sample with propensity scores D^Unc = {(X^Unc_i, T^Unc_i, Y^Unc_i, e^Unc(X^Unc_i))}_{i=1}^{n^Unc}. Confounded sample D^Conf = {(X^Conf_i, T^Conf_i, Y^Conf_i)}_{i=1}^{n^Conf}. Algorithm Q for fitting CATE.
2: Run Q on D^Conf, obtain CATE estimate ω̂.
3: Let θ̂ be the solution of the optimization problem in Equation (2).
4: Set function τ̂(x) := ω̂(x) + θ̂^⊤ x.
5: Return: τ̂, an estimate of CATE over D^Conf.

On the other hand, we do have samples from the target region; they are simply confounded. One way to view our approach is that we move the extrapolation a step back: instead of extrapolating the CATE function, we merely extrapolate a correction due to hidden confounding. In the case that the CATE function does actually extrapolate well, we do no harm: we learn η ≈ 0.
The second alternative relies on re-weighting the RCT population so as to make it similar to the target, observational population [SCBL11, HGRS15, AO17]. These approaches suffer from two important drawbacks from our point of view: (i) they assume the observational study has no unmeasured confounders, which is often an unrealistic assumption; (ii) they assume that the support of the observational study is contained within the support of the experimental study, which again is unrealistic, as the experimental studies are often smaller and on somewhat different populations. 
If we were to apply these approaches to our case, we would be re-weighting by the inverse of weights that are close to, or even identical to, zero.

4 Theoretical guarantee

We prove that under conditions of parametric identification of η, Algorithm 1 recovers a consistent estimate of τ(x) over E^Conf, at a rate which is governed by the rate of estimating ω by ω̂. For the sake of clarity, we focus on a linear specification of η. Other parametric specifications can easily be accommodated given that the appropriate identification criteria hold (for the linear case, this is the non-singularity of the design matrix). Note that this result is strictly stronger than results about CATE identification which rely on ignorability: what enables the improvement is of course the presence of the unconfounded sample E^Unc. Also note that this result is strictly stronger than the transport formula [BP13] and re-weighting approaches such as [AO17].
Theorem 1. Suppose
1. ω̂ is a consistent estimator on the observational data (on which it is trained): E[(ω̂(X) − ω(X))² | E^Conf] = O(r(n)) for r(n) = o(1);
2. the covariates in the confounded data cover those in the unconfounded data (strong one-way overlap): ∃κ > 0 : P(E^Unc | X) ≤ κ P(E^Conf | X);
3. η is linear: ∃θ₀ : η(x) = θ₀^⊤ x;
4. θ₀ is identifiable: E[XX^⊤ | E^Unc] is non-singular;
5. X, Y, and ω̂(X) have finite fourth moments in the experimental data: E[‖X‖₂⁴ | E^Unc] < ∞, E[Y⁴ | E^Unc] < ∞, E[ω̂(X)⁴ | E^Unc] < ∞;
6. there is strong overlap between treatments in the unconfounded data: ∃ν > 0 : ν ≤ e^Unc(X) ≤ 1 − ν.
Then θ̂ is consistent,
‖θ̂ − θ₀‖₂² = O_p(r(n) + 1/n),
and τ̂ is consistent on its target population,
E[(τ̂(X) − τ(X))² | E^Conf] = O_p(r(n) + 1/n).
There are a few things to note about the result and its conditions. First, if the so-called confounded observational sample is in fact unconfounded, then we immediately get that the linear specification of η is correct with θ₀ = 0, because we simply have η(x) = 0. Therefore, our conditions are strictly weaker than imposing unconfoundedness on the observational data.
Condition 1 requires that our base method for learning ω is consistent simply as a regression method. There are a few ways to guarantee this. For example, if we fit ω̂ by empirical risk minimization on weighted outcomes over a function class of finite capacity (such as a VC class), or if we fit it as the difference of two regression functions, each fit by empirical risk minimization on observed outcomes in its treatment group, then standard results in statistical learning [BM02] ensure consistency of the L2 risk and therefore the L2 convergence required in condition 1. Alternatively, any method for learning CATE that would have been consistent for CATE under unconfoundedness is still consistent for ω when applied here. Therefore we can also rely on base methods such as causal forests [WA17] and other methods that target CATE as inputs to our method, even if they do not actually learn CATE here due to confounding.
Condition 2 captures our understanding of the observational dataset having a larger scope than the experimental dataset. The condition essentially requires a strong form of absolute continuity between the two covariate distributions. 
This condition could potentially be relaxed so long as there is enough intersection where we can learn η. For example, if there is a subset of the experiment that the observational data covers, that would be sufficient, so long as we can also ensure that condition 4 remains valid on that subset so that we can learn the sufficient parameters for η.
Condition 3, the linear specification of η, can be replaced with another one so long as it has finitely many parameters and they can be identified on the experimental dataset, i.e., condition 4 above would change appropriately.
Since unconfoundedness implies η = 0, whenever the parametric specification of η contains the zero function (e.g., as in the linear case above, since θ₀ = 0 is allowed), condition 3 is strictly weaker than assuming unconfoundedness. In that sense, our method can consistently estimate CATE on a population where no experimental data exists under weaker conditions than existing methods, which assume the observational data is unconfounded.
Condition 5 is trivially satisfied whenever outcomes and covariates are bounded. Similarly, we would expect that if the first two parts of condition 5 hold (about X and Y), then the last one, about ω̂, would also hold, as ω̂ is predicting outcomes Y. That is, the last part of condition 5 is essentially a requirement that our ω̂-learner base method not do anything strange, like adding unnecessary noise to Y and thereby making it have fewer moments. For all base methods that we consider, this comes for free, because they only average outcomes Y. We also note that if we impose the existence of even higher moments, as well as pointwise asymptotic normality of ω̂, one can easily transform the result into an asymptotic normality result. 
Standard error estimates will in turn require a variance estimate of ω̂.
Finally, we note that condition 6, which requires strong overlap, only needs to hold in the unconfounded sample. This is important, as it would be a rather strong requirement in the confounded sample, where treatment choices may depend on high-dimensional variables [DDF+17], but it is a weak condition for the experimental data. Specifically, if the unconfounded sample arose from an RCT, then propensities would be constant and the condition would hold trivially.

5 Experiments

In order to illustrate the validity and usefulness of our proposed method, we conduct simulation experiments and experiments with real-world data taken from the Tennessee STAR study: a large, long-term school study where students were randomized to different types of classes [WJB+90, Kru99].

5.1 Simulation study

We generate data simulating a situation where there exists an unconfounded dataset and a confounded dataset, with only partial overlap. Let X ∈ R be a measured covariate, T ∈ {0, 1} a binary treatment assignment, U ∈ R an unmeasured confounder, and Y ∈ R the outcome. We are interested in τ(X). We generate the unconfounded sample as follows: X^Unc ~ Uniform[−1, 1], U^Unc ~ N(0, 1), T^Unc ~ Bernoulli(0.5). 
We generate the confounded sample as follows: we first sample T^Conf ~ Bernoulli(0.5) and then sample X^Conf, U^Conf from a bivariate Gaussian
(X^Conf, U^Conf) | T^Conf ~ N( [0, 0], [[1, T^Conf − 0.5], [T^Conf − 0.5, 1]] ).
This means that X^Conf, U^Conf come from a Gaussian mixture model where T^Conf denotes the mixture component and the components have equal means but different covariance structures. This also implies that η is linear.
For both datasets we set Y = 1 + T + X + 2·T·X + 0.5X² + 0.75·T·X² + U + 0.5ε, where ε ~ N(0, 1). The true CATE is therefore τ(X) = 0.75X² + 2X + 1. We have that the true ω = τ + E[U | X, T = 1] − E[U | X, T = 0], which leads to the true η(x) = −x. We then apply our method (with a Causal Forest base) to learn η. We plot (see Figure 1) the true and recovered η with our method. Even with the limited unconfounded set (between −1 and 1) making the full scope of the X² term in Y inaccessible, we are able to reasonably estimate τ. Other methods would suffer under the strong unobserved confounding.
Figure 1: True and predicted τ and η for an unconfounded sample with limited overlap with the confounded sample: unconfounded samples are limited to [−1, 1] (blue shaded region); confounded samples lie in [−3, 3]; the predicted ω̂ is from a difference of regressions on treated and control y.

5.2 Real-world data

Validating causal-inference methods is hard because we almost never have access to true counterfactuals. We approach this challenge by using data from a randomized controlled trial, the Tennessee STAR study [WJB+90, Kru99, MISN18]. When using an RCT, we have access to unbiased CATE estimates because we are guaranteed unconfoundedness. 
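The simulation above can be reproduced in a few lines; below is a sketch under the stated data-generating process, with a simple per-arm quadratic fit standing in for the Causal Forest base learner (an assumption, not the authors' implementation). Note that with this covariance specification, E[U | X, T] = (T − 0.5)X, so the implied confounding function works out to η(x) = −x and the fitted coefficient should be close to −1:

```python
import numpy as np

rng = np.random.default_rng(0)

def outcome(x, t, u):
    # Y = 1 + T + X + 2TX + 0.5 X^2 + 0.75 T X^2 + U + 0.5 eps
    return (1 + t + x + 2*t*x + 0.5*x**2 + 0.75*t*x**2 + u
            + 0.5 * rng.normal(size=len(x)))

# Confounded sample: (X, U) | T bivariate normal with correlation T - 0.5.
n = 200_000
t_c = rng.integers(0, 2, n)
rho = t_c - 0.5
x_c = rng.normal(size=n)
u_c = rho * x_c + np.sqrt(1 - rho**2) * rng.normal(size=n)
y_c = outcome(x_c, t_c, u_c)

# Unconfounded sample: X ~ Uniform[-1, 1], U ~ N(0, 1), T ~ Bernoulli(0.5).
m = 200_000
x_u = rng.uniform(-1, 1, m)
t_u = rng.integers(0, 2, m)
y_u = outcome(x_u, t_u, rng.normal(size=m))

# Step 1: omega_hat from per-arm quadratic fits on the confounded data.
c1 = np.polyfit(x_c[t_c == 1], y_c[t_c == 1], 2)
c0 = np.polyfit(x_c[t_c == 0], y_c[t_c == 0], 2)
omega_hat = lambda z: np.polyval(c1, z) - np.polyval(c0, z)

# Step 2: linear correction from the RCT via the signed re-weighting (e = 0.5).
q = 4 * t_u - 2                         # T/0.5 - (1-T)/0.5
target = q * y_u - omega_hat(x_u)
theta = (x_u @ target) / (x_u @ x_u)    # 1-d least squares through the origin
tau_hat = lambda z: omega_hat(z) + theta * z
```

Even though the unconfounded covariates are restricted to [−1, 1], the corrected estimate extrapolates over the wider confounded support because only the linear correction, not the full CATE, is learned from the RCT.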
We then artificially introduce confounding by selectively removing a biased subset of samples.
The data: The Tennessee Student/Teacher Achievement Ratio (STAR) experiment is a randomized experiment started in 1985 to measure the effect of class size on student outcomes, measured by standardized test scores. The experiment started monitoring students in kindergarten and followed students until third grade. Students and teachers were randomly assigned into conditions during the first school year, with the intention for students to continue in their class-size condition for the entirety of the experiment. We focus on two of the experiment conditions: small classes (13-17 pupils) and regular classes (22-25 pupils). Since many students only started the study at first grade, we took as treatment their class type at first grade. Overall we have 4509 students with treatment assignment at first grade. The outcome Y is the sum of the listening, reading, and math standardized tests at the end of first grade. After removing students with missing outcomes¹, we are left with a randomized sample of 4218 students: 1805 assigned to treatment (small class, T = 1) and 2413 to control (regular-size class, T = 0). In addition to treatment and outcome, we used the following covariates for each student: gender, race, birth month, birthday, birth year, free lunch given or not, teacher id. Our goal is to compute the CATE conditioned on this set of covariates, jointly denoted X.
¹The correlation between missing outcome and treatment assignment is R² < 10⁻⁴.
Figure 2: RMSE of estimating Y^GT_i on a held-out evaluation subset of ALL \ UNC, for varying sizes of the unconfounded subset. RF and ridge stand for Random Forest and Ridge Regression, respectively. "2 step" is our method. [RF/ridge] YGT is regression directly on Y^GT_i. [RF/ridge] DIFF is the difference between the predictions of models fit on treated and control separately. UNC or CONF in parentheses indicates which subset of the data was used for regression.
Computing ground-truth CATE: The STAR RCT allows us to obtain an unbiased estimate of the CATE. Specifically, we use the identity in Eq. (1) and the fact that in the study the propensity scores e(X_i) were constant, q = p(T = 1). We define a ground-truth sample {(X_i, Y^GT_i)}_{i=1}^n, where Y^GT_i = Y_i / (q + T_i − 1). By Eq. (1) we know that E[Y^GT_i | X_i] = τ(X_i) within the STAR study.
Introducing hidden confounding: Now that we have the ground-truth CATE, we wish to emulate the scenario which motivates our work. We split the entire dataset (ALL) into a small unconfounded subset (UNC) and a larger, confounded subset (CONF) over a somewhat different population. We do this by splitting the population over a variable which is known to be a strong determinant of outcome [Kru99]: rural or inner-city (2811 students) vs. urban or suburban (1407 students).
We generate UNC by randomly sampling a fraction q′ of the rural or inner-city students, where q′ ranges from 0.1 to 0.5. Over this sample, we know that treatment assignment was at random.
When generating CONF, we wish to achieve two goals: (a) the support of CONF should have only partial overlap with the support of UNC, and (b) treatment assignment should be confounded, i.e., the treated and control populations should be systematically different in their potential outcomes. In order to achieve these goals, we generate CONF as follows: From the rural or inner-city students, we take the controls (T = 0) that were not sampled in UNC, and only the treated (T = 1) whose outcomes were in the lower half of outcomes among treated rural or inner-city students. 
From the urban or suburban students, we take all of the controls and only those treated whose outcomes were in the lower half of outcomes among treated urban or suburban students.

This procedure results in UNC and CONF populations that do not fully overlap: UNC has only rural or inner-city students, while CONF has a substantial subset (roughly one half for q' = 0.5) of urban and suburban students. It also creates confounding, by selectively removing the students with higher scores from the treated population. This biases the naive treatment effect estimates downward. We further complicate matters by dropping the covariate indicating rural, inner-city, urban, or suburban from all subsequent analysis. Therefore, we have significant unmeasured confounding in the CONF population, and also the unconfounded ground truth in the original ALL population.

Metric: In our experiments, we assume we have access to samples from UNC and CONF. We use either UNC, CONF, or both to fit various models for predicting CATE. We then evaluate, in terms of RMSE, how well the CATE predictions match Y_i^GT on a held-out sample from ALL \ UNC (the set ALL minus the set UNC). Note that we are not evaluating on CONF, but on the unconfounded version of CONF, which is exactly ALL \ UNC. The reason we do not evaluate on ALL is twofold: first, it would only make the task easier because of the nature of the UNC set; second, we are motivated by the scenario where we have a confounded observational study representing the target population of interest and wish to be aided by a separate unconfounded study (typically an RCT) available for a different population.
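The UNC/CONF construction described above can be sketched as follows. This is a hedged reading of the procedure; the column names region, T, and Y are hypothetical stand-ins for the actual STAR variables, and we assume treated units sampled into UNC do not also appear in CONF:

```python
import pandas as pd

def split_unc_conf(df, q_prime, seed=0):
    """Split a randomized sample into an unconfounded subset (UNC) and a
    deliberately confounded subset (CONF), following the recipe in the text."""
    rural = df[df['region'] == 'rural']   # rural or inner-city students
    urban = df[df['region'] == 'urban']   # urban or suburban students

    # UNC: a random fraction q' of the rural/inner-city students (still an RCT).
    unc = rural.sample(frac=q_prime, random_state=seed)

    def confound(group, treated_median):
        # Keep all controls, but only treated units with below-median outcomes;
        # this removes high-scoring treated students and biases estimates down.
        treated = group[(group['T'] == 1) & (group['Y'] <= treated_median)]
        return pd.concat([group[group['T'] == 0], treated])

    conf = pd.concat([
        confound(rural.drop(unc.index), rural.loc[rural['T'] == 1, 'Y'].median()),
        confound(urban, urban.loc[urban['T'] == 1, 'Y'].median()),
    ])
    # Drop the splitting covariate, so the confounding becomes hidden.
    return unc, conf.drop(columns=['region'])
```

Note the use of df['T'] rather than df.T, which pandas reserves for transposition.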
We focus on a held-out set in order to avoid giving too much of an advantage to methods which can simply fit the observed outcomes well.

Baselines: As baselines we fit the CATE using standard methods on either the UNC set or the CONF set. Fitting on the UNC set is essentially a CATE version of applying the transport formula [PB14]. Fitting on the CONF set amounts to assuming ignorability (which is wrong in this case) and using standard methods. The methods we use to estimate the CATE are: (i) a regression method fit on Y_i^GT over UNC; (ii) a regression method fit separately on treated and control in CONF; (iii) a regression method fit separately on treated and control in UNC. The regression methods we use in (i)-(iii) are Random Forest with 200 trees and Ridge Regression with cross-validation. In baselines (ii) and (iii), the CATE is estimated as the difference between the prediction of the model fit on the treated and the prediction of the model fit on the control. We also experimented extensively with Causal Forest [WA17], but found it to uniformly perform worse than the other methods, even when given unfair advantages such as access to the entire dataset (ALL).

Results: Our two-step method requires a method for fitting ω̂ on the confounded dataset. We experiment with two such methods, which parallel those used as baselines: a regression method fit separately on treated and control in CONF, using either Random Forest with 200 trees or Ridge Regression with cross-validation. We see that our methods, 2-step RF and 2-step ridge, consistently produce more accurate estimates than the baselines.
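To make the two-step procedure concrete, here is a minimal sketch under stated assumptions: the function and variable names are ours (not from the paper's code), the correction term is taken to be linear in X, and the propensity in the unconfounded sample is a known constant q, as in STAR:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

def two_step_cate(X_conf, T_conf, Y_conf, X_unc, T_unc, Y_unc, q):
    """Two-step CATE estimator sketch.
    Step 1: fit a possibly-biased CATE estimate omega-hat on the confounded
    data, as the difference of outcome regressions on treated and control.
    Step 2: on the unconfounded data, form the transformed outcome Y_GT and
    regress the residual Y_GT - omega-hat(X) on X to learn a correction."""
    # Step 1: difference of regressions on CONF (biased by hidden confounding).
    f1 = RandomForestRegressor(n_estimators=200, random_state=0)
    f0 = RandomForestRegressor(n_estimators=200, random_state=0)
    f1.fit(X_conf[T_conf == 1], Y_conf[T_conf == 1])
    f0.fit(X_conf[T_conf == 0], Y_conf[T_conf == 0])
    omega = lambda X: f1.predict(X) - f0.predict(X)

    # Step 2: transformed outcome on UNC (constant propensity q), then a
    # linear fit of the residual gives the correction term.
    Y_gt = Y_unc / (q + T_unc - 1)
    eta = LinearRegression().fit(X_unc, Y_gt - omega(X_unc))
    return lambda X: omega(X) + eta.predict(X)
```

The small unconfounded sample only has to identify the low-dimensional correction, not the full CATE function, which is the intuition for why a limited experiment can ground a rich observational model.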
We see that our methods are able, in particular, to make use of larger unconfounded sets to produce better estimates of the CATE function. See Figure 2 for the performance of our method vs. the various baselines.

6 Discussion

In this paper we address a scenario that is becoming more and more common: users with large observational datasets who wish to extract causal insights from their data with the help of unconfounded experiments on different populations. One direction for future work is combining the current work with work that looks explicitly into the causal graph connecting the covariates, including unmeasured ones [TT15, MMC16]. Another direction includes cases where the outcomes or interventions are not directly comparable, but where the difference can be modeled. For example, experimental studies often only study short-term outcomes, whereas the observational study might track long-term outcomes which are of more interest [ACIK16].

Acknowledgements

We wish to thank the anonymous reviewers for their helpful suggestions and comments. (NK) This material is based upon work supported by the National Science Foundation under Grant No. 1656996.

References

[ACIK16] Susan Athey, Raj Chetty, Guido Imbens, and Hyunseung Kang. Estimating treatment effects using multiple surrogates: The role of the surrogate score and the surrogate index. arXiv preprint arXiv:1603.09326, 2016.

[AO17] Isaiah Andrews and Emily Oster. Weighting for external validity. Technical report, National Bureau of Economic Research, 2017.

[BM02] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463-482, 2002.

[BP13] Elias Bareinboim and Judea Pearl. A general algorithm for deciding transportability of experimental results.
Journal of Causal Inference, 1(1):107-134, 2013.

[CHIM08] Richard K. Crump, V. Joseph Hotz, Guido W. Imbens, and Oscar A. Mitnik. Nonparametric tests for treatment effect heterogeneity. The Review of Economics and Statistics, 90(3):389-405, 2008.

[DDF+17] Alexander D'Amour, Peng Ding, Avi Feller, Lihua Lei, and Jasjeet Sekhon. Overlap in observational studies with high-dimensional covariates. arXiv preprint arXiv:1711.02582, 2017.

[HAL+08] Miguel A. Hernán, Alvaro Alonso, Roger Logan, Francine Grodstein, Karin B. Michels, Meir J. Stampfer, Walter C. Willett, JoAnn E. Manson, and James M. Robins. Observational studies analyzed like randomized experiments: An application to postmenopausal hormone therapy and coronary heart disease. Epidemiology (Cambridge, Mass.), 19(6):766, 2008.

[HGRS15] Erin Hartman, Richard Grieve, Roland Ramsahai, and Jasjeet S. Sekhon. From sample average treatment effect to population average treatment effect on the treated: Combining experimental with observational studies to estimate population treatment effects. Journal of the Royal Statistical Society: Series A (Statistics in Society), 178(3):757-778, 2015.

[Kru99] Alan B. Krueger. Experimental estimates of education production functions. The Quarterly Journal of Economics, 114(2):497-532, 1999.

[MISN18] Edward McFowland III, Sriram Somanchi, and Daniel B. Neill. Efficient discovery of heterogeneous treatment effects in randomized experiments via anomalous pattern detection. arXiv preprint arXiv:1803.09159, 2018.

[MMC16] Joris M. Mooij, Sara Magliacane, and Tom Claassen. Joint causal inference from multiple datasets. arXiv preprint arXiv:1611.10351, 2016.

[PB14] Judea Pearl and Elias Bareinboim. External validity: From do-calculus to transportability across populations. Statistical Science, 29(4):579-595, 2014.

[Pea09] Judea Pearl. Causality. Cambridge University Press, 2009.

[Pea15] Judea Pearl.
Detecting latent heterogeneity. Sociological Methods & Research, page 0049124115600597, 2015.

[Ros02] Jacques E. Rossouw, Writing Group for the Women's Health Initiative Investigators. Risks and benefits of estrogen plus progestin in healthy postmenopausal women: Principal results from the Women's Health Initiative randomized controlled trial. JAMA, 288:321-333, 2002.

[Rot05] Peter M. Rothwell. External validity of randomised controlled trials: "To whom do the results of this trial apply?". The Lancet, 365(9453):82-93, 2005.

[SCBL11] Elizabeth A. Stuart, Stephen R. Cole, Catherine P. Bradshaw, and Philip J. Leaf. The use of propensity scores to assess the generalizability of results from randomized trials. Journal of the Royal Statistical Society: Series A (Statistics in Society), 174(2):369-386, 2011.

[TT15] Sofia Triantafillou and Ioannis Tsamardinos. Constraint-based causal discovery from multiple interventions over overlapping variable sets. Journal of Machine Learning Research, 16:2147-2205, 2015.

[Van09] Jan P. Vandenbroucke. The HRT controversy: Observational studies and RCTs fall in line. The Lancet, 373(9671):1233-1235, 2009.

[WA17] Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, (just-accepted), 2017.

[WJB+90] Elizabeth Word, John Johnston, Helen P. Bain, B. D. Fulton, Jayne B. Zaharias, Charles M. Achilles, Martha N. Lintz, John Folger, and Carolyn Breda. The State of Tennessee's Student/Teacher Achievement Ratio (STAR) Project. Tennessee Board of Education, 1990.