{"title": "Kernel Instrumental Variable Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 4593, "page_last": 4605, "abstract": "Instrumental variable (IV) regression is a strategy for learning causal relationships in observational data. If measurements of input X and output Y are confounded, the causal relationship can nonetheless be identified if an instrumental variable Z is available that influences X directly, but is conditionally independent of Y given X and the unmeasured confounder. The classic two-stage least squares algorithm (2SLS) simplifies the estimation problem by modeling all relationships as linear functions. We propose kernel instrumental variable regression (KIV), a nonparametric generalization of 2SLS, modeling relations among X, Y, and Z as nonlinear functions in reproducing kernel Hilbert spaces (RKHSs). We prove the consistency of KIV under mild assumptions, and derive conditions under which convergence occurs at the minimax optimal rate for unconfounded, single-stage RKHS regression. In doing so, we obtain an efficient ratio between training sample sizes used in the algorithm's first and second stages. In experiments, KIV outperforms state of the art alternatives for nonparametric IV regression.", "full_text": "Kernel Instrumental Variable Regression\n\nRahul Singh\nMIT Economics\n\nrahul.singh@mit.edu\n\nManeesh Sahani\nGatsby Unit, UCL\n\nmaneesh@gatsby.ucl.ac.uk\n\nArthur Gretton\nGatsby Unit, UCL\n\narthur.gretton@gmail.com\n\nAbstract\n\nInstrumental variable (IV) regression is a strategy for learning causal relationships\nin observational data. If measurements of input X and output Y are confounded,\nthe causal relationship can nonetheless be identi\ufb01ed if an instrumental variable\nZ is available that in\ufb02uences X directly, but is conditionally independent of Y\ngiven X and the unmeasured confounder. 
The classic two-stage least squares algorithm (2SLS) simplifies the estimation problem by modeling all relationships as linear functions. We propose kernel instrumental variable regression (KIV), a nonparametric generalization of 2SLS, modeling relations among X, Y, and Z as nonlinear functions in reproducing kernel Hilbert spaces (RKHSs). We prove the consistency of KIV under mild assumptions, and derive conditions under which convergence occurs at the minimax optimal rate for unconfounded, single-stage RKHS regression. In doing so, we obtain an efficient ratio between training sample sizes used in the algorithm's first and second stages. In experiments, KIV outperforms state of the art alternatives for nonparametric IV regression.\n\n1 Introduction\n\nInstrumental variable regression is a method in causal statistics for estimating the counterfactual effect of input X on output Y using observational data [60]. If measurements of (X, Y) are confounded, the causal relationship, also called the structural relationship, can nonetheless be identified if an instrumental variable Z is available, which is independent of Y conditional on X and the unmeasured confounder. Intuitively, Z only influences Y via X, identifying the counterfactual relationship of interest.\nEconomists and epidemiologists use instrumental variables to overcome issues of strategic interaction, imperfect compliance, and selection bias. The original application is demand estimation: supply cost shifters (Z) only influence sales (Y) via price (X), thereby identifying counterfactual demand even though prices reflect both supply and demand market forces [68, 11]. Randomized assignment of a drug (Z) only influences patient health (Y) via actual consumption of the drug (X), identifying the counterfactual effect of the drug even in the scenario of imperfect compliance [3]. 
Draft lottery number (Z) only influences lifetime earnings (Y) via military service (X), identifying the counterfactual effect of military service on earnings despite selection bias in enlistment [2].\nThe two-stage least squares algorithm (2SLS), widely used in economics, simplifies the IV estimation problem by assuming linear relationships: in stage 1, perform linear regression to obtain the conditional means x̄(z) := 𝔼_{X|Z=z}(X); in stage 2, linearly regress outputs Y on these conditional means. 2SLS works well when the underlying assumptions hold. In practice, neither the relation between Y and X nor the relation between X and Z need be linear.\nIn the present work, we introduce kernel instrumental variable regression (KIV), an easily implemented nonlinear generalization of 2SLS (Sections 3 and 4).1 In stage 1 we learn a conditional mean embedding, which is the conditional expectation µ(z) := 𝔼_{X|Z=z} ψ(X) of features ψ which map X to a reproducing kernel Hilbert space (RKHS) [56]. For a sufficiently rich RKHS, called a characteristic RKHS, the mean embedding of a random variable is injective [57]. It follows that the conditional mean embedding characterizes the full distribution of X conditioned on Z, and not just the conditional mean. We then implement stage 2 via kernel ridge regression of outputs Y on these conditional mean embeddings, following the two-stage distribution regression approach described by [64, 65]. As in our work, the inputs for [64, 65] are distribution embeddings. Unlike our case, the earlier work uses unconditional embeddings computed from independent samples.\n\n1 Code: https://github.com/r4hu1-5in9h/KIV\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nAs a key contribution of our work, we provide consistency guarantees for the KIV algorithm for an increasing number of training samples in stages 1 and 2 (Section 5). 
To establish stage 1 convergence, we note that the conditional mean embedding [56] is the solution to a regression problem [34, 35, 33], and thus equivalent to kernel dependency estimation [20, 21]. We prove that the kernel estimator of the conditional mean embedding (equivalently, the conditional expectation operator) converges in RKHS-norm, generalizing classic results by [53, 54]. We allow the conditional mean embedding RKHS to be infinite-dimensional, which presents specific challenges that we carefully address in our analysis. We also discuss previous approaches to establishing consistency in both finite-dimensional [35] and infinite-dimensional [56, 55, 31, 37, 20] settings.\nWe embed the stage 1 rates into stage 2 to get end-to-end guarantees for the two-stage procedure, adapting [14, 64, 65]. In particular, we provide a ratio of stage 1 to stage 2 samples required for minimax optimal rates in the second stage, where the ratio depends on the difficulty of each stage. We anticipate that these proof strategies will apply generally in two-stage regression settings.\n\n2 Related work\n\nSeveral approaches have been proposed to generalize 2SLS to the nonlinear setting, which we will compare in our experiments (Section 6). A first generalization is via basis function approximation [48], an approach called sieve IV, with uniform convergence rates in [17]. The challenge in [17] is how to define an appropriate finite dictionary of basis functions. In a second approach, [16, 23] implement stage 1 by computing the conditional distribution of the input X given the instrument Z using a ratio of Nadaraya-Watson density estimates. Stage 2 is then ridge regression in the space of square integrable functions. The overall algorithm has a finite sample consistency guarantee, assuming smoothness of the (X, Z) joint density in stage 1 and the regression in stage 2 [23]. 
Unlike our bound, [23] make no claim about the optimality of the result. Importantly, stage 1 requires the solution of a statistically challenging problem: conditional density estimation. Moreover, the analysis assumes that the same number of training samples is used in both stages. We will discuss this bound in more detail in Appendix A.2.1 (we suggest that the reader first cover Section 5).\nOur work also relates to kernel and IV approaches to learning dynamical systems, known in machine learning as predictive state representation models (PSRs) [12, 37, 26] and in econometrics as panel data models [1, 6]. In this setting, predictive states (expected future features given history) are updated in light of new observations. The calculation of the predictive states corresponds to stage 1 regression, and the states are updated via stage 2 regression. In the kernel case, the predictive states are expressed as conditional mean embeddings [12], as in our setting. Performance of the kernel PSR method is guaranteed by a finite sample bound [37, Theorem 2]; however, this bound is not minimax optimal. Whereas [37] assume an equal number of training samples in stages 1 and 2, we find that unequal numbers of training samples matter for minimax optimality. More importantly, the bound makes strong smoothness assumptions on the inputs to the stage 1 and stage 2 regression functions, rather than assuming smoothness of the regression functions as we do. We show that the smoothness assumptions on the inputs made in [37] do not hold in our setting, and we obtain stronger end-to-end bounds under more realistic conditions. We discuss the PSR bound in more detail in Appendix A.2.2.\nYet another recent approach is deep IV, which uses neural networks in both stages and permits learning even for complex high-dimensional data such as images [36]. Like [23], [36] implement stage 1 by estimating a conditional density. 
Unlike [23], [36] use a mixture density network [9, Section 5.6], i.e. a mixture model parametrized by a neural network on the instrument Z. Stage 2 is neural network regression, trained using stochastic gradient descent (SGD). This presents a challenge: each step of SGD requires expectations using the stage 1 model, which are computed by drawing samples and averaging. An unbiased gradient estimate requires two independent sets of samples from the stage 1 model [36, eq. 10], though a single set of samples may be used if an upper bound on the loss is optimized [36, eq. 11]. By contrast, our stage 1 outputs, conditional mean embeddings, have a closed form solution and exhibit lower variance than sample averaging from a conditional density model. No theoretical guarantee on the consistency of the neural network approach has been provided.\nIn the econometrics literature, a few key assumptions make learning a nonparametric IV model tractable. These include the completeness condition [48]: the structural relationship between X and Y can be identified only if the stage 1 conditional expectation is injective. Subsequent works impose additional stability and link assumptions [10, 19, 17]: the conditional expectation of a function of X given Z is a smooth function of Z. We adapt these assumptions to our setting, replacing the completeness condition with the characteristic property [57], and replacing the stability and link assumptions with the concept of prior [54, 14]. We describe the characteristic and prior assumptions in more detail below.\nExtensive use of IV estimation in applied economic research has revealed a common pitfall: weak instrumental variables. A weak instrument satisfies Hypothesis 1 below, but the relationship between a weak instrument Z and input X is negligible; Z is essentially irrelevant. In this case, IV estimation becomes highly erratic [13]. 
In [58], the authors formalize this phenomenon with local analysis. See [44, 61] for practical and theoretical overviews, respectively. We recommend that practitioners resist the temptation to use many weak instruments, and instead use few strong instruments such as those described in the introduction.\nFinally, our analysis connects early work on the RKHS with recent developments in the RKHS literature. In [46], the authors introduce the RKHS to solve known, ill-posed functional equations. In the present work, we introduce the RKHS to estimate the solution to an uncertain, ill-posed functional equation. In this sense, casting the IV problem in an RKHS framework is not only natural; it is in the original spirit of RKHS methods. For a comprehensive review of existing work and recent advances in kernel mean embedding research, we recommend [43, 32].\n\n3 Problem setting and definitions\n\nInstrumental variable: We begin by introducing our causal assumption about the instrument. This prior knowledge, described informally in the introduction, allows us to recover the counterfactual effect of X on Y. Let (X, B_X), (Y, B_Y), and (Z, B_Z) be measurable spaces. Let (X, Y, Z) be a random variable on X × Y × Z with distribution ρ.\nHypothesis 1. Assume\n1. Y = h(X) + e and 𝔼[e|Z] = 0\n2. ρ(x|z) is not constant in z\n\nWe call h the structural function of interest. The error term e is unmeasured, confounding noise. Hypothesis 1.1, known as the exclusion restriction, was introduced by [48] to the nonparametric IV literature for its tractability. Other hypotheses are possible, although a very different approach is then needed [40]. Hypothesis 1.2, known as the relevance condition, ensures that Z is actually informative. In Appendix A.1.1, we compare Hypothesis 1 with alternative formulations of the IV assumption.\nWe make three observations. 
First, if X = Z then Hypothesis 1 reduces to the standard regression assumption of unconfounded inputs, and h(X) = 𝔼[Y|X]; if X = Z then prediction and counterfactual prediction coincide. The IV model is a framework that allows for causal inference in a more general variety of contexts, namely when h(X) ≠ 𝔼[Y|X] so that prediction and counterfactual prediction are different learning problems. Second, Hypothesis 1 will permit identification of h even if inputs are confounded, i.e. X is not independent of e. Third, this model includes the scenario in which the analyst has a combination of confounded and unconfounded inputs. For example, in demand estimation there may be confounded price P, unconfounded characteristics W, and supply cost shifter C that instruments for price. Then X = (P, W), Z = (C, W), and the analysis remains the same.\nHypothesis 1 provides the operator equation 𝔼[Y|Z] = 𝔼_{X|Z} h(X) [48]. In the language of 2SLS, the LHS is the reduced form, while the RHS is a composition of stage 1 linear compact operator 𝔼_{X|Z} and stage 2 structural function h. In the language of functional analysis, the operator equation is a Fredholm integral equation of the first kind [46, 41, 48, 29]. Solving this operator equation for h involves inverting a linear compact operator with infinite-dimensional domain; it is an ill-posed problem [41]. To recover a well-posed problem, we impose smoothness and Tikhonov regularization.\nRKHS model: We next introduce our RKHS model. Let k_X : X × X → R and k_Z : Z × Z → R be measurable positive definite kernels corresponding to scalar-valued RKHSs H_X and H_Z. Denote the feature maps\n\nψ : X → H_X, x ↦ k_X(x, ·)\nφ : Z → H_Z, z ↦ k_Z(z, ·)\n\n[Figure 1: The RKHSs. Diagram relating Z, H_Z, X, H_X, and Y through the maps φ, ψ, µ ∈ H_Ξ, E^* ∈ H_Γ, h ∈ H_X, and H ∈ H_Ω.]\n\nDefine the conditional expectation operator E : H_X → H_Z such that [Eh](z) = 𝔼_{X|Z=z} h(X). 
E is the natural object of interest for stage 1. We define and analyze an estimator for E directly. The conditional expectation operator E conveys exactly the same information as another object popular in the kernel methods literature, the conditional mean embedding µ : Z → H_X defined by µ(z) = 𝔼_{X|Z=z} ψ(X) [56]. Indeed, µ(z) = E^*φ(z) where E^* : H_Z → H_X is the adjoint of E. Analogously, in 2SLS x̄(z) = π'z for stage 1 linear regression parameter π.\nThe structural function h : X → Y in Hypothesis 1 is the natural object of interest for stage 2. For theoretical purposes, it is convenient to estimate h indirectly. The structural function h conveys exactly the same information as an object we call the structural operator H : H_X → Y. Indeed, h(x) = Hψ(x). Analogously, in 2SLS h(x) = β'x for structural parameter β. We define and analyze an estimator for H, which in turn implies an estimator for h. Figure 1 summarizes the relationships among equivalent stage 1 objects (E, µ) and equivalent stage 2 objects (H, h).\nOur RKHS model for the IV problem is of the same form as the model in [45, 46, 47] for general operator equations. We begin by choosing RKHSs for the structural function h and the reduced form 𝔼[Y|Z], then construct a tensor-product RKHS for the conditional expectation operator E. Our model differs from the RKHS model proposed by [16, 23], which directly learns the conditional expectation operator E via Nadaraya-Watson density estimation. The RKHSs of [28, 16, 23] for the structural function h and the reduced form 𝔼[Y|Z] are defined from the right and left singular functions of E, respectively. They appear in the consistency argument, but not in the ridge penalty.\n\n4 Learning problem and algorithm\n\n2SLS consists of two stages that can be estimated separately. 
Sample splitting in this context means estimating stage 1 with n randomly chosen observations and estimating stage 2 with the remaining m observations. Sample splitting alleviates the finite sample bias of 2SLS when instrument Z weakly influences input X [4]. It is the natural approach when an analyst does not have access to a single data set with n + m observations of (X, Y, Z) but rather two data sets: n observations of (X, Z), and m observations of (Y, Z). We employ sample splitting in KIV, with an efficient ratio of (n, m) given in Theorem 4. In our presentation of the general two-stage learning problem, we denote stage 1 observations by (x_i, z_i) and stage 2 observations by (ỹ_i, z̃_i).\n\n4.1 Stage 1\nWe transform the problem of learning E into a vector-valued kernel ridge regression following [34, 33, 20], where the hypothesis space is the vector-valued RKHS H_Γ of operators mapping H_X to H_Z. In Appendix A.3, we review the theory of vector-valued RKHSs as it relates to scalar-valued RKHSs and tensor product spaces. The key result is that the tensor product space of H_X and H_Z is isomorphic to L_2(H_X, H_Z), the space of Hilbert-Schmidt operators from H_X to H_Z. If we choose the vector-valued kernel Γ with feature map (x, z) ↦ [φ(z) ⊗ ψ(x)](·) = φ(z)⟨ψ(x), ·⟩_{H_X}, then H_Γ = L_2(H_X, H_Z) and it shares the same norm.\nWe now state the objective for optimizing E ∈ H_Γ. The optimal E minimizes the expected discrepancy\n\nE_ρ = argmin ℰ_1(E), ℰ_1(E) = 𝔼_{(X,Z)} ‖ψ(X) - E^*φ(Z)‖^2_{H_X}\n\nBoth [33] and [20] refer to ℰ_1 as the surrogate risk. As shown in [34, Section 3.1] and [33], the surrogate risk upper bounds the natural risk for the conditional expectation, where the bound becomes tight when 𝔼_{X|Z=(·)} f(X) ∈ H_Z, ∀f ∈ H_X. Formally, the target operator is the constrained solution E_{H_Γ} = argmin_{E ∈ H_Γ} ℰ_1(E). We will assume E_ρ ∈ H_Γ so that E_ρ = E_{H_Γ}.\nNext we impose Tikhonov regularization. 
The regularized target operator and its empirical analogue are given by\n\nE_λ = argmin_{E ∈ H_Γ} ℰ_λ(E), ℰ_λ(E) = ℰ_1(E) + λ‖E‖^2_{L_2(H_X, H_Z)}\nE_λ^n = argmin_{E ∈ H_Γ} ℰ_λ^n(E), ℰ_λ^n(E) = (1/n) Σ_{i=1}^n ‖ψ(x_i) - E^*φ(z_i)‖^2_{H_X} + λ‖E‖^2_{L_2(H_X, H_Z)}\n\nOur construction of a vector-valued RKHS H_Γ for the conditional expectation operator E permits us to estimate stage 1 by kernel ridge regression. The stage 1 estimator of KIV is at once novel in the nonparametric IV literature and fundamentally similar to 2SLS. Basis function approximation [48, 17] is perhaps the closest prior IV approach, but we use infinite dictionaries of basis functions ψ and φ. Compared to density estimation [16, 23, 36], kernel ridge regression is an easier problem.\nAlternative stage 1 estimators in the literature estimate the singular system of E to ensure that the adjoint of the estimator equals the estimator of the adjoint. These estimators differ in how they estimate the singular system: empirical distribution [23], Nadaraya-Watson density [24], or B-spline wavelets [18]. The KIV stage 1 estimator has the desired property by construction; (E_λ^n)^* = (E^*)_λ^n. See Appendix A.3 for details.\n\n4.2 Stage 2\nNext, we transform the problem of learning h into a scalar-valued kernel ridge regression that respects the IV problem structure. In Proposition 12 of Appendix A.3, we show that under Hypothesis 3 below,\n\n𝔼_{X|Z=z} h(X) = [Eh](z) = ⟨h, µ(z)⟩_{H_X} = Hµ(z)\n\nwhere h ∈ H_X, a scalar-valued RKHS; E ∈ H_Γ, the vector-valued RKHS described above; µ ∈ H_Ξ, a vector-valued RKHS isometrically isomorphic to H_Γ; and H ∈ H_Ω, a scalar-valued RKHS isometrically isomorphic to H_X. It is helpful to think of µ(z) as the embedding into H_X of a distribution on X indexed by the conditioned value z. When k_X is characteristic, µ(z) uniquely embeds the conditional distribution, and H is identified. 
The kernel Ω satisfies k_X(x, x') = Ω(ψ(x), ψ(x')). This expression establishes the formal connection between our model and [64, 65]. The choice of Ω may be more general; for nonlinear examples see [65, Table 1].\nWe now state the objective for optimizing H ∈ H_Ω. Hypothesis 1 provides the operator equation, which may be rewritten as the regression equation\n\nY = 𝔼_{X|Z} h(X) + e_Z = Hµ(Z) + e_Z, 𝔼[e_Z|Z] = 0\n\nThe unconstrained solution is\n\nH_ρ = argmin ℰ(H), ℰ(H) = 𝔼_{(Y,Z)} ‖Y - Hµ(Z)‖^2_Y\n\nThe target operator is the constrained solution H_{H_Ω} = argmin_{H ∈ H_Ω} ℰ(H). We will assume H_ρ ∈ H_Ω so that H_ρ = H_{H_Ω}. With regularization,\n\nH_ξ = argmin_{H ∈ H_Ω} ℰ_ξ(H), ℰ_ξ(H) = ℰ(H) + ξ‖H‖^2_{H_Ω}\nH_ξ^m = argmin_{H ∈ H_Ω} ℰ_ξ^m(H), ℰ_ξ^m(H) = (1/m) Σ_{i=1}^m ‖ỹ_i - Hµ(z̃_i)‖^2_Y + ξ‖H‖^2_{H_Ω}\n\nThe essence of the IV problem is this: we do not directly observe the conditional expectation operator E (or equivalently the conditional mean embedding µ) that appears in the stage 2 objective. Rather, we approximate it using the estimate from stage 1. Thus our KIV estimator is ĥ_ξ^m = Ĥ_ξ^m ψ where\n\nĤ_ξ^m = argmin_{H ∈ H_Ω} ℰ̂_ξ^m(H), ℰ̂_ξ^m(H) = (1/m) Σ_{i=1}^m ‖ỹ_i - Hµ_λ^n(z̃_i)‖^2_Y + ξ‖H‖^2_{H_Ω}\n\nand µ_λ^n = (E_λ^n)^*φ. The transition from H_ρ to H_ξ^m represents the fact that we only have m samples. The transition from H_ξ^m to Ĥ_ξ^m represents the fact that we must learn not only the structural operator H but also the conditional expectation operator E. In this sense, the IV problem is more complex than the estimation problem considered by [45, 47] in which E is known.\n\n4.3 Algorithm\n\nWe obtain a closed form expression for the KIV estimator. 
The apparatus introduced above is required for analysis of consistency and convergence rate. More subtly, our RKHS construction allows us to write kernel ridge regression estimators for both stage 1 and stage 2, unlike previous work. Because KIV consists of repeated kernel ridge regressions, it benefits from repeated applications of the representer theorem [66, 51]. Consequently, we have a shortcut for obtaining KIV's closed form; see Appendix A.5.1 for the full derivation.\nAlgorithm 1. Let X and Z be matrices of n observations. Let ỹ and Z̃ be a vector and matrix of m observations.\n\nW = K_{XX}(K_{ZZ} + nλI)^{-1} K_{ZZ̃}, α̂ = (WW' + mξK_{XX})^{-1} Wỹ, ĥ_ξ^m(x) = (α̂)'K_{Xx}\n\nwhere K_{XX} and K_{ZZ} are the empirical kernel matrices.\n\nTheorems 2 and 4 below theoretically determine efficient rates for the stage 1 regularization parameter λ and stage 2 regularization parameter ξ, respectively. In Appendix A.5.2, we provide a validation procedure to empirically determine values for (λ, ξ).\n\n5 Consistency\n\n5.1 Stage 1\nIntegral operators: We use integral operator notation from the kernel methods literature, adapted to the conditional expectation operator learning problem. We denote by L^2(Z, ρ_Z) the space of square integrable functions from Z to Y with respect to measure ρ_Z, where ρ_Z is the restriction of ρ to Z.\nDefinition 1. The stage 1 (population) operators are\n\nS_1^* : H_Z ↪ L^2(Z, ρ_Z), ℓ ↦ ⟨ℓ, φ(·)⟩_{H_Z}\nS_1 : L^2(Z, ρ_Z) → H_Z, ℓ̃ ↦ ∫ φ(z) ℓ̃(z) dρ_Z(z)\n\nT_1 = S_1 ∘ S_1^* is the uncentered covariance operator of [30, Theorem 1]. In Appendix A.4.2, we prove that T_1 exists and has finite trace even when H_X and H_Z are infinite-dimensional. 
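For concreteness, the closed form of Algorithm 1 (Section 4.3) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the reference implementation linked in the introduction: the Gaussian kernel, the single shared lengthscale, the small jitter term added before the second solve, and the function names gaussian_kernel, kiv_fit, and kiv_predict are all hypothetical choices made for the example.

```python
import numpy as np


def gaussian_kernel(A, B, lengthscale=1.0):
    # Pairwise Gaussian kernel between the rows of A (p x d) and B (q x d).
    # The kernel family and the lengthscale are assumptions of this sketch.
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * lengthscale ** 2))


def kiv_fit(X, Z, y_tilde, Z_tilde, lam, xi, lengthscale=1.0, jitter=1e-8):
    # X, Z: stage 1 sample (n rows). y_tilde, Z_tilde: stage 2 sample (m rows).
    n, m = X.shape[0], Z_tilde.shape[0]
    K_XX = gaussian_kernel(X, X, lengthscale)
    K_ZZ = gaussian_kernel(Z, Z, lengthscale)
    K_ZZt = gaussian_kernel(Z, Z_tilde, lengthscale)
    # W = K_XX (K_ZZ + n * lam * I)^{-1} K_{Z Ztilde}, an n x m matrix.
    W = K_XX @ np.linalg.solve(K_ZZ + n * lam * np.eye(n), K_ZZt)
    # alpha = (W W' + m * xi * K_XX)^{-1} W y_tilde.
    # The jitter is NOT part of Algorithm 1; it only guards against a
    # numerically singular system when K_XX is ill-conditioned.
    A = W @ W.T + m * xi * K_XX + jitter * np.eye(n)
    return np.linalg.solve(A, W @ y_tilde)


def kiv_predict(alpha, X, x_new, lengthscale=1.0):
    # h_hat(x) = alpha' K_{Xx}, evaluated for every row of x_new.
    return gaussian_kernel(x_new, X, lengthscale) @ alpha
```

Both stages use np.linalg.solve instead of forming explicit inverses, which is the usual choice for kernel ridge systems. The values of lam and xi are left to the caller; the validation procedure of Appendix A.5.2 is not reproduced here.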
In Appendix A.4.4, we compare T_1 with other covariance operators in the kernel methods literature.\nAssumptions: We place assumptions on the original spaces X and Z, the scalar-valued RKHSs H_X and H_Z, and the probability distribution ρ(x, z). We maintain these assumptions throughout the paper. Importantly, we assume that the vector-valued RKHS regression is correctly specified: the true conditional expectation operator E_ρ lives in the vector-valued RKHS H_Γ. In further research, we will relax this assumption.\nHypothesis 2. Suppose that X and Z are Polish spaces, i.e. separable and completely metrizable topological spaces\nHypothesis 3. Suppose that\n1. k_X and k_Z are continuous and bounded: sup_{x ∈ X} ‖ψ(x)‖_{H_X} ≤ Q, sup_{z ∈ Z} ‖φ(z)‖_{H_Z} ≤ κ\n2. ψ and φ are measurable\n3. k_X is characteristic [57]\n\nHypothesis 4. Suppose that E_ρ ∈ H_Γ. Then ℰ_1(E_ρ) = inf_{E ∈ H_Γ} ℰ_1(E)\nHypothesis 3.3 specializes the completeness condition of [48]. Hypotheses 2-4 are sufficient to bound the sampling error of the regularized estimator E_λ^n. Bounding the approximation error requires a further assumption on the smoothness of the distribution ρ(x, z). We assume ρ(x, z) belongs to a class of distributions parametrized by (ζ_1, c_1), as generalized from [54, Theorem 2] to the space H_Γ.\nHypothesis 5. Fix ζ_1 < ∞. For given c_1 ∈ (1, 2], define the prior P(ζ_1, c_1) as the set of probability distributions ρ on X × Z such that a range space assumption is satisfied: ∃G_1 ∈ H_Γ s.t. E_ρ = T_1^{(c_1-1)/2} ∘ G_1 and ‖G_1‖^2_{H_Γ} ≤ ζ_1\n\nWe use composition symbol ∘ to emphasize that G_1 : H_X → H_Z and T_1 : H_Z → H_Z. We define the power of operator T_1 with respect to its eigendecomposition; see Appendix A.4.2 for formal justification. Larger c_1 corresponds to a smoother conditional expectation operator E_ρ. 
Proposition 24 in Appendix A.6.2 shows E_ρ^*φ(z) = µ(z), so Hypothesis 5 is an indirect smoothness condition on the conditional mean embedding µ.\nEstimation and convergence: The estimator has a closed form solution, as noted in [34, Section 3.1] and [35, Appendix D]; [20] use it in the first stage of the structured prediction problem. We present the closed form solution in notation similar to [14] in order to elucidate how the estimator simply generalizes linear regression. This connection foreshadows our proof technique.\nTheorem 1. ∀λ > 0, the solution E_λ^n of the regularized empirical objective ℰ_λ^n exists, is unique, and\n\nE_λ^n = (T_1 + λ)^{-1} ∘ g_1, T_1 = (1/n) Σ_{i=1}^n φ(z_i) ⊗ φ(z_i), g_1 = (1/n) Σ_{i=1}^n φ(z_i) ⊗ ψ(x_i)\n\nWe prove an original, finite sample bound on the RKHS-norm distance of the estimator E_λ^n from its target E_ρ. The proof is in Appendix A.7.\nTheorem 2. Assume Hypotheses 2-5. ∀δ ∈ (0, 1), the following holds w.p. 1 - δ:\n\n‖E_λ^n - E_ρ‖_{H_Γ} ≤ r_E(δ, n, c_1) := [√ζ_1 (c_1 + 1) / 4^{1/(c_1+1)}] · [4κ(Q + κ‖E_ρ‖_{H_Γ}) ln(2/δ) / (√(nζ_1) (c_1 - 1))]^{(c_1-1)/(c_1+1)}\nλ = [8κ(Q + κ‖E_ρ‖_{H_Γ}) ln(2/δ) / (√(nζ_1) (c_1 - 1))]^{2/(c_1+1)}\n\nThe efficient rate of λ is n^{-1/(c_1+1)}. Note that the convergence rate of E_λ^n is calibrated by c_1, which measures the smoothness of the conditional expectation operator E_ρ.\n\n5.2 Stage 2\nIntegral operators: We use integral operator notation from the kernel methods literature, adapted to the structural operator learning problem. We denote by L^2(H_X, ρ_{H_X}) the space of square integrable functions from H_X to Y with respect to measure ρ_{H_X}, where ρ_{H_X} is the extension of ρ to H_X [59, Lemma A.3.16]. Note that we present stage 2 analysis for general output space Y as in [64, 65], though in practice we only consider Y ⊂ R to simplify our two-stage RKHS model.\nDefinition 2. 
The stage 2 (population) operators are\n\nS^* : H_Ω ↪ L^2(H_X, ρ_{H_X}), H ↦ Ω^*_{(·)} H\nS : L^2(H_X, ρ_{H_X}) → H_Ω, H̃ ↦ ∫ Ω_{µ(z)} ∘ H̃µ(z) dρ_{H_X}(µ(z))\n\nwhere Ω_{µ(z)} : Y → H_Ω defined by y ↦ Ω(·, µ(z))y is the point evaluator of [42, 15]. Finally define T_{µ(z)} = Ω_{µ(z)} ∘ Ω^*_{µ(z)} and covariance operator T = S ∘ S^*.\nAssumptions: We place assumptions on the original space Y, the scalar-valued RKHS H_Ω, and the probability distribution ρ. Importantly, we assume that the scalar-valued RKHS regression is correctly specified: the true structural operator H_ρ lives in the scalar-valued RKHS H_Ω.\nHypothesis 6. Suppose that Y is a Polish space\nHypothesis 7. Suppose that\n1. The {Ω_{µ(z)}} operator family is uniformly bounded in Hilbert-Schmidt norm: ∃B s.t. ∀µ(z), ‖Ω_{µ(z)}‖^2_{L_2(Y, H_Ω)} = Tr(Ω^*_{µ(z)} ∘ Ω_{µ(z)}) ≤ B\n2. The {Ω_{µ(z)}} operator family is Hölder continuous in operator norm: ∃L > 0, ι ∈ (0, 1] s.t. ∀µ(z), µ(z'), ‖Ω_{µ(z)} - Ω_{µ(z')}‖_{L(Y, H_Ω)} ≤ L‖µ(z) - µ(z')‖^ι_{H_X}\n\nLarger ι is interpretable as smoother kernel Ω.\nHypothesis 8. Suppose that\n1. H_ρ ∈ H_Ω. Then ℰ(H_ρ) = inf_{H ∈ H_Ω} ℰ(H)\n2. Y is bounded, i.e. ∃C < ∞ s.t. ‖Y‖_Y ≤ C almost surely\n\nThe convergence rate from stage 1 together with Hypotheses 6-8 are sufficient to bound the excess error of the regularized estimator Ĥ_ξ^m in terms of familiar objects in the kernel methods literature, namely the residual, reconstruction error, and effective dimension. We further assume ρ belongs to a stage 2 prior to simplify these bounds. 
In particular, we assume ρ belongs to a class of distributions parametrized by (ζ, b, c) as defined originally in [14, Definition 1], restated below.\nHypothesis 9. Fix ζ < ∞. For given b ∈ (1, ∞] and c ∈ (1, 2], define the prior P(ζ, b, c) as the set of probability distributions ρ on H_X × Y such that\n1. A range space assumption is satisfied: ∃G ∈ H_Ω s.t. H_ρ = T^{(c-1)/2} ∘ G and ‖G‖^2_{H_Ω} ≤ ζ\n2. In the spectral decomposition T = Σ_{k=1}^∞ λ_k e_k⟨·, e_k⟩_{H_Ω}, where {e_k}_{k=1}^∞ is a basis of Ker(T)^⊥, the eigenvalues satisfy α ≤ k^b λ_k ≤ β for some α, β > 0\n\nWe define the power of operator T with respect to its eigendecomposition; see Appendix A.4.2 for formal justification. The latter condition is interpretable as polynomial decay of eigenvalues: λ_k = Θ(k^{-b}). Larger b means faster decay of eigenvalues of the covariance operator T and hence smaller effective input dimension. Larger c corresponds to a smoother structural operator H_ρ [65].\nEstimation and convergence: The estimator has a closed form solution, as shown by [64, 65] in the second stage of the distribution regression problem. We present the solution in notation similar to [14] to elucidate how the stage 1 and stage 2 estimators have the same structure.\nTheorem 3. ∀ξ > 0, the solution H_ξ^m to ℰ_ξ^m and the solution Ĥ_ξ^m to ℰ̂_ξ^m exist, are unique, and\n\nH_ξ^m = (T + ξ)^{-1} g, T = (1/m) Σ_{i=1}^m T_{µ(z̃_i)}, g = (1/m) Σ_{i=1}^m Ω_{µ(z̃_i)} ỹ_i\nĤ_ξ^m = (T̂ + ξ)^{-1} ĝ, T̂ = (1/m) Σ_{i=1}^m T_{µ_λ^n(z̃_i)}, ĝ = (1/m) Σ_{i=1}^m Ω_{µ_λ^n(z̃_i)} ỹ_i\n\nWe now present this paper's main theorem. 
In Appendix A.10, we provide a finite sample bound\non the excess error of the estimator $\\hat{H}^m_\\xi$ with respect to its target $H_\\rho$. Adapting arguments by [65],\nwe demonstrate that KIV is able to achieve the minimax optimal single-stage rate derived by [14].\nIn other words, our two-stage estimator learns the causal relationship from confounded\ndata as well as single-stage RKHS regression learns the causal relationship from\nunconfounded data.\nTheorem 4. Assume Hypotheses 1-9. Choose $\\lambda = n^{-\\frac{1}{c_1+1}}$ and $n = m^{\\frac{a(c_1+1)}{\\iota(c_1-1)}}$ where $a > 0$.\n\n1. If $a \\leq \\frac{b(c+1)}{bc+1}$ then $\\mathcal{E}(\\hat{H}^m_\\xi) - \\mathcal{E}(H_\\rho) = O_p(m^{-\\frac{ac}{c+1}})$ with $\\xi = m^{-\\frac{a}{c+1}}$\n2. If $a \\geq \\frac{b(c+1)}{bc+1}$ then $\\mathcal{E}(\\hat{H}^m_\\xi) - \\mathcal{E}(H_\\rho) = O_p(m^{-\\frac{bc}{bc+1}})$ with $\\xi = m^{-\\frac{b}{bc+1}}$\n\nAt $a = \\frac{b(c+1)}{bc+1} < 2$, the convergence rate $m^{-\\frac{bc}{bc+1}}$ is minimax optimal while requiring the fewest\nobservations [65]. This statistically efficient rate is calibrated by $b$, the effective input dimension, as\nwell as $c$, the smoothness of structural operator $H_\\rho$ [14]. The efficient ratio between stage 1 and stage\n2 samples is $n = m^{\\frac{b(c+1)}{bc+1} \\cdot \\frac{c_1+1}{\\iota(c_1-1)}}$, implying $n > m$. As far as we know, asymmetric sample splitting\nis a novel prescription in the IV literature; previous analyses assume $n = m$ [4, 37].\n\n6 Experiments\n\nWe compare the empirical performance of KIV (KernelIV) to four leading competitors: standard\nkernel ridge regression (KernelReg) [50], Nadaraya-Watson IV (SmoothIV) [16, 23], sieve IV\n(SieveIV) [48, 17], and deep IV (DeepIV) [36]. To improve the performance of sieve IV, we impose\nTikhonov regularization in both stages with KIV\u2019s tuning procedure. 
This adaptation exceeds the\ntheoretical justification provided by [17]. However, it is justified by our analysis insofar as sieve IV is\na special case of KIV: set feature maps $\\psi$, $\\phi$ equal to the sieve bases.\nWe implement each estimator on three designs. The linear design\n[17] involves learning counterfactual function $h(x) = 4x - 2$,\ngiven confounded observations of continuous variables $(X, Y)$\nas well as continuous instrument $Z$. The sigmoid design [17]\ninvolves learning counterfactual function $h(x) = \\ln(|16x - 8| + 1) \\cdot sgn(x - 0.5)$\nunder the same regime. The demand design\n[36] involves learning demand function $h(p, t, s) = 100 + (10 + p) \\cdot s \\cdot \\psi(t) - 2p$\nwhere $\\psi(t)$ is the complex nonlinear function\nin Figure 6. An observation consists of $(Y, P, T, S, C)$ where $Y$\nis sales, $P$ is price, $T$ is time of year, $S$ is customer sentiment (a\ndiscrete variable), and $C$ is a supply cost shifter. The parameter\n$\\rho \\in \\{0.9, 0.75, 0.5, 0.25, 0.1\\}$ calibrates the extent to which price\n$P$ is confounded by supply-side market forces. In KIV notation,\ninputs are $X = (P, T, S)$ and instruments are $Z = (C, T, S)$.\nFor each algorithm, design, and sample size, we implement 40 simulations and calculate MSE with\nrespect to the true structural function $h$. Figures 2, 3, and 10 visualize results. In the sigmoid design,\nKernelIV performs best across sample sizes. In the demand design, KernelIV performs best for\nsample size $n + m = 1000$ and rivals DeepIV for sample size $n + m = 5000$. KernelReg ignores\nthe instrument $Z$, and it is biased away from the structural function due to confounding noise $e$. This\nphenomenon can have counterintuitive consequences. Figure 3 shows that in the highly nonlinear\ndemand design, KernelReg deviates further from the structural function as sample size increases\nbecause the algorithm is further misled by confounded data. Figure 2 of [36] documents the same\neffect when a feedforward neural network is applied to the same data. 
The remaining algorithms\nmake use of the instrument $Z$ to overcome this issue.\nKernelIV improves on SieveIV in the same way that kernel ridge regression improves on ridge\nregression: by using an infinite dictionary of implicit basis functions rather than a finite dictionary\nof explicit basis functions. KernelIV improves on SmoothIV by using kernel ridge regression in\nnot only stage 2 but also stage 1, avoiding costly density estimation. Finally, it improves on DeepIV\nby directly learning stage 1 mean embeddings, rather than performing costly density estimation and\nsampling from the estimated density. Remarkably, with training sample size of only $n + m = 1000$,\nKernelIV has essentially learned as much as it can learn from the demand design. See Appendix A.11\nfor representative plots, implementation details, and a robustness study.\n\nFigure 2: Sigmoid design\n\nFigure 3: Demand design\n[Panels: Demand 0.9, Demand 0.75, Demand 0.5, Demand 0.25, Demand 0.1. Vertical axis: Out-of-Sample MSE (log). Horizontal axis: Training Sample Size (1000). Algorithms: KernelReg, SmoothIV, SieveIV(ridge), DeepIV, KernelIV.]\n\n7 Conclusion\n\nWe introduce KIV, an algorithm for learning a nonlinear, causal relationship from confounded\nobservational data. KIV is easily implemented and minimax optimal. As a contribution to the IV\nliterature, we show how to estimate the stage 1 conditional expectation operator (an infinite by infinite\ndimensional object) by kernel ridge regression. As a contribution to the kernel methods literature, we\nshow how the RKHS is well-suited to causal inference and ill-posed inverse problems. In simulations,\nKIV outperforms state-of-the-art algorithms for nonparametric IV regression. 
The success of KIV\nsuggests RKHS methods may be an effective bridge between econometrics and machine learning.\n\nAcknowledgments\nWe are grateful to Alberto Abadie, Anish Agarwal, Michael Arbel, Victor Chernozhukov, Geoffrey\nGordon, Jason Hartford, Motonobu Kanagawa, Anna Mikusheva, Whitney Newey, Nakul Singh,\nBharath Sriperumbudur, and Suhas Vijaykumar. This project was made possible by the Marshall Aid\nCommemoration Commission.\n\nReferences\n\n[1] Theodore W Anderson and Cheng Hsiao. Estimation of dynamic models with error components. Journal of the American Statistical Association, 76(375):598\u2013606, 1981.\n\n[2] Joshua D Angrist. Lifetime earnings and the Vietnam era draft lottery: Evidence from Social Security administrative records. The American Economic Review, pages 313\u2013336, 1990.\n\n[3] Joshua D Angrist, Guido W Imbens, and Donald B Rubin. Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444\u2013455, 1996.\n\n[4] Joshua D Angrist and Alan B Krueger. Split-sample instrumental variables estimates of the return to schooling. Journal of Business & Economic Statistics, 13(2):225\u2013235, 1995.\n\n[5] Michael Arbel and Arthur Gretton. Kernel conditional exponential family. In International Conference on Artificial Intelligence and Statistics, pages 1337\u20131346, 2018.\n\n[6] Manuel Arellano and Stephen Bond. Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations. The Review of Economic Studies, 58(2):277\u2013297, 1991.\n\n[7] Jordan Bell. Trace class operators and Hilbert-Schmidt operators. Technical report, University of Toronto Department of Mathematics, 2016.\n\n[8] Alain Berlinet and Christine Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer, 2011.\n\n[9] Christopher Bishop. Pattern Recognition and Machine Learning. 
Springer, 2006.\n\n[10] Richard Blundell, Xiaohong Chen, and Dennis Kristensen. Semi-nonparametric IV estimation of shape-invariant Engel curves. Econometrica, 75(6):1613\u20131669, 2007.\n\n[11] Richard Blundell, Joel L Horowitz, and Matthias Parey. Measuring the price responsiveness of gasoline demand: Economic shape restrictions and nonparametric demand estimation. Quantitative Economics, 3(1):29\u201351, 2012.\n\n[12] Byron Boots, Arthur Gretton, and Geoffrey J Gordon. Hilbert space embeddings of predictive state representations. In Uncertainty in Artificial Intelligence, pages 92\u2013101, 2013.\n\n[13] John Bound, David A Jaeger, and Regina M Baker. Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. Journal of the American Statistical Association, 90(430):443\u2013450, 1995.\n\n[14] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331\u2013368, 2007.\n\n[15] Claudio Carmeli, Ernesto De Vito, and Alessandro Toigo. Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Analysis and Applications, 4(4):377\u2013408, 2006.\n\n[16] Marine Carrasco, Jean-Pierre Florens, and Eric Renault. Linear inverse problems in structural econometrics estimation based on spectral decomposition and regularization. Handbook of Econometrics, 6:5633\u20135751, 2007.\n\n[17] Xiaohong Chen and Timothy M Christensen. Optimal sup-norm rates and uniform inference on nonlinear functionals of nonparametric IV regression. Quantitative Economics, 9(1):39\u201384, 2018.\n\n[18] Xiaohong Chen, Lars P Hansen, and Jose Scheinkman. Shape-preserving estimation of diffusions. Technical report, University of Chicago Department of Economics, 1997.\n\n[19] Xiaohong Chen and Demian Pouzo. 
Estimation of nonparametric conditional moment models with possibly nonsmooth generalized residuals. Econometrica, 80(1):277\u2013321, 2012.\n\n[20] Carlo Ciliberto, Lorenzo Rosasco, and Alessandro Rudi. A consistent regularization approach for structured prediction. In Advances in Neural Information Processing Systems, pages 4412\u20134420, 2016.\n\n[21] Corinna Cortes, Mehryar Mohri, and Jason Weston. A general regression technique for learning transductions. In International Conference on Machine Learning, pages 153\u2013160, 2005.\n\n[22] Felipe Cucker and Steve Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39(1):1\u201349, 2002.\n\n[23] Serge Darolles, Yanqin Fan, Jean-Pierre Florens, and Eric Renault. Nonparametric instrumental regression. Econometrica, 79(5):1541\u20131565, 2011.\n\n[24] Serge Darolles, Jean-Pierre Florens, and Christian Gourieroux. Kernel-based nonlinear canonical analysis and time reversibility. Journal of Econometrics, 119(2):323\u2013353, 2004.\n\n[25] Ernesto De Vito and Andrea Caponnetto. Risk bounds for regularized least-squares algorithm with operator-value kernels. Technical report, MIT CSAIL, 2005.\n\n[26] Carlton Downey, Ahmed Hefny, Byron Boots, Geoffrey J Gordon, and Boyue Li. Predictive state recurrent neural networks. In Advances in Neural Information Processing Systems, pages 6053\u20136064, 2017.\n\n[27] Vincent Dutordoir, Hugh Salimbeni, James Hensman, and Marc Deisenroth. Gaussian process conditional density estimation. In Advances in Neural Information Processing Systems, pages 2385\u20132395, 2018.\n\n[28] Heinz W Engl, Martin Hanke, and Andreas Neubauer. Regularization of Inverse Problems, volume 375. Springer Science & Business Media, 1996.\n\n[29] Jean-Pierre Florens. Inverse problems and structural econometrics. 
In Advances in Economics and Econometrics: Theory and Applications, volume 2, pages 46\u201385, 2003.\n\n[30] Kenji Fukumizu, Francis R Bach, and Michael I Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5(Jan):73\u201399, 2004.\n\n[31] Kenji Fukumizu, Le Song, and Arthur Gretton. Kernel Bayes\u2019 rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14(1):3753\u20133783, 2013.\n\n[32] Arthur Gretton. RKHS in machine learning: Testing statistical dependence. Technical report, UCL Gatsby Unit, 2018.\n\n[33] Steffen Gr\u00fcnew\u00e4lder, Arthur Gretton, and John Shawe-Taylor. Smooth operators. In International Conference on Machine Learning, pages 1184\u20131192, 2013.\n\n[34] Steffen Gr\u00fcnew\u00e4lder, Guy Lever, Luca Baldassarre, Sam Patterson, Arthur Gretton, and Massimiliano Pontil. Conditional mean embeddings as regressors. In International Conference on Machine Learning, volume 5, 2012.\n\n[35] Steffen Gr\u00fcnew\u00e4lder, Guy Lever, Luca Baldassarre, Massimiliano Pontil, and Arthur Gretton. Modelling transition dynamics in MDPs with RKHS embeddings. In International Conference on Machine Learning, pages 1603\u20131610, 2012.\n\n[36] Jason Hartford, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. Deep IV: A flexible approach for counterfactual prediction. In International Conference on Machine Learning, pages 1414\u20131423, 2017.\n\n[37] Ahmed Hefny, Carlton Downey, and Geoffrey J Gordon. Supervised learning for dynamical system learning. In Advances in Neural Information Processing Systems, pages 1963\u20131971, 2015.\n\n[38] Miguel A Hernan and James M Robins. Causal Inference. CRC Press, 2019.\n\n[39] Daniel Hsu, Sham Kakade, and Tong Zhang. Tail inequalities for sums of random matrices that depend on the intrinsic dimension. 
Electronic Communications in Probability, 17(14):1\u201313, 2012.\n\n[40] Guido W Imbens and Whitney K Newey. Identification and estimation of triangular simultaneous equations models without additivity. Econometrica, 77(5):1481\u20131512, 2009.\n\n[41] Rainer Kress. Linear Integral Equations, volume 3. Springer, 1989.\n\n[42] Charles A Micchelli and Massimiliano Pontil. On learning vector-valued functions. Neural Computation, 17(1):177\u2013204, 2005.\n\n[43] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, Bernhard Sch\u00f6lkopf, et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1\u2013141, 2017.\n\n[44] Michael P Murray. Avoiding invalid instruments and coping with weak instruments. Journal of Economic Perspectives, 20(4):111\u2013132, 2006.\n\n[45] M Zuhair Nashed and Grace Wahba. Convergence rates of approximate least squares solutions of linear integral and operator equations of the first kind. Mathematics of Computation, 28(125):69\u201380, 1974.\n\n[46] M Zuhair Nashed and Grace Wahba. Generalized inverses in reproducing kernel spaces: An approach to regularization of linear operator equations. SIAM Journal on Mathematical Analysis, 5(6):974\u2013987, 1974.\n\n[47] M Zuhair Nashed and Grace Wahba. Regularization and approximation of linear operator equations in reproducing kernel spaces. Bulletin of the American Mathematical Society, 80(6):1213\u20131218, 1974.\n\n[48] Whitney K Newey and James L Powell. Instrumental variable estimation of nonparametric models. Econometrica, 71(5):1565\u20131578, 2003.\n\n[49] Judea Pearl. Causality. Cambridge University Press, 2009.\n\n[50] Craig Saunders, Alexander Gammerman, and Volodya Vovk. Ridge regression learning algorithm in dual variables. In International Conference on Machine Learning, pages 515\u2013521, 1998.\n\n[51] Bernhard Sch\u00f6lkopf, Ralf Herbrich, and Alex J Smola. 
A generalized representer theorem. In International Conference on Computational Learning Theory, pages 416\u2013426, 2001.\n\n[52] Rahul Singh. Causal inference tutorial. Technical report, MIT Economics, 2019.\n\n[53] Steve Smale and Ding-Xuan Zhou. Shannon sampling II: Connections to learning theory. Applied and Computational Harmonic Analysis, 19(3):285\u2013302, 2005.\n\n[54] Steve Smale and Ding-Xuan Zhou. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26(2):153\u2013172, 2007.\n\n[55] Le Song, Arthur Gretton, and Carlos Guestrin. Nonparametric tree graphical models. In International Conference on Artificial Intelligence and Statistics, pages 765\u2013772, 2010.\n\n[56] Le Song, Jonathan Huang, Alex Smola, and Kenji Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In International Conference on Machine Learning, pages 961\u2013968, 2009.\n\n[57] Bharath Sriperumbudur, Kenji Fukumizu, and Gert Lanckriet. On the relation between universality, characteristic kernels and RKHS embedding of measures. In International Conference on Artificial Intelligence and Statistics, pages 773\u2013780, 2010.\n\n[58] Douglas Staiger and James H Stock. Instrumental variables regression with weak instruments. Econometrica, 65(3):557\u2013586, 1997.\n\n[59] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer, 2008.\n\n[60] James H Stock and Francesco Trebbi. Retrospectives: Who invented instrumental variable regression? Journal of Economic Perspectives, 17(3):177\u2013194, 2003.\n\n[61] James H Stock, Jonathan H Wright, and Motohiro Yogo. A survey of weak instruments and weak identification in generalized method of moments. Journal of Business & Economic Statistics, 20(4):518\u2013529, 2002.\n\n[62] Masashi Sugiyama, Ichiro Takeuchi, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, and Daisuke Okanohara. 
Least-squares conditional density estimation. IEICE Transactions, 93-D(3):583\u2013594, 2010.\n\n[63] Dougal Sutherland. Fixing an error in Caponnetto and De Vito. Technical report, UCL Gatsby Unit, 2017.\n\n[64] Zolt\u00e1n Szab\u00f3, Arthur Gretton, Barnab\u00e1s P\u00f3czos, and Bharath Sriperumbudur. Two-stage sampled learning theory on distributions. In International Conference on Artificial Intelligence and Statistics, pages 948\u2013957, 2015.\n\n[65] Zolt\u00e1n Szab\u00f3, Bharath Sriperumbudur, Barnab\u00e1s P\u00f3czos, and Arthur Gretton. Learning theory for distribution regression. Journal of Machine Learning Research, 17(152):1\u201340, 2016.\n\n[66] Grace Wahba. Spline Models for Observational Data, volume 59. SIAM, 1990.\n\n[67] Larry Wasserman. All of Nonparametric Statistics. Springer, 2006.\n\n[68] Philip G Wright. Tariff on Animal and Vegetable Oils. Macmillan Company, 1928.", "award": [], "sourceid": 2580, "authors": [{"given_name": "Rahul", "family_name": "Singh", "institution": "MIT"}, {"given_name": "Maneesh", "family_name": "Sahani", "institution": "Gatsby Unit, UCL"}, {"given_name": "Arthur", "family_name": "Gretton", "institution": "Gatsby Unit, UCL"}]}