{"title": "Causal Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 12704, "page_last": 12714, "abstract": "We argue that regularizing terms in standard regression methods not only help against overfitting finite data, but sometimes also help in getting better causal models. We first consider a multi-dimensional variable linearly influencing a target variable with some multi-dimensional unobserved common cause, where the confounding effect can be decreased by keeping the penalizing term in Ridge and Lasso regression even in the population limit. The reason is a close analogy between overfitting and confounding observed for our toy model. In the case of overfitting, we can choose regularization constants via cross validation, but here we choose the regularization constant by first estimating the strength of confounding, which yielded reasonable results for simulated and real data. Further, we show a \u2018causal generalization bound\u2019 which states (subject to our particular model of confounding) that the error made by interpreting any non-linear regression as causal model can be bounded from above whenever functions are taken from a not too rich class.", "full_text": "Causal Regularization\n\nDominik Janzing\n\nAmazon Research T\u00fcbingen\n\nGermany\n\njanzind@amazon.com\n\nAbstract\n\nWe argue that regularizing terms in standard regression methods not only help\nagainst over\ufb01tting \ufb01nite data, but sometimes also help in getting better causal\nmodels. We \ufb01rst consider a multi-dimensional variable linearly in\ufb02uencing a\ntarget variable with some multi-dimensional unobserved common cause, where\nthe confounding effect can be decreased by keeping the penalizing term in Ridge\nand Lasso regression even in the population limit. The reason is a close analogy\nbetween over\ufb01tting and confounding observed for our toy model. 
In the case of overfitting, we can choose regularization constants via cross validation, but here we choose the regularization constant by first estimating the strength of confounding, which yielded reasonable results for simulated and real data. Further, we show a 'causal generalization bound' which states (subject to our particular model of confounding) that the error made by interpreting any non-linear regression as a causal model can be bounded from above whenever functions are taken from a not too rich class.

1 Introduction

Predicting a scalar target variable Y from a d-dimensional predictor X := (X_1, ..., X_d) via appropriate regression models is among the classical problems of machine learning [1]. In the standard supervised learning scenario, some finite number of observations, independently drawn from an unknown but fixed joint distribution P_Y,X, are used for inferring Y-values corresponding to unlabelled X-values. To solve this task, regularization is known to be crucial for obtaining regression models that generalize well from training to test data [2]. Deciding whether such a regression model admits a causal interpretation is, however, challenging. Even if causal influence from Y to X can be excluded (e.g. by time order), the statistical relation between X and Y cannot necessarily be attributed to the influence of X on Y. Instead, it could be due to possible common causes, also called 'confounders'. For the case where common causes are known and observed, there is a huge number of techniques to infer the causal influence1, e.g., [3], addressing different challenges, for instance, high-dimensional confounders [4] or the case where some variables other than the common causes are observed [5], just to mention a few of them. If common causes are not known, the task of inferring the influence of X on Y gets incredibly hard.
Given observations from any further variables other than X and Y, conditional independences may help to detect or disprove the existence of common causes [5], and so-called instrumental variables may admit the identification of causal influence [6].

Here we consider the case where only observations from X and Y are given. In this case, naively interpreting the regression model as a causal model is a natural baseline. We show that strong regularization increases the chances that the regression model contains some causal truth. We are aware of the risk that this result could be mistaken as a justification to ignore the hardness of the problem and blindly infer causal models by strong regularization. Our goal is, instead, to inspire a discussion on to what extent causal modelling should regularize even in the infinite sample limit, due to some analogies between generalizing across samples from the same distribution and 'generalizing' from observational to interventional distributions, which appear in our models of confounding, while they need not apply to other confounding scenarios. The idea is not entirely novel, since it is tightly linked to several ideas that have been 'floating around' in the machine learning community for a while. It is believed (and can be proven subject to appropriate model assumptions) that finding statistical models that generalize well across different background conditions is closely linked to finding causal models [7, 8, 9, 10].2 It is then natural to also believe that generalizing across different environments is related to generalizing across different samples. Accordingly, [12] describes regularization techniques for linear regression that help to generalize across certain shift perturbations.

1often for d = 1 and with a binary treatment variable X

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
Here we describe a scenario for which the analogy between 'regularizing against overfitting' and 'regularizing against confounding' gets as tight as possible in the sense that the same regularization helps for both purposes. Due to this theoretical focus, we prefer to work with the simplest non-trivial scenario rather than looking for the most relevant or most realistic case.

Scenario 1: inferring a linear statistical model To explain the idea, we consider a linear statistical relation between X and Y:

Y = Xa + E,   (1)

where a is a column vector in R^d and E is an uncorrelated unobserved noise variable, i.e., Σ_XE = 0. Let Ŷ denote the column vector of centred renormalized observations y_i of Y, i.e., with entries (y_i − (1/n) ∑_{i=1}^n y_i)/√(n − 1), and similarly, Ê denotes the centred renormalized values of E. Likewise, let X̂ denote the n × d matrix whose j-th column contains the centred renormalized observations from X_j. Let, further, X̂⁻¹ denote its (Moore–Penrose) pseudoinverse. To avoid overfitting, the ordinary least squares estimator3

â := argmin_{a′} ‖Ŷ − X̂a′‖² = X̂⁻¹Ŷ = a + X̂⁻¹Ê,   (2)

is replaced with the Ridge and Lasso estimators

â_λ^ridge := argmin_{a′} {λ‖a′‖₂² + ‖Ŷ − X̂a′‖²}   (3)
â_λ^lasso := argmin_{a′} {λ‖a′‖₁ + ‖Ŷ − X̂a′‖²},   (4)

where λ is a regularization parameter [13].

So far we have only described the standard scenario of inferring properties of the conditional P_Y|X from finite observations X̂, Ŷ without any causal semantics.

Scenario 2: inferring a linear causal model We now modify the scenario in three respects.
First, we assume that E and X in (1) correlate due to some unobserved common cause. Second, we interpret (1) in a causal way in the sense that setting X to x lets Y be distributed according to xa + E. Using Pearl's do-notation (a crucial concept for formalizing causality) [5], this can be phrased as

Y|do(X=x) = xa + E ≠ Y|X=x,   (5)

where we don't have equality because E needs to be replaced with E|X=x for the observational conditional. Third, we assume the infinite sample limit where P_X,Y is known. We still want to infer the vector a because we are interested in causal statements, but regressing Y on X yields â instead, which describes the observational conditional on the right hand side of (5).

Conceptually, scenarios 1 and 2 deal with two entirely different problems: inferring P_Y|X=x from finite samples (X̂, Ŷ) versus inferring the interventional conditional P_Y|do(X=x) from the observational distribution P_Y,X. Nevertheless, both problems amount to inferring the vector a, and for both scenarios the error term X̂⁻¹Ê causes failure of ordinary least squares regression. Only the reason why this term is non-zero differs: in the first scenario it is a finite sample effect, while it results from confounding in the second one. The idea of the present paper is simply that standard regularization techniques do not care about the origin of this error term. Therefore, they can temper the impact of confounding in the same way as they help to avoid overfitting finite data.

The paper is structured as follows. Section 2 fleshes out scenarios 1 and 2 in a way that entails that the regression error follows the same distributions. Section 3 proposes a way to determine the regularization parameter in scenario 2 by estimating the strength of confounding via a method proposed by [14]. Section 4 describes some empirical results. Section 5 describes a modified statistical learning theory which states that regression models from not too rich function classes 'generalize' from statistical to causal statements.

2After submission I became aware of a preprint with the same title as mine, [11], where regularizers are constructed that are tailored particularly for causal features.
3Here we have, for simplicity, assumed n > d.

Figure 1: Left: In scenario 1, the empirical correlations between X and E are only finite sample effects. Right: In scenario 2, X and E are correlated due to their common cause Z. We sample the structural parameters M and c from distributions in a way that entails a simple analogy between scenarios 1 and 2.

2 Analogy between overfitting and confounding

The reason why our scenario 2 only considers the infinite sample limit of confounding is that mixing finite sample and confounding effects significantly complicates the theoretical results. The supplement sketches the complications of this case. For a concise description of the population case, we consider the Hilbert space H of centred random variables (on some probability space without further specification) with finite variance. The inner product is given by the covariance, e.g.,

⟨X_i, X_j⟩ := cov(X_i, X_j).   (6)

Accordingly, we can interpret X as an operator4 R^d → H via (b_1, ..., b_d) ↦ ∑_j b_j X_j. Then the population version of (2) reads

ã = argmin_{a′} ‖Y − Xa′‖² = X⁻¹Y = a + X⁻¹E,   (7)

where the squared length is induced by the inner product (6), i.e., it is simply the variance. Extending the previous notation, X⁻¹ now denotes the pseudoinverse of the operator X [15].
To see that X⁻¹E is only non-zero when X and E are correlated, it is helpful to rewrite it as

X⁻¹E = Σ_XX⁻¹ Σ_XE,   (8)

where we have assumed Σ_XX to be invertible (see supplement for a proof). One can easily show that the empirical covariance matrix Σ̂_XE causing the overfitting error is distributed according to N(0, Σ̂_XX σ_E²/n).5

To get the desired analogy between scenarios 1 and 2, we just need a generating model for confounders for which Σ_XE is distributed according to N(0, γΣ_XX) for some parameter γ. The independent source model for confounding described in [14] turned out to satisfy this requirement after some further specification.

Generating model for scenario 1 The following procedure generates samples according to the DAG in Figure 1, left:
1. Draw n observations of (X_1, ..., X_d) independently from P_X.
2. Draw samples of E independently from P_E.
3. Draw the vector a of structure coefficients from some distribution P_a.
4. Set Ŷ := X̂a + Ê.

Generating model for scenario 2 To generate random variables according to the DAG in Figure 1, right, we assume that both variables X and E are generated from the same set of independent sources by applying a random mixing matrix or a random mixing vector, respectively. Given an ℓ-dimensional random vector Z of sources with distribution N(0, I):
1. Choose an ℓ × d mixing matrix M and set X := ZM.
2. Draw c ∈ R^ℓ from some distribution P_c and set E := Zc.
3. Draw the vector a of structure coefficients from some distribution P_a.
4. Set Y := Xa + E.

We then obtain:

Theorem 1 (population and empirical covariances). Let the number ℓ of sources in scenario 2 be equal to the number n of samples in scenario 1 and P_M coincide with the distribution of sample matrices X̂ induced by P_X. Let, moreover, P_c in scenario 2 coincide with the distribution of Ê induced by P_E in scenario 1, and P_a be the same in both scenarios. Then the joint distribution of a, Σ_XX, Σ_XY, Σ_XE in scenario 2 coincides with the joint distribution of a, Σ̂_XX, Σ̂_XY, Σ̂_XE in scenario 1.

Proof. We have Σ̂_XX = X̂ᵀX̂ and Σ_XX = XᵀX = MᵀZᵀZM = MᵀM, where we have used that Z has full rank due to the uncorrelatedness of its components. Likewise, Σ̂_XE = X̂ᵀÊ and Σ_XE = (ZM)ᵀZc = Mᵀc. Further, Σ̂_XY = X̂ᵀX̂a + Σ̂_XE and Σ_XY = XᵀXa + Σ_XE. Then the statement follows from the correspondences M ≡ X̂, c ≡ Ê, a ≡ a.

4Readers not familiar with operator theory may read all our operators as matrices with huge n without losing any essential insights – except for the cost of having to interpret all equalities as approximate equalities. To facilitate this way of reading, we will use (·)ᵀ also for the adjoint of operators in H although (·)* or (·)† is common.
5E ∼ N(0, σ_E²) and thus ê_j ∼ N(0, σ_E²/n), which implies Ê ∼ N(0, σ_E² I/n) and thus Σ̂_XE = X̂ᵀÊ ∼ N(0, X̂ᵀX̂ σ_E²/n) = N(0, Σ̂_XX σ_E²/n).

Theorem 1 provides a canonical way to transfer any Bayesian approach to inferring a from Σ̂_XX, Σ̂_XY in scenario 1 to inferring a from Σ_XX, Σ_XY in scenario 2.
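The correspondence of Theorem 1 can be checked numerically. The following is a minimal sketch (assuming NumPy; the sizes ℓ = 50, d = 5 and the Monte-Carlo sample size are arbitrary choices for illustration): identifying M with X̂ and c with Ê makes the empirical covariances of scenario 1 and the population covariances of scenario 2 literally the same matrices, and the Monte-Carlo part verifies the population identities Σ_XX = MᵀM and Σ_XE = Mᵀc from the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
ell, d = 50, 5  # number of sources in scenario 2 = number of samples in scenario 1

M = rng.standard_normal((ell, d))  # mixing matrix, plays the role of X-hat
c = rng.standard_normal(ell)       # plays the role of E-hat
a = rng.standard_normal(d)         # structure coefficients, same in both scenarios

# Scenario 2 population covariances, from Cov(Z) = I (proof of Theorem 1):
#   Sigma_XX = M^T M,  Sigma_XE = M^T c,  Sigma_XY = Sigma_XX a + Sigma_XE.
Sigma_XX = M.T @ M
Sigma_XE = M.T @ c
Sigma_XY = Sigma_XX @ a + Sigma_XE

# Monte-Carlo check of these identities for X = Z M and E = Z c:
N = 50_000
Z = rng.standard_normal((N, ell))
X, E = Z @ M, Z @ c
Sigma_XX_mc = X.T @ X / N
Sigma_XE_mc = X.T @ E / N

assert np.abs(Sigma_XX_mc - Sigma_XX).max() < 0.2 * np.abs(Sigma_XX).max()
assert np.abs(Sigma_XE_mc - Sigma_XE).max() < 0.2 * np.abs(Sigma_XE).max()
```

Under the identifications M ≡ X̂ and c ≡ Ê, the first block computes exactly X̂ᵀX̂ and X̂ᵀÊ, i.e., the empirical covariances of scenario 1, which is the content of the theorem.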
It is known [16], for instance, that (3) and (4) maximize the posterior p(a|X̂, Ŷ) for the priors

p_ridge(a) ∝ exp(−(1/(2τ²)) ‖a‖₂²),   p_lasso(a) ∝ exp(−(1/(2τ²)) ‖a‖₁),   (9)

respectively, if E ∼ N(0, σ_E²) and λ = σ_E²/τ². Some algebra shows that the only information from X̂ and Ŷ that matters is given by Σ̂_XX and Σ̂_XY, see supplement. Therefore, (3) and (4) also maximize the posterior p(a|Σ̂_XX, Σ̂_XY) and, employing Theorem 1, the population versions of Ridge and Lasso

ã_λ^ridge := argmin_{a′} {λ‖a′‖₂² + ‖Y − Xa′‖²}   (10)
ã_λ^lasso := argmin_{a′} {λ‖a′‖₁ + ‖Y − Xa′‖²},   (11)

maximize p(a|Σ_XX, Σ_XY) after substituting all the priors accordingly.

These population versions, however, make it apparent that we now face the problem that selecting λ by cross-validation would be pointless, since λ = 0 would have the best cross-sample performance. Instead, we would need to know the strength of confounding to choose the optimal λ.

3 Choosing the regularization constant by estimating confounding

The only approaches we are aware of that directly estimate the strength of confounding6 from P_X,Y alone are given by [19, 14]. The first paper considers only one-dimensional confounders, which is complementary to our confounding scenario, while we will use the approach from the second paper because it perfectly matches our scenario 2 in Section 2 with fixed M.

6[17] constructs confounders for linear non-Gaussian models and [18] infers confounders of univariate X, Y subject to the additive noise assumption.
[14] use the slightly stronger assumption that a and c are drawn from N(0, σ_a² I) and N(0, σ_c² I), respectively. We briefly rephrase the method. Using ã in (7) (i.e. the population version of the ordinary least squares solution), they define confounding strength by

β := ‖ã − a‖² / (‖ã − a‖² + ‖a‖²) ∈ [0, 1].   (12)

It attains 0 iff ã coincides with a, and 1 iff a = 0 and the correlations between X and Y are entirely caused by confounding.

The idea to estimate β is that the unregularized regression vector follows the distribution N(0, σ_a² I + σ_c² M⁻¹M⁻ᵀ). This results from

ã = a + X⁻¹E = a + M⁻¹c,

(see proof of Theorem 1 in [14]). Then the quotient σ_c²/σ_a² can be inferred from the direction of â (intuitively: the more â concentrates in small eigenvalue eigenspaces of Σ_XX = MᵀM, the larger is this quotient). Using some approximations that hold for large d, β can be estimated from (Σ_XX, ã). Further, the approximation ‖ã − a‖² + ‖a‖² ≈ ‖ã‖² from [19] yields ‖a‖² ≈ (1 − β)·‖ã‖². Hence, the length of the true causal regression vector a can be estimated from the length of ã. This way, we can adjust λ such that ‖â_λ‖ coincides with the estimated length. Since the estimation is based on a Gaussian (and not a Laplacian) prior for a, it seems more appropriate to combine it with Ridge regression than with Lasso. However, due to known advantages of Lasso7 (e.g. that sparse solutions yield more interpretable results), we also use Lasso.
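The resulting length-matching rule for λ can be sketched as follows. This is a minimal illustration assuming NumPy, using the population Ridge solution (Σ_XX + λI)⁻¹Σ_XY and, for simplicity, the true confounding strength β supplied by an oracle where ConCorr would plug in the estimator of [14]; the dimensions and scales are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
ell = d = 30
M = rng.standard_normal((ell, d))
a = rng.standard_normal(d) * 0.3   # causal vector
c = rng.standard_normal(ell)       # confounding vector

Sigma_XX = M.T @ M                 # population covariance of X = Z M
Sigma_XY = Sigma_XX @ a + M.T @ c  # confounding shifts Sigma_XY by M^T c

# Population OLS: a + M^{-1} c, spoiled by confounding.
a_ols = np.linalg.solve(Sigma_XX, Sigma_XY)

# Oracle confounding strength (in practice: estimated from (Sigma_XX, a_ols)).
beta = np.sum((a_ols - a) ** 2) / (np.sum((a_ols - a) ** 2) + np.sum(a ** 2))
target = np.sqrt((1 - beta) * np.sum(a_ols ** 2))  # estimated ||a||

def ridge(lam):
    """Population Ridge solution for regularization constant lam."""
    return np.linalg.solve(Sigma_XX + lam * np.eye(d), Sigma_XY)

# ||ridge(lam)|| decreases monotonically in lam, so bisection finds the
# lam whose solution has the estimated causal length.
lo, hi = 0.0, 1e6
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if np.linalg.norm(ridge(mid)) > target:
        lo = mid
    else:
        hi = mid
lam = 0.5 * (lo + hi)
assert abs(np.linalg.norm(ridge(lam)) - target) < 1e-6
```

The bisection is justified because the norm of the Ridge solution shrinks strictly monotonically as λ grows, so there is exactly one λ matching the estimated length.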
After all, the qualitative statement that strong confounding amounts to vectors â that tend to concentrate in low eigenvalue subspaces of Σ_XX still holds true as long as c is chosen from an isotropic prior.

Confounding estimation via the algorithm of [14] requires the problematic decision of whether the variables X_j should be rescaled to variance 1. If different X_j refer to different units, there is no other straightforward choice of the scale. It is not recommended, however, to always normalize X_j. If Σ_XX is diagonal, for instance, the method would be entirely spoiled by normalization. The difficulty of deciding whether data should be renormalized beforehand will be inherited by our algorithm. Our confounder correction algorithm reads:

Algorithm ConCorr
1: Input: i.i.d. samples from P(X, Y).
2: Rescale X_j to variance 1 if desired.
3: Compute the empirical covariance matrices Σ̂_XX and Σ̂_XY.
4: Compute the ordinary least squares regression vector â := Σ̂_XX⁻¹ Σ̂_XY.
5: Compute an estimator β̂ for the confounding strength β via the algorithm in [14] from Σ̂_XX and â, and estimate the squared length of a via

‖a‖² ≈ (1 − β̂)‖â‖².   (13)

6: Find λ such that the length of â_λ^{ridge/lasso} coincides with the square root of the right hand side of (13).
7: Compute the Ridge or Lasso regression model using this value of λ.
8: Output: regularized regression vector â_λ^{ridge/lasso}.

4 Experiments

4.1 Simulated data

For some fixed values of d = ℓ = 30, we generate one mixing matrix M in each run by drawing its entries from the standard normal distribution.
In each run we generate n = 1000 instances of the ℓ-dimensional standard normal random vector Z and compute the X values by X = ZM. Afterwards we draw the entries of c and a from N(0, σ_c²) and N(0, σ_a²), respectively, after choosing σ_a and σ_c from the uniform distribution on [0, 1]. Finally, we compute the values of Y via Y = Xa + Zc + E, where E is random noise drawn from N(0, σ_E²) (the parameter σ_E has previously been chosen uniformly at random from [0, 5], which yields quite noisy data). While such a noise term didn't exist in our description of scenario 2, we add it here to also study finite sample effects (without noise, Y depends deterministically on X for ℓ ≤ d).

To assess whether the output â_λ is close to a we define the relative squared error (RSE) of any regression vector a′ by

RSE(a′) := ‖a′ − a‖² / (‖a′ − a‖² + ‖a‖²) ∈ [0, 1].

This definition is convenient because it yields the confounding strength β for the special case where a′ is the ordinary least squares regression vector ã.

7[20] claims, for instance, "If ℓ2 was the norm of the 20th century, then ℓ1 is the norm of the 21st century ... OK, maybe that statement is a bit dramatic, but at least so far, there's been a frenzy of research involving the ℓ1 norm and its sparsity-inducing properties...."

Figure 2: From left to right: RSE versus unregularized RSE (that is, ordinary least squares regression) for ConCorr with Ridge, standard cross-validated Ridge (top, left and right, respectively), and ConCorr with Lasso, standard cross-validated Lasso (bottom, left and right, respectively) for 100 runs (each point representing one run).

Figure 2 shows the results.
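The simulation just described can be sketched as follows (a minimal version assuming NumPy; it generates a single run and evaluates the RSE of ordinary least squares, which approximates the confounding strength β up to finite sample effects):

```python
import numpy as np

def one_run(rng, d=30, ell=30, n=1000):
    """One run of the simulation: confounded data (X, Y) with causal vector a."""
    M = rng.standard_normal((ell, d))          # mixing matrix
    sigma_a, sigma_c = rng.uniform(0, 1, size=2)
    sigma_E = rng.uniform(0, 5)                # noise scale, as in the text
    a = rng.normal(0, sigma_a, d)              # causal vector
    c = rng.normal(0, sigma_c, ell)            # confounding vector
    Z = rng.standard_normal((n, ell))          # sources
    X = Z @ M
    Y = X @ a + Z @ c + rng.normal(0, sigma_E, n)
    return X, Y, a

def rse(a_prime, a):
    """Relative squared error; equals the confounding strength beta for OLS."""
    num = np.sum((a_prime - a) ** 2)
    return num / (num + np.sum(a ** 2))

rng = np.random.default_rng(2)
X, Y, a = one_run(rng)
a_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)  # unregularized baseline
err = rse(a_ols, a)
assert 0.0 <= err <= 1.0
```

A full replication would repeat this for 100 runs and compare the RSE of ConCorr's regularized solution against the unregularized and cross-validated baselines.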
The red and green lines show two different baselines: first, the unregularized error, and second, the error 1/2 obtained by the trivial regression vector 0. The goal is to stay below both baselines. Apart from those two trivial baselines, another natural baseline is regularized regression where λ is chosen by cross-validation, because this would be the default approach for the unconfounded case. We have used leave-one-out CV from the Python package scikit for Ridge and Lasso, respectively.

ConCorr clearly outperforms cross-validation (for both Ridge and Lasso), which shows that cross-validation regularizes too weakly for causal modelling, as expected. One should add, however, that we increased the number of iterations in the λ-optimization to get closer to optimal leave-one-out performance since the default parameters of scikit already resulted in regularizing more strongly than that. (Note that the goal of this paper is not to show that ConCorr outperforms other methods. Instead, we want to argue that for causal models it is often recommended to regularize more strongly than criteria of statistical predictability suggest. If 'early stopping' in common CV algorithms also yields stronger regularization,8 this can be equally helpful for causal inference, although the way ConCorr chooses λ is less arbitrary than just bounding the number of iterations.)

Results for other dimensions were qualitatively comparable if d and ℓ were above 10, with slow improvement for larger dimensions, but note that the relevance of simulations should not be overestimated since inferring confounding critically depends on the distribution of eigenvalues of Σ_XX, which is domain dependent in practical applications.

Figure 3: Results for Ridge (left) and Lasso (right) regression for the data from the optical device.

Figure 4: Confounding where Z influences Y in a linear additive way, while the influence on X is arbitrary.

4.2 Real data

In the absence of better data sets with known ground truth, we considered two sets used in [14], where ground truth was assumed to be known up to some uncertainty discussed there.

Optical device Here, a laptop shows an image with extremely low resolution (in their case 3 × 3 pixels9) captured from a webcam. In front of the screen they mounted a photodiode measuring the light intensity Y, which is mainly influenced by the pixel vector X of the image.

The confounder W is a random voltage controlling two LEDs, one in front of the webcam (and thus influencing X) and the second one in front of the photodiode (thus influencing Y). Since W is also measured, the vector a_{X,W} obtained by regressing Y on (X, W) is causal (no confounders by construction), if one accepts the linearity assumption. Dropping W yielded significant confounding, with β ranging from 0 to 1. We applied ConCorr to X, Y and compared the output with the ground truth. Figure 3 shows the results for Ridge (left) and Lasso (right). The y-axis is the relative squared error achieved by ConCorr, while the x-axis is the cross-validated baseline.

The point (0, 0) happened to be met by three cases, where no improvement was possible.
One can see that in 3 out of the remaining nine cases (note that the point (1, 1) is also met by two cases), ConCorr significantly improved the causal prediction. Fortunately, there is no case where ConCorr is worse than the baseline.

8See also [21] for regularization by early stopping in a different context.
9In order to avoid overfitting issues, we decided in Ref. [14] to only generate low-dimensional data with d around 10.

Taste of wine This data has been extracted from the UCI machine learning repository [22] for the experiments in [14]. The cause X contains 11 ingredients of different sorts of red wine and Y is the taste assigned by human subjects. Regressing Y on X yields a regression vector for which the ingredient alcohol dominates. Since alcohol strongly correlates with some of the other ingredients, dropping it amounts to significant confounding (assuming that the correlations between alcohol and the other ingredients are due to common causes and not due to the influence of alcohol on the others). After normalizing the ingredients10, ConCorr with Ridge and Lasso yielded a relative error of 0.45 and 0.35, respectively, while [14] computed the confounding strength β ≈ 0.8, which means that ConCorr significantly corrects for confounding (we confirmed that CV also yielded errors close to 0.8, which suggests that finite sample effects did not matter for the error).

Although one-dimensional confounding heavily violates our model assumptions, the results of both real data experiments look somewhat positive.

5 Causal learning theory

So far, we have supported causal regularization mainly via transferring Bayesian arguments for regularization from scenario 1 to scenario 2. An alternative perspective on regularization is provided by statistical learning theory [2].
Generalization bounds guarantee that the expected error is unlikely to significantly exceed the empirical error for any regression function f from a not too rich class F. If L(Y, f(X)) denotes some loss function, they guarantee, for instance, that the following inequality holds with a certain probability uniformly for f ∈ F:

E[L(Y, f(X))] ≤ (1/n) ∑_{i=1}^n L(y_i, f(x_i)) + C(F),

where C(F) is some 'capacity term'.

In the same way as these bounds relate empirical loss with expected loss, we will relate the expected (statistical) loss above with the interventional loss

E_do(X)[L(Y, f(X))] := ∫ L(y, f(x)) p(y|do(x)) p(x) dx dy,   (14)

(which quantifies how well f describes the change of Y for interventions on X) via a causal generalization bound of the form

E_do(X)[L(Y, f(X))] ≤ E[L(Y, f(X))] + C(F),

for some capacity term C(F). Note that the type of causal learning theory developed here should not be confused with [23], which considers the generalization error of classifiers that infer cause-effect directions after being trained with multiple data sets of cause-effect pairs.

Figure 4 shows our confounding model, which significantly generalizes our previous models. Z and X are arbitrary random variables of dimensions ℓ and d, respectively. Apart from the graphical structure, we only add the parametric assumption that the influence of Z on Y is linear additive:

Y = Y′ + Zc,   (15)

where c ∈ R^ℓ. The change of Y caused by setting X to x via interventions is given by Pearl's backdoor criterion [5] as

p(y|do(x)) = ∫ p(y|x, z) p(z) dz.   (16)

Note that the observational conditional p(y|x) would be given by replacing p(z) with p(z|x) in (16). Interventional conditionals destroy the dependences between the confounder Z and the 'treatment' variable X by definition of an intervention.
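For intuition, the gap between the two conditionals can be made explicit in a small linear-Gaussian example (a sketch assuming NumPy; the coefficients a = 1, c = 2 and the noise scale are arbitrary illustration choices, and the known linear form of E[Y | x, z] is used directly in the backdoor average):

```python
import numpy as np

rng = np.random.default_rng(3)
a, c = 1.0, 2.0
N = 200_000

# Confounded linear-Gaussian model: Z -> X, Z -> Y, and X -> Y.
Z = rng.standard_normal(N)
X = Z + 0.5 * rng.standard_normal(N)
Y = a * X + c * Z

# Observational slope: regressing Y on X mixes the causal effect with the
# confounding path, giving a + c * Cov(X, Z) / Var(X) = 1 + 2 / 1.25.
slope_obs = np.cov(X, Y)[0, 1] / np.var(X)

# Backdoor adjustment: average E[Y | x, z] = a*x + c*z over p(z), not p(z|x),
# which removes the confounding term and recovers the causal prediction a*x.
x0 = 1.0
y_do = np.mean(a * x0 + c * Z)

assert abs(slope_obs - (a + c * 1.0 / 1.25)) < 0.05
assert abs(y_do - a * x0) < 0.05
```

Averaging over p(z) instead of p(z|x) is exactly the replacement described in the text: the interventional prediction recovers the causal coefficient a, while the observational slope is biased by c·Cov(X, Z)/Var(X).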
The supplement shows that the difference between interventional and observational loss can be concisely phrased in terms of covariances if we choose the loss L(Y, f(X)) = (Y − f(X))²:

Lemma 1 (interventional minus observational loss). Let g(x) := E[Y′|x]. Then

E_do(X)[(Y − f(X))²] − E[(Y − f(X))²] = (Σ_(f−g)(X)Z) c.

For every single f, the vector Σ_(f−g)(X)Z is likely to be almost orthogonal to c if c is randomly drawn from a rotation invariant distribution in R^ℓ. In order to derive statements of this kind that hold uniformly for all functions from a function class F, we introduce the following concept quantifying the capacity of F:

10Note that [14] also used normalization to achieve reasonable estimates of confounding for this case.

Definition 1 (correlation dimension). Let F be some class of functions f : R^d → R. Given the distribution P_X,Z, the correlation dimension d_corr of F is the dimension of the span of

{Σ_f(X)Z | f ∈ F}.

To intuitively understand this concept it is instructive to consider the following immediate bounds:

Lemma 2 (bounds on correlation dimension). The correlation dimension of F is bounded from above by the dimension of the span of F. Moreover, if F consists of linear functions, another upper bound is given by the rank of Σ_XZ.

In the supplement we show:

Theorem 2 (causal generalization bound). Given the causal structure in Figure 4, where Z is ℓ-dimensional with covariance matrix Σ_ZZ = I, influencing X in an arbitrary way. Let the influence of Z on Y be given by a 'random linear combination' of Z with variance V. Explicitly,

Y′ ↦ Y = Y′ + Zc,

where c ∈ R^ℓ is randomly drawn from the sphere of radius √V according to the Haar measure of O(ℓ).
Let F have correlation dimension d_corr and satisfy the bound ‖(f − g)(X)‖_H ≤ b for all f ∈ F (where g(x) := E[Y′|x]). Then, for any β > 1,

E_do(X)[(Y − f(X))²] ≤ E[(Y − f(X))²] + b · √(V · β · (d_corr + 1)/ℓ)

holds uniformly for all f ∈ F with probability 1 − e^(ℓ(1−β+ln β)/2).

Note that Σ_ZZ = I can always be achieved by the 'whitening' transformation Z ↦ (Σ_ZZ)^(−1/2) Z. Normalization is convenient just because it enables a simple way to define a 'random linear combination of Z with variance V', which would be cumbersome to define otherwise.

Theorem 2 basically says that the interventional loss is with high probability close to the expected observational loss whenever the number of sources significantly exceeds the correlation dimension. Note that the confounding effect can nevertheless be large, that is, it would heavily spoil ordinary least squares (i.e. unregularized) regression. Consider, for instance, the case where ℓ = d and X and Z are related by X = Z. Let, moreover, Y′ = Xa for some a ∈ R^d. Then the confounding can have significant impact on the correlations between Y and X due to Y = X(a + c), whenever c is large compared to a. However, whenever F has low correlation dimension, the selection of the function f that optimally fits observational data is not significantly perturbed by the term Xc. This is because Xc 'looks like random noise' since F contains no function that is able to account for 'such a complex correlation'. For the simple case where Σ_XZ has low rank, for instance, the term Zc almost behaves like noise for typical c (w.r.t.
any class $\mathcal{F}$ of linear functions), because the majority of components of $Z$ are uncorrelated with $X$ after an appropriate basis change.

Since $\ell$, $d_{corr}$, and $b$ in Theorem 2 are unobserved, its value lies mostly in the qualitative insights it provides rather than in quantitative bounds of practical use.

6 What do we learn for the general case?

Despite all concerns against our 'hand-tuned' confounder model, we want to stimulate a general discussion about recommending stronger regularization than criteria of statistical predictability suggest whenever one is actually interested in causal models. Our theoretical results suggest that this helps in particular when a type of confounding is expected that, if present, generates complex dependences, which strongly regularized regression would treat as noise. The advice to limit model complexity in order to capture some causal truth could also be relevant for modern deep learning: the interpretability of algorithms for classification or other standard tasks could possibly be improved by having causal features rather than purely predictive ones.

We do not, however, intend to suggest that this simple recommendation solves any of the hard problems of causal inference.

References

[1] B. Schölkopf and A. Smola. Learning with kernels. MIT Press, Cambridge, MA, 2002.

[2] V. Vapnik. Statistical learning theory. John Wiley & Sons, New York, 1998.

[3] D. Rubin. Direct and indirect causal effects via potential outcomes. Scandinavian Journal of Statistics, 31:161–170, 2004.

[4] V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey, and J. Robins. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018.

[5] J. Pearl. Causality. Cambridge University Press, 2000.

[6] G. Imbens and J. Angrist.
Identification and estimation of local average treatment effects. Econometrica, 62(2):467–475, 1994.

[7] J. Peters, P. Bühlmann, and N. Meinshausen. Causal inference using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 78(5):947–1012, 2016.

[8] C. Heinze-Deml, J. Peters, and N. Meinshausen. Invariant causal prediction for nonlinear models. Journal of Causal Inference, 6:20170016, 2017.

[9] C. Heinze-Deml and N. Meinshausen. Conditional variance penalties and domain shift robustness. arXiv:1710.11469, 2017.

[10] Z. Shen, P. Cui, K. Kuang, and B. Li. On image classification: Correlation v.s. causality. arXiv:1708.06656, 2017.

[11] M. Bahadori, K. Chalupka, E. Choi, R. Chen, W. Stewart, and J. Sun. Causal regularization. arXiv:1702.02604, 2017.

[12] D. Rothenhäusler, N. Meinshausen, P. Bühlmann, and J. Peters. Anchor regression: heterogeneous data meets causality. arXiv:1801.06229, 2018.

[13] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning: Data mining, inference, and prediction. Springer-Verlag, New York, NY, 2001.

[14] D. Janzing and B. Schölkopf. Detecting non-causal artifacts in multivariate linear regression models. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 2018.

[15] F. Beutler. The operator theory of the pseudo-inverse I. Bounded operators. Journal of Mathematical Analysis and Applications, 10(3):451–470, 1965.

[16] A. Hoerl and R. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 42(1):80–86, 2000.

[17] P. Hoyer, S. Shimizu, A. Kerminen, and M. Palviainen. Estimation of causal effects using linear non-Gaussian causal models with hidden variables.
International Journal of Approximate Reasoning, 49(2):362–378, 2008.

[18] D. Janzing, J. Peters, J. Mooij, and B. Schölkopf. Identifying latent confounders using additive noise models. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009), pages 249–257. (Eds.) A. Ng and J. Bilmes, AUAI Press, Corvallis, OR, USA, 2009.

[19] D. Janzing and B. Schölkopf. Detecting confounding in multivariate linear models via spectral analysis. Journal of Causal Inference, 6(1), 2017.

[20] R. Tibshirani and L. Wasserman. Course on Statistical Machine Learning, chapter: "Sparsity and the Lasso", 2015. http://www.stat.cmu.edu/~ryantibs/statml/.

[21] G. Raskutti, M. Wainwright, and B. Yu. Early stopping for non-parametric regression: An optimal data-dependent stopping rule. In 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1318–1325, Sep. 2011.

[22] D. Dua and C. Graff. UCI machine learning repository, 2017. http://archive.ics.uci.edu/ml.

[23] D. Lopez-Paz, K. Muandet, B. Schölkopf, and I. Tolstikhin. Towards a learning theory of cause-effect inference. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of JMLR Workshop and Conference Proceedings, pages 1452–1461. JMLR, 2015.