{"title": "Probabilistic latent variable models for distinguishing between cause and effect", "book": "Advances in Neural Information Processing Systems", "page_first": 1687, "page_last": 1695, "abstract": "We propose a novel method for inferring whether X causes Y or vice versa from joint observations of X and Y. The basic idea is to model the observed data using probabilistic latent variable models, which incorporate the effects of unobserved noise. To this end, we consider the hypothetical effect variable to be a function of the hypothetical cause variable and an independent noise term (not necessarily additive). An important novel aspect of our work is that we do not restrict the model class, but instead put general non-parametric priors on this function and on the distribution of the cause. The causal direction can then be inferred by using standard Bayesian model selection. We evaluate our approach on synthetic data and real-world data and report encouraging results.", "full_text": "Probabilistic latent variable models for distinguishing between cause and effect\n\nJoris M. Mooij\nMPI for Biological Cybernetics, T\u00fcbingen, Germany\njoris.mooij@tuebingen.mpg.de\n\nOliver Stegle\nMPI for Biological Cybernetics, T\u00fcbingen, Germany\noliver.stegle@tuebingen.mpg.de\n\nDominik Janzing\nMPI for Biological Cybernetics, T\u00fcbingen, Germany\ndominik.janzing@tuebingen.mpg.de\n\nKun Zhang\nMPI for Biological Cybernetics, T\u00fcbingen, Germany\nkun.zhang@tuebingen.mpg.de\n\nBernhard Sch\u00f6lkopf\nMPI for Biological Cybernetics, T\u00fcbingen, Germany\nbernhard.schoelkopf@tuebingen.mpg.de\n\nAbstract\n\nWe propose a novel method for inferring whether X causes Y or vice versa from joint observations of X and Y. The basic idea is to model the observed data using probabilistic latent variable models, which incorporate the effects of unobserved noise. 
To this end, we consider the hypothetical effect variable to be a function of the hypothetical cause variable and an independent noise term (not necessarily additive). An important novel aspect of our work is that we do not restrict the model class, but instead put general non-parametric priors on this function and on the distribution of the cause. The causal direction can then be inferred by using standard Bayesian model selection. We evaluate our approach on synthetic data and real-world data and report encouraging results.\n\n1 Introduction\n\nThe challenge of inferring whether X causes Y (\u201cX \u2192 Y\u201d) or vice versa (\u201cY \u2192 X\u201d) from joint observations of the pair (X, Y) has recently attracted increasing interest [1, 2, 3, 4, 5, 6, 7, 8]. While the traditional causal discovery methods [9, 10] based on (conditional) independences between variables require at least three observed variables, some recent approaches can deal with pairs of variables by exploiting the complexity of the (conditional) probability distributions. On an intuitive level, the idea is that the factorization of the joint distribution P(cause, effect) into P(cause)P(effect | cause) typically yields models of lower total complexity than the factorization into P(effect)P(cause | effect). Although the notion of \u201ccomplexity\u201d is intuitively appealing, it is not obvious how it should be precisely defined.\nIf complexity is measured in terms of Kolmogorov complexity, this kind of reasoning would be in the spirit of the principle of \u201calgorithmically independent conditionals\u201d [11], which can also be embedded into a general theory of algorithmic-information-based causal discovery [12]. 
The following theorem is implicitly stated in the latter reference (see the remarks before (26) therein):\n\nTheorem 1 Let P(X, Y) be a joint distribution with finite Kolmogorov complexity such that P(X) and P(Y | X) are algorithmically independent, i.e.,\n\nI(P(X) : P(Y | X)) += 0 ,    (1)\n\nwhere += denotes equality up to additive constants. Then:\n\nK(P(X)) + K(P(Y | X)) +\u2264 K(P(Y)) + K(P(X | Y)) .    (2)\n\nThe proof is given by observing that (1) implies that the shortest description of P(X, Y) is given by separate descriptions of P(X) and P(Y | X). It is important to note at this point that the total complexity of the causal model consists of both the complexity of the conditional distribution and of the marginal of the putative cause. However, since Kolmogorov complexity is uncomputable, this does not solve the causal discovery problem in practice. Therefore, other notions of complexity need to be considered.\nThe work of [4] measures complexity in terms of norms in a reproducing kernel Hilbert space, but due to the high computational costs it applies only to cases where one of the variables is binary. The methods [1, 2, 3, 5, 6] define classes of conditionals C and marginal distributions M, and prefer X \u2192 Y whenever P(X) \u2208 M and P(Y | X) \u2208 C but P(Y) \u2209 M or P(X | Y) \u2209 C. This can be interpreted as a (crude) notion of model complexity: all probability distributions inside the class are simple, and those outside the class are complex. However, this a priori restriction to a particular class of models poses serious practical limitations (even though in practice some of these methods \u201csoften\u201d the criteria by, for example, using the p-values of suitable hypothesis tests).\nIn the present work we propose to use a fully non-parametric, Bayesian approach instead. 
The key idea is to define appropriate priors on marginal distributions (of the cause) and on conditional distributions (of the effect given the cause) that both favor distributions of low complexity. To decide upon the most likely causal direction, we can compare the marginal likelihood (also called evidence) of the models corresponding to each of the hypotheses X \u2192 Y and Y \u2192 X. An important novel aspect of our work is that we explicitly treat the \u201cnoise\u201d as a latent variable that summarizes the influence of all other unobserved causes of the effect. The additional key assumption here is the independence of the \u201ccausal mechanism\u201d (the function mapping from the cause and noise to the effect) and the distribution of the cause, an idea that was recently exploited in a different way for the deterministic (noise-free) case [13]. The three main contributions of this work are:\n\n\u2022 to show that causal discovery for the two-variable cause-effect problem can be done without restricting the class of possible causal mechanisms;\n\u2022 to point out the importance of accounting for the complexity of the distribution of the cause, in addition to the complexity of the causal mechanism (as in equation (2));\n\u2022 to show that a Bayesian approach can be used for causal discovery even in the case of two continuous variables, without the need for explicit independence tests.\n\nThe last aspect allows for a straightforward extension of the method to the multi-variable case, the details of which are beyond the scope of this article.1 Apart from discussing the proposed method on a theoretical level, we also evaluate our approach on both simulated and real-world data and report good empirical results.\n\n2 Theory\n\nWe start with a theoretical treatment of how to solve the basic causal discovery task (see Figure 1a).\n\n1For the special case of additive Gaussian noise, the method proposed in [1] would also seem to be a 
valid Bayesian approach to causal discovery with continuous variables. However, that approach is flawed, as it either completely ignores the distribution of the cause, or uses a simple Gaussian marginal distribution for the cause, which may not be realistic (from the paper it is not clear exactly what is proposed). But, as suggested by Theorem 1, and as illustrated by our empirical results, the complexity of the input distribution plays an important role here that cannot be neglected, especially in the two-variable case.\n\n[Figure 1: Observed variables are colored gray, and unobserved variables are white. (a) The basic causal discovery task: which of the two causal models, \u201cX causes Y\u201d (X and E pointing to Y) or \u201cY causes X\u201d (Y and \u1ebc pointing to X), gives the best explanation of the observed data D = {(xi, yi)}, i = 1, . . . , N? (b) More detailed version of the graphical model for \u201cX causes Y\u201d: hyperparameters \u03b8X on the cause xi, hyperparameters \u03b8f on the function f, latent noise ei, and yi = f(xi, ei), for i = 1, . . . , N.]\n\n2.1 Probabilistic latent variable models for causal discovery\n\nFirst, we give a more precise definition of the class of models that we use for representing that X causes Y (\u201cX \u2192 Y\u201d). We assume that the relationship between X and Y is not deterministic, but disturbed by unobserved noise E (effectively, the summary of all other unobserved causes of Y). The situation is depicted in the left-hand part of Figure 1a: X and E both cause Y, but although X and Y are observed, E is not. 
We make the following additional assumptions:\n(A) There are no other causes of Y; in other words, we assume determinism: a function f exists such that\n\nY = f(X, E).\n\nThis function will henceforth be called the causal mechanism.\n(B) X and E have no common causes, i.e., X and E are independent:\n\nX \u22a5\u22a5 E.\n\n(C) The distribution of the cause is \u201cindependent\u201d of the causal mechanism.2\n(D) The noise has a standard-normal distribution: E \u223c N(0, 1).3\nSeveral recent approaches to causal discovery are based on the assumptions (A) and (B) only, but pose one of the following additional restrictions on f:\n\n\u2022 f is linear [2];\n\u2022 additive noise [5], where f(X, E) = F(X) + E for some function F;\n\u2022 the post-nonlinear model [6], where f(X, E) = G(F(X) + E) for some functions F, G.\n\nFor these special cases, it has been shown that a model of the same (restricted) form in the reverse direction Y \u2192 X that induces the same joint distribution on (X, Y) does not exist in general. This asymmetry can be used for inferring the causal direction.\nIn practice, a limited model class may lead to wrong conclusions about the causal direction. For example, when assuming additive noise, it may happen that neither of the two directions provides a sufficiently good fit to the data and hence no decision can be made. Therefore, we would like to drop this kind of assumption that limits the model class. However, assumptions (A) and (B) are not enough on their own: in general, one can always construct a random variable \u1ebc \u223c N(0, 1) and a function f\u0303 : R2 \u2192 R such that\n\nX = f\u0303(Y, \u1ebc),    Y \u22a5\u22a5 \u1ebc    (3)\n\n(for a proof of this statement, see e.g. [14, Theorem 1]).\nIn combination with the other two assumptions (C) and (D), however, one does obtain an asymmetry that can be used to infer the causal direction. 
Note that assumption (C) still requires a suitable mathematical interpretation. One possibility would be to interpret this independence as an algorithmic independence similar to Theorem 1, but then we could not use it in practice. Another interpretation has been used in [13] for the noise-free case (i.e., the deterministic model Y = f(X)). Here, our aim is to deal with the noisy case. For this setting we propose a Bayesian approach, which will be explained in the next subsection.\n\n2This assumption may be violated in biological systems, for example, where the causal mechanisms may have been tuned to their input distributions through evolution.\n3This is not a restriction of the model class, since in general we can write E = g(\u0112) for some function g, with \u0112 \u223c N(0, 1) and f\u0304 = f(\u00b7, g(\u00b7)).\n\n2.2 The Bayesian generative model for X \u2192 Y\n\nThe basic idea is to define non-parametric priors on the causal mechanisms and input distributions that favor functions and distributions of low complexity. Inferring the causal direction then boils down to standard Bayesian model selection, where preference is given to the model with the largest marginal likelihood.\nWe introduce random variables xi (the cause), yi (the effect) and ei (the noise), for i = 1, . . . , N, where N is the number of data points. We use vector notation x = (xi), i = 1, . . . , N, to denote the whole N-tuple of X-values xi, and similarly for y and e. To make a Bayesian model comparison between the two models X \u2192 Y and Y \u2192 X, we need to calculate the marginal likelihoods p(x, y | X \u2192 Y) and p(x, y | Y \u2192 X). Below, we will only consider the model X \u2192 Y and omit this from the notation for brevity. 
The other model Y \u2192 X is completely analogous, and can be obtained by simply interchanging the roles of X and Y.\nThe marginal likelihood for the observed data x, y under the model X \u2192 Y is given by (see also Figure 1b):\n\np(x, y) = p(x) p(y | x) = [ \u222b ( \u220f_{i=1}^N p(xi | \u03b8X) ) p(\u03b8X) d\u03b8X ] [ \u222b ( \u220f_{i=1}^N \u03b4(yi \u2212 f(xi, ei)) pE(ei) ) de p(f | \u03b8f) df p(\u03b8f) d\u03b8f ]    (4)\n\nHere, \u03b8X and \u03b8f parameterize prior distributions of the cause X and the causal mechanism f, respectively. Note how the four assumptions discussed in the previous subsection are incorporated into the model: assumption (A) results in Dirac delta distributions \u03b4(yi \u2212 f(xi, ei)) for each i = 1, . . . , N. Assumption (B) is realized by the a priori independence p(x, e | \u03b8X) = p(x | \u03b8X) pE(e). Assumption (C) is realized as the a priori independence p(f, \u03b8X) = p(f) p(\u03b8X). Assumption (D) is obvious by taking pE(e) := N(e | 0, 1).\n\n2.3 Choosing the priors\n\nIn order to completely specify the model X \u2192 Y, we need to choose particular priors. In this work, we assume that all variables are real numbers (i.e., x, y and e are random variables taking values in RN), and use the following choices (although other choices are also possible):\n\u2022 For the prior distribution of the cause X, we use a Gaussian mixture model\n\np(xi | \u03b8X) = \u2211_{j=1}^k \u03b1j N(xi | \u00b5j, \u03c3j^2)\n\nwith hyperparameters \u03b8X = (k, \u03b11, . . . , \u03b1k, \u00b51, . . . , \u00b5k, \u03c31, . . . , \u03c3k). We put an improper Dirichlet prior (with parameters (\u22121, \u22121, . . . , \u22121)) on the component weights \u03b1 and flat priors on the component parameters \u00b5, \u03c3.\n\u2022 For the prior distribution p(f | \u03b8f) of the causal mechanism f, we take a Gaussian process with zero mean function and squared-exponential covariance function:\n\nk\u03b8f((x, e), (x\u2032, e\u2032)) = \u03bbY^2 exp(\u2212(x \u2212 x\u2032)^2 / (2\u03bbX^2)) exp(\u2212(e \u2212 e\u2032)^2 / (2\u03bbE^2))    (5)\n\nwhere \u03b8f = (\u03bbX, \u03bbY, \u03bbE) are length-scale parameters. The parameter \u03bbY determines the amplitude of typical functions f(x, e), and the length scales \u03bbX and \u03bbE determine how quickly typical functions change depending on x and e, respectively. In the additive noise case, for example, the length scale \u03bbE is large compared to the length scale \u03bbX, as this leads to an almost linear dependence of f on e. We put broad Gamma priors on all length-scale parameters.\n\n2.4 Approximating the evidence\n\nNow that we have fully specified the model X \u2192 Y, the remaining task is to calculate the integral (4) for given observations x, y. 
As the exact calculation seems intractable, we here use a particular approximation of this integral.\n\nThe marginal distribution\n\nFor the model of the distribution of the cause p(x), we use an asymptotic expansion based on the Minimum Message Length principle that yields the following approximation (for details, see [15]):\n\n\u2212 log p(x) \u2248 min_{\u03b8X} [ \u2211_{j=1}^k log(N \u03b1j / 12) + (k/2) log(N/12) + 3k/2 \u2212 log p(x | \u03b8X) ]    (6)\n\nThe conditional distribution\n\nFor the conditional distribution p(y | x) according to the model X \u2192 Y, we start by replacing the integral over the length-scales \u03b8f by a MAP estimate:\n\np(y | x) \u2248 max_{\u03b8f} p(\u03b8f) \u222b \u222b \u03b4(y \u2212 f(x, e)) pE(e) de p(f | \u03b8f) df.\n\nIntegrating over the latent variables e and using the Dirac delta function calculus (where we assume invertibility of the functions fx : e \u21a6 f(x, e) for all x), we obtain:4\n\n\u222b \u222b \u03b4(y \u2212 f(x, e)) pE(e) de p(f | \u03b8f) df = \u222b [ pE(\u03b5(f)) / J(f) ] p(f | \u03b8f) df,    (7)\n\nwhere \u03b5(f) is the (unique) vector satisfying y = f(x, \u03b5), and\n\nJ(f) = | det \u2207e f(x, \u03b5(f)) | = \u220f_{i=1}^N | \u2202f/\u2202e (xi, \u03b5i(f)) |\n\nis the absolute value of the determinant of the Jacobian which results when integrating over the Dirac delta function. The next step would be to integrate over all possible causal mechanisms f (which would be an infinite-dimensional integral). However, this integral again seems intractable, and hence we revert to the following approximation. 
Because of space constraints, we only give a brief sketch of the procedure here.\nLet us suppress the hyperparameters \u03b8f for the moment to simplify notation. The idea is to approximate the infinite-dimensional GP function f by a linear combination over basis functions \u03c6j, parameterized by a weight vector \u03b1 \u2208 RN with a Gaussian prior distribution:\n\nf\u03b1(x, e) = \u2211_{j=1}^N \u03b1j \u03c6j(x, e),    \u03b1 \u223c N(0, 1).\n\nNow, defining the matrix \u03a6ij(x, \u03b5) := \u03c6j(xi, \u03b5i), the relationship y = \u03a6(x, \u03b5)\u03b1 gives a correspondence between \u03b5 and \u03b1 (for fixed x and y), which we assume to be one-to-one. In particular, \u03b1 = \u03a6(x, \u03b5)^{\u22121} y. We can then approximate equation (7) by replacing the integral by a maximum:\n\n\u222b [ pE(\u03b5(\u03b1)) N(\u03b1 | 0, 1) / J(\u03b1) ] d\u03b1 \u2248 max_{\u03b1} [ pE(\u03b5(\u03b1)) N(\u03b1 | 0, 1) / J(\u03b1) ] = max_{\u03b5} [ pE(\u03b5) N(y | 0, \u03a6\u03a6^T) / J(\u03b5) ],    (8)\n\nwhere in the last step we used the one-to-one correspondence between \u03b5 and \u03b1.\n\n4Alternatively, one could first integrate over the causal mechanisms f, and then optimize over the noise values e, similar to what is usually done in GPLVMs [16]. However, we believe that for the purpose of causal discovery, that approach does not work well. The reason is that when optimizing over e, the result is often quite dependent on x, which violates our basic assumption that X \u22a5\u22a5 E. 
The approach we follow here is more closely related to nonlinear ICA, whereas GPLVMs are related to nonlinear PCA.\n\nAfter working out the details and taking the negative logarithm, the final optimization problem becomes:\n\n\u2212 log p(y | x) \u2248 min_{\u03b8f, \u03b5} [ \u2212 log p(\u03b8f) \u2212 log N(\u03b5 | 0, I) \u2212 log N(y | 0, K) + \u2211_{i=1}^N log | Mi\u00b7 K^{\u22121} y | ],    (9)\n\nwhere the four terms play the roles of hyperpriors, noise prior, GP marginal likelihood, and an information term, respectively. Here, the kernel (Gram) matrix K is defined by Kij := k((xi, \u03b5i), (xj, \u03b5j)), where k : R4 \u2192 R is the covariance function (5). It corresponds to \u03a6\u03a6^T in our approximation. The matrix M contains the expected mean derivatives of the GP with respect to e and is defined by Mij := \u2202k/\u2202e ((xi, \u03b5i), (xj, \u03b5j)). Note that the matrices K and M both depend upon \u03b5.\nThe information term in the objective function (involving the partial derivatives \u2202k/\u2202e) may be surprising at first sight. It is necessary, however, to penalize dependences between x and \u03b5: ignoring it would yield an optimal \u03b5 that is heavily dependent on x, violating assumption (B). Interestingly, this term is not present in the additive noise case that is usually considered, as the derivative of the causal mechanism with respect to the noise equals one, and its logarithm therefore vanishes. In the next subsection, we discuss some implementation issues that arise when one attempts to solve (6) and (9) in practice.\n\nImplementation issues\n\nFirst of all, we preprocess the observed data x and y by standardizing them to zero mean and unit variance for numerical reasons: if the length scales become too large, the kernel matrix K becomes difficult to handle numerically.\nWe solve the optimization problem (6) concerning the marginal distribution numerically by means of the algorithm written by Figueiredo and Jain [15]. We use a small but nonzero value (10^{\u22124}) of the regularization parameter.\nThe optimization problem (9) concerning the conditional distribution poses more serious practical problems. Basically, since we approximate a Bayesian integral by an optimization problem, the objective function (9) still needs to be regularized: if one of the partial derivatives \u2202f/\u2202e becomes zero, the objective function diverges. In addition, the kernel matrix corresponding to (5) is extremely ill-conditioned. To deal with these matters, we propose the following ad-hoc solutions:\n\n\u2022 We regularize the numerically ill-behaving logarithm in the last term in (9) by approximating it as log |x| \u2248 log \u221a(x^2 + \u03f5) with \u03f5 \u226a 1.\n\u2022 We add a small amount of N(0, \u03c3^2) uncertainty to each observed yi-value, with \u03c3 \u226a 1. This is equivalent to replacing K by K + \u03c3^2 I, which regularizes the ill-conditioned matrix K. We used \u03c3 = 10^{\u22125}.\n\nFurther, note that in the final optimization problem (9), the unobserved noise values \u03b5 can in fact also be regarded as additional hyperparameters, similar to the GPLVM model [16]. In our setting, this optimization is particularly challenging, as the number of parameters exceeds the number of observations. In particular, for small length scales \u03bbX and \u03bbE the objective function may exhibit a large number of local minima. 
In our implementation we applied the following measures to deal with this issue:\n\n\u2022 We initialize \u03b5 with an additive noise model, by taking the residuals from a standard GP regression as initial values for \u03b5. The reason for doing this is that in an additive noise model, all partial derivatives \u2202f/\u2202e are positive and constant. This initialization effectively leads to a solution that satisfies the invertibility assumption that we made in approximating the evidence.5\n\u2022 We implemented a log barrier that heavily penalizes negative values of \u2202f/\u2202e. This was done to avoid sign flips of these terms that would violate the invertibility assumption. Basically, together with our earlier regularization of the logarithm, we replaced the logarithms log |x| in the last term in (9) by:\n\nlog \u221a((x \u2212 \u03f5)^2 + \u03f5) + A ( log \u221a((x \u2212 \u03f5)^2 + \u03f5) \u2212 log \u221a\u03f5 ) 1_{x \u2264 \u03f5}\n\nwith \u03f5 \u226a 1. We used \u03f5 = 10^{\u22123} and A = 10^2.\n\n5This is related in spirit to the standard initialization of GPLVM models by PCA.\n\nThe resulting optimization problem can be solved using standard numerical optimization methods (we used LBFGS). The source code of our implementation is available as supplementary material and can also be downloaded from http://webdav.tuebingen.mpg.de/causality/.\n\n3 Experiments\n\nTo evaluate the ability of our method to identify causal directions, we have tested our approach on simulated and real-world data. To identify the most probable causal direction, we evaluate the marginal likelihoods corresponding to both possible causal directions (given by combining the results of equations (6) and (9)), choosing the model that assigns higher probability to the observed data. We henceforth refer to this approach as GPI-MML. 
For comparison, we also considered the marginal likelihood using a GP covariance function that is constant with respect to e, i.e., assuming additive noise. For this special case, the noise values e can be integrated out analytically, resulting in standard GP regression. We call this approach AN-MML. We also compare with the method proposed in [1], which also uses an additive noise GP regression for the conditional model, but uses a simple Gaussian model for the input distribution p(x). We refer to this approach as AN-GAUSS.\nWe complemented the marginal likelihood as selection criterion with another possible criterion for causal model selection: the independence of the cause and the estimated noise [5]. Using HSIC [17] as test criterion for independence, this approach can be applied to both the additive noise GP and the more general latent variable approach. As the marginal likelihood does not provide a significance level for the inferred causal direction, we used the ratio of the p-values of HSIC for both causal directions as prediction criterion, preferring the direction with the higher p-value (i.e., with less dependence between the estimated noise and the cause). HSIC as selection criterion applied to the additive or general Gaussian process model will be referred to as AN-HSIC and GPI-HSIC, respectively.\nWe compared these methods with other related methods: IGCI [13], a method that is also based on assumption (C), although designed for the noise-free case; LINGAM [2], which assumes a linear causal mechanism; and PNL, the Post-NonLinear model [6]. 
We evaluated all methods in the \u201cforced decision\u201d scenario, i.e., the only two possible decisions that a method could take were X \u2192 Y and Y \u2192 X (so decisions like \u201cboth models fit the data\u201d or \u201cneither model fits the data\u201d were not possible).\n\nSimulated data  Inspired by the experimental setup in [5], we generated simulated datasets from the model Y = (X + bX^3) e^{\u03b1E} + (1 \u2212 \u03b1)E. Here, the random variables X and E were sampled from a Gaussian distribution with their absolute values raised to the power q, while keeping the original sign. The parameter \u03b1 controls the type of the observation noise, interpolating between purely additive noise (\u03b1 = 0) and purely multiplicative noise (\u03b1 = 1). The coefficient b determines the non-linearity of the true causal model, with b = 0 corresponding to the linear case. Finally, the parameter q controls the non-Gaussianity of the input and noise distributions: q = 1 gives a Gaussian, while q > 1 and q < 1 produce super-Gaussian and sub-Gaussian distributions, respectively.\nFor alternative parameter settings \u03b1, b and q, we generated D = 40 independent datasets. Each dataset consisted of N = 500 samples from the corresponding generative model. Figure 2 shows the accuracy of the considered methods evaluated on these simulated datasets. Encouragingly, GPI appears to be robust with respect to the type of noise, outperforming additive noise models in the full range between additive and multiplicative noise (Figure 2a). Note that the additive noise models actually yield the wrong decision for high values of \u03b1, whereas the GPI methods stay well above chance level. Figure 2b shows accuracies for a linear model and a non-Gaussian noise and input distribution. Figure 2c shows accuracies for a non-linear model with Gaussian additive noise. We observe that GPI-MML performs well in each scenario. 
Further, we observe that AN-GAUSS, the method proposed in [1], only performs well for Gaussian input distributions and additive noise.\n\nFigure 2: Accuracy of recovering the true causal direction in simulated datasets. (a) From additive (\u03b1 = 0) to multiplicative noise (\u03b1 = 1), for q = 1 and b = 1; (b) from sub-Gaussian noise (q < 1), via Gaussian noise (q = 1), to super-Gaussian noise (q > 1), for a linear function (b = 0) with additive noise (\u03b1 = 0); (c) from non-linear (b < 0) to linear (b = 0) to non-linear (b > 0), with additive Gaussian noise (q = 1, \u03b1 = 0). (d) Legend.\n\nTable 1: Accuracy (in percent) of recovering the true causal direction in 68 real-world datasets.\n\nAN-MML: 62 \u00b1 4 | AN-HSIC: 68 \u00b1 1 | AN-GAUSS: 45 \u00b1 3 | GPI-MML: 68 \u00b1 3 | GPI-HSIC: 72 \u00b1 2 | IGCI: 76 \u00b1 1 | LINGAM: 62 \u00b1 3 | PNL: 67 \u00b1 4\n\nResults on cause-effect pairs  Next, we applied the same methods and selection criteria to real-world cause-effect pairs where the true causal direction is known. The data was obtained from http://webdav.tuebingen.mpg.de/cause-effect/. We considered a total of 68 pairs in this dataset, collected from a variety of domains. To reduce computation time, we subsampled the data, using at most N = 500 samples for each cause-effect pair. Table 1 shows the prediction accuracy for the same approaches as in the simulation study, reporting averages and standard deviations estimated from 3 repetitions of the experiments with different subsamples.\n\n4 Conclusions and discussion\n\nWe proposed the first method (to the best of our knowledge) for addressing the challenging task of distinguishing between cause and effect without an a priori restriction to a certain class of models. 
The method compares marginal likelihoods that penalize complex input distributions and causal mechanisms. Moreover, our framework generalizes a number of existing approaches that assume a limited class of possible causal mechanisms. A more extensive evaluation of the performance of our method will have to be performed in future work. Nevertheless, the encouraging results that we have obtained thus far confirm the hypothesis that asymmetries of the joint distribution of cause and effect provide useful hints on the causal direction.\n\nAcknowledgments\n\nWe thank Stefan Harmeling and Hannes Nickisch for fruitful discussions. We would also like to thank the authors of the GPML toolbox [18], which was very useful during the development of our software. OS was supported by a fellowship from the Volkswagen Foundation.\n\nReferences\n\n[1] N. Friedman and I. Nachman. Gaussian process networks. In Proc. of the 16th Annual Conference on Uncertainty in Artificial Intelligence, pages 211\u2013219, 2000.\n[2] S. Shimizu, P. O. Hoyer, A. Hyv\u00e4rinen, and A. J. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003\u20132030, 2006.\n[3] X. Sun, D. Janzing, and B. Sch\u00f6lkopf. Causal inference by choosing graphs with most plausible Markov kernels. In Proceedings of the 9th Int. Symp. on Art. Int. and Math., Fort Lauderdale, Florida, 2006.\n[4] X. Sun, D. Janzing, and B. Sch\u00f6lkopf. Distinguishing between cause and effect via kernel-based complexity measures for conditional probability densities. Neurocomputing, pages 1248\u20131256, 2008.\n[5] P. O. Hoyer, D. Janzing, J. M. Mooij, J. 
Peters, and B. Sch\u00f6lkopf. Nonlinear causal discovery with additive noise models. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21 (NIPS*2008), pages 689\u2013696, 2009.\n[6] K. Zhang and A. Hyv\u00e4rinen. On the identifiability of the post-nonlinear causal model. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, Montreal, Canada, 2009.\n[7] D. Janzing, P. Hoyer, and B. Sch\u00f6lkopf. Telling cause from effect based on high-dimensional observations. In Proceedings of the International Conference on Machine Learning (ICML 2010), pages 479\u2013486, 2010.\n[8] J. M. Mooij and D. Janzing. Distinguishing between cause and effect. In Journal of Machine Learning Research Workshop and Conference Proceedings, volume 6, pages 147\u2013156, 2010.\n[9] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. Springer-Verlag, 1993. (2nd ed. MIT Press, 2000).\n[10] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.\n[11] J. Lemeire and E. Dirkx. Causal models as minimal descriptions of multivariate systems. http://parallel.vub.ac.be/~jan/, 2006.\n[12] D. Janzing and B. Sch\u00f6lkopf. Causal inference using the algorithmic Markov condition. IEEE Transactions on Information Theory, 56(10):5168\u20135194, 2010.\n[13] P. Daniu\u0161is, D. Janzing, J. M. Mooij, J. Zscheischler, B. Steudel, K. Zhang, and B. Sch\u00f6lkopf. Inferring deterministic causal relations. In Proceedings of the 26th Annual Conference on Uncertainty in Artificial Intelligence (UAI-10), 2010.\n[14] A. Hyv\u00e4rinen and P. Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429\u2013439, 1999.\n[15] M. A. T. Figueiredo and A. K. Jain. Unsupervised learning of finite mixture models. 
IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):381\u2013396, March 2002.\n[16] N. D. Lawrence. Gaussian process latent variable models for visualisation of high dimensional data. In Advances in Neural Information Processing Systems 16: Proceedings of the 2003 Conference, page 329. The MIT Press, 2004.\n[17] A. Gretton, R. Herbrich, A. Smola, O. Bousquet, and B. Sch\u00f6lkopf. Kernel methods for measuring independence. Journal of Machine Learning Research, 6:2075\u20132129, 2005.\n[18] C. E. Rasmussen and H. Nickisch. Gaussian Processes for Machine Learning (GPML) Toolbox. Journal of Machine Learning Research, accepted, 2010.\n", "award": [], "sourceid": 1270, "authors": [{"given_name": "Oliver", "family_name": "Stegle", "institution": null}, {"given_name": "Dominik", "family_name": "Janzing", "institution": null}, {"given_name": "Kun", "family_name": "Zhang", "institution": null}, {"given_name": "Joris", "family_name": "Mooij", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}]}