{"title": "On Causal Discovery with Cyclic Additive Noise Models", "book": "Advances in Neural Information Processing Systems", "page_first": 639, "page_last": 647, "abstract": "We study a particular class of cyclic causal models, where each variable is a (possibly nonlinear) function of its parents and additive noise. We prove that the causal graph of such models is generically identifiable in the bivariate, Gaussian-noise case. We also propose a method to learn such models from observational data. In the acyclic case, the method reduces to ordinary regression, but in the more challenging cyclic case, an additional term arises in the loss function, which makes it a special case of nonlinear independent component analysis. We illustrate the proposed method on synthetic data.", "full_text": "On Causal Discovery with\n\nCyclic Additive Noise Models\n\nJoris M. Mooij\n\nRadboud University\n\nNijmegen, The Netherlands\n\nj.mooij@cs.ru.nl\n\nTom Heskes\n\nRadboud University\n\nNijmegen, The Netherlands\nt.heskes@cs.ru.nl\n\nMax Planck Institute for Intelligent Systems\n\nDominik Janzing\n\nT\u00a8ubingen, Germany\n\ndominik.janzing@tuebingen.mpg.de\n\nBernhard Sch\u00a8olkopf\n\nMax Planck Institute for Intelligent Systems\n\nT\u00a8ubingen, Germany\n\nbs@tuebingen.mpg.de\n\nAbstract\n\nWe study a particular class of cyclic causal models, where each variable is a (possi-\nbly nonlinear) function of its parents and additive noise. We prove that the causal\ngraph of such models is generically identi\ufb01able in the bivariate, Gaussian-noise\ncase. We also propose a method to learn such models from observational data. In\nthe acyclic case, the method reduces to ordinary regression, but in the more chal-\nlenging cyclic case, an additional term arises in the loss function, which makes\nit a special case of nonlinear independent component analysis. 
We illustrate the proposed method on synthetic data.

1 Introduction

Causal discovery refers to a special class of statistical and machine learning methods that infer causal relationships between variables from data and prior knowledge [1, 2, 3]. Whereas in machine learning one traditionally concentrates on the task of predicting the values of variables given observations of other variables (for example, in regression or classification tasks), causal discovery focuses on predicting the results of interventions on the system: if one forces one (or more) of the variables into a particular state, how will the probability distribution of the other variables be affected? In this sense, causal discovery concentrates more on inferring the underlying mechanism that generated the data than on modeling the data itself.
An important assumption often made in causal discovery is that the causal mechanism is acyclic, i.e., that no feedback loops are present in the system. For example, if A causes B, and B causes C, then the possibility that C also causes A is usually excluded from the outset. This acyclicity assumption is useful because it simplifies the theoretical analysis and is often also a reasonable assumption to make. Nevertheless, causal cycles are known to occur frequently in biological systems such as gene regulatory networks and protein interaction networks. One would expect that taking such feedback loops into account during data analysis should therefore significantly improve the quality of the inferred causal structure.
Essentially two strategies for dealing with cycles in causal models can be distinguished. The first one is to perform repeated measurements in time, and to infer a causal model for the dynamics of the underlying system.
The fact that causes always precede their effects provides additional prior knowledge that simplifies causal discovery, which is exploited in methods based on Granger causality [4]. Additionally, under certain assumptions, "unrolling" the model in time effectively removes the cycles; this is used in methods such as vector auto-regressive models, which are popular in econometrics, and more generally in Dynamic Bayesian Networks [5] and ordinary differential equation models. However, all these methods need time series data in which the temporal resolution of the measurements is high relative to the characteristic time scale of the feedback loops, in order to rule out instantaneous cyclic relationships. Therefore, a significant practical drawback of this strategy is that obtaining time series data with sufficiently high temporal resolution is often costly, or even impossible, using current technology.
The second strategy is based on the assumption that the system is in equilibrium, and that the data have been gathered from an equilibrium distribution (in the ergodic case, the data can also consist of snapshots of the dynamical system, taken at different points in time). The equilibrium distribution is then used to draw conclusions about the underlying dynamic system, and to predict the results of interventions. This is the approach taken in the current paper. We assume the equilibrium to be described by fixed point equations, where each variable is a function of some other variables, plus noise. This noise models unobserved causes and is assumed to be different for each independent realization of the system, but constant during equilibration. In the simplest case (assuming causal sufficiency), the noise terms are jointly independent.
Together, these assumptions define an interesting model class that forms a direct generalization of Structural Equation Models (SEMs) [2] to the nonlinear (and cyclic) case.
An important novel aspect of our work is that we consider continuous-valued variables and nonlinear causal mechanisms. Although the linear case has been studied in considerable detail already [6, 7, 8], as far as we know, nobody has yet investigated the (more realistic) case of nonlinear causal mechanisms. The basic assumption made in [7] is the so-called Global Directed Markov Condition, which relates (conditional) independences between the variables with the structure of the causal graph. In the cyclic case, however, it is not obvious what the relationship is with the class of nonlinear causal models that we consider here. Therefore, a direct generalization of the algorithm proposed in [7] to the nonlinear case seems difficult. Furthermore, conditional independences only allow identification of the graph up to Markov equivalence classes. For instance, in the bivariate case, one cannot distinguish between X → Y, Y → X, and X ⇄ Y using conditional independences alone. Researchers have also studied cyclic causal models with discrete variables [9, 10]. However, if the measured variables are intrinsically continuous-valued, it is desirable to avoid discretization as a preprocessing step, as this throws away information that is useful for causal discovery.

2 Cyclic additive noise models

Let V be a finite index set. Let (X_i)_{i∈V} be random variables modeling measurable properties of the system of interest and let (E_i)_{i∈V} be other random variables modeling unobservable noise sources. We assume that all random variables take values in the real numbers.
We also assume that the noise variables (E_i)_{i∈V} have densities and are jointly independent:

    p(e_V) = ∏_{i∈V} p_{E_i}(e_i).    (1)

For each i, let pa(i) ⊆ V \ {i} be a set defining the parents of i and f_i : R^{|pa(i)|} → R be a continuously differentiable function. Under certain assumptions (see below), the following equations specify a unique probability distribution on the observable variables (X_i)_{i∈V}:

    X_i = f_i(X_{pa(i)}) + E_i,    i ∈ V.    (2)

Using vector notation, we can write the fixed point equations (2) more compactly as

    X = f(X) + E.    (3)

The probability distribution p(X) induced by these equations is interpreted as the equilibrium distribution of an underlying dynamic system. Each function f_i represents a causal mechanism which determines X_i as a function of its parents X_{pa(i)}, which model its direct causes. The noise variables can be interpreted as other, unobserved causes of their corresponding variables. By assuming independence of the noise variables, we are assuming causal sufficiency, or in other words, absence of confounders (hidden common causes).
We call a model specified by (1) and (2) an additive noise model. With any additive noise model we can associate a directed graph with vertices V and directed edges i → j if i ∈ pa(j), i.e., from causes to their direct effects.1 If this graph is acyclic, we call the model an acyclic additive noise model. If the graph contains (directed) cycles, we call the model a cyclic additive noise model.2

Interpretation in the cyclic case

Note that the presence of cycles increases the complexity of the model, because the equations (2) become recursive. The interpretation of these equations also becomes less straightforward in the cyclic case. In general, for a fixed noise value E = e, the fixed point equations x = f(x) + e can have any number of fixed points between 0 and ∞.
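As a concrete illustration, the fixed point equations (2) can be simulated numerically. The following is a minimal sketch for the bivariate case, assuming the contraction condition of Lemma 1 below so that the iteration converges to a unique fixed point; the function and parameter names are our own choices, not part of the paper.

```python
import numpy as np

def sample_cyclic_anm(f_X, f_Y, sigma_X, sigma_Y, n=500, iters=200, rng=None):
    """Draw n samples from a bivariate cyclic additive noise model
    X = f_X(Y) + E_X, Y = f_Y(X) + E_Y with Gaussian noise, by iterating
    the fixed point equations with the noise held fixed (equilibration).
    Assumes sup |f_X'(y) f_Y'(x)| < 1 so the iteration is a contraction."""
    rng = np.random.default_rng(rng)
    e_x = sigma_X * rng.standard_normal(n)
    e_y = sigma_Y * rng.standard_normal(n)
    x = np.zeros(n)
    y = np.zeros(n)
    for _ in range(iters):  # equilibrate; noise is constant during equilibration
        x = f_X(y) + e_x
        y = f_Y(x) + e_y
    return x, y

# Example: a nonlinear cyclic model satisfying the contraction condition.
x, y = sample_cyclic_anm(lambda y: 0.9 * np.tanh(y),
                         lambda x: 0.9 * np.cos(x), 1.0, 1.0)
```

Each sample corresponds to one independent realization of the noise, equilibrated separately, which matches the sampling interpretation described in the text.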
For simplicity, however, we will assume that for each noise value e there exists a unique fixed point x = F(e). Later, in Section 3.1, we will give a sufficient condition for this to be the case. Under this assumption, the joint probability distribution p(E) induces a unique joint probability distribution p(X).
This interpretation also shows a way to sample from the joint distribution: first, one samples a joint value of the noise e; then, one iterates the fixed point equations (2) to find the corresponding fixed point x = F(e). This yields one sample x. Different independent samples are obtained by repeating this process. Thus, the equations can be interpreted as describing the equilibrium distribution of a dynamic system in the presence of noise which is constant during equilibration, but differs across measurements (data points). If in reality the noise does change over time, but on a slow time scale relative to the time scale of the equilibration, then this model can be considered a first-order approximation.

The induced density

Although the mapping F : e ↦ x that maps noise values to their corresponding fixed points under (3) is nontrivial in most cases, a crucial observation is that its inverse G = F⁻¹ = I − f has a very simple form (here, I is the identity mapping).
Under the change of variables e ↦ x, the transformation rule for densities reads:

    p_X(x) = p_E(x − f(x)) |I − ∇f(x)| = |I − ∇f(x)| ∏_{i∈V} p_{E_i}(x_i − f_i(x_{pa(i)}))    (4)

where ∇f(x) is the Jacobian of f evaluated at x and |·| denotes the absolute value of the determinant of a matrix.
Note that although sampling from the distribution p_X is elaborate (as it typically involves many iterations of the fixed point equations), the corresponding density can easily be expressed analytically in terms of the noise distributions and partial derivatives of the causal mechanisms. Later we will see that the fact that the model has a simple structure in the "backwards" direction allows us to efficiently learn it from data, which may be surprising considering the fact that the model is complex in the "forward" direction.

Causal interpretation

An additive noise model can be used for ordinary prediction tasks (i.e., predicting some of the variables conditioned on observations of some other variables), but it can also be used to predict the results of interventions: if we force some of the variables to certain values, what will happen to the others? Such an intervention can be modeled by replacing the equations for the intervened variables by simple equations X_i = C_i, with C_i the value set by the intervention. This procedure results in another additive noise model. If the altered fixed point equations induce a unique probability distribution on X, then this is the predicted distribution of X under the intervention. In this sense, additive noise models are given a causal interpretation.
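The density formula (4) is easy to implement directly. The sketch below evaluates the bivariate Gaussian-noise case, where the Jacobian determinant reduces to 1 − f_X'(y) f_Y'(x); the function names and the particular mechanisms are illustrative assumptions, not from the paper.

```python
import numpy as np

def log_density_cyclic_anm(x, y, f_X, df_X, f_Y, df_Y, sigma_X, sigma_Y):
    """Log-density of eq. (4) for two variables with Gaussian noise:
    log p(x, y) = log p_EX(x - f_X(y)) + log p_EY(y - f_Y(x))
                  + log |1 - f_X'(y) f_Y'(x)|,
    since det(I - grad f) = 1 - f_X'(y) f_Y'(x) for the 2x2 case."""
    def log_gauss(e, s):
        return -0.5 * np.log(2 * np.pi * s**2) - e**2 / (2 * s**2)
    return (log_gauss(x - f_X(y), sigma_X)
            + log_gauss(y - f_Y(x), sigma_Y)
            + np.log(np.abs(1.0 - df_X(y) * df_Y(x))))
```

Because G = I − f is an exact change of variables (under the contraction condition), this density is properly normalized even though sampling requires iterating the fixed point equations.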
Hereafter, we will therefore refer to the graph associated with the additive noise model as the causal graph.

3 Identifiability

An interesting and important question for causal discovery is under which conditions the causal graph is identifiable given only the joint distribution p(X). Lacerda et al. [8] have shown that under

1 If some causal mechanism f_j does not depend on one of its parents i ∈ pa(j), i.e., if ∂f_j/∂X_i (X_{pa(j)}) = 0 everywhere, then we discard the edge i → j.
2 Cyclic additive noise models are also known as "non-recursive" (nonlinear) structural equation models, whereas the acyclic versions are known as "recursive" (nonlinear) SEMs. This terminology is common usage but confusing, as it is precisely in the cyclic case that one needs a recursive procedure to calculate the solutions of equations (2), and not the other way around.

the additional assumption of linearity (i.e., all functions f_i are linear), the causal graph is completely identifiable if at most one of the noise sources has a Gaussian distribution. The proof is based on Independent Component Analysis. Our aim here is to deal with the more difficult nonlinear case. In this work, we focus our attention on the bivariate case. Our main result, Theorem 1, can be seen as an extension of the identifiability result for acyclic nonlinear additive noise models derived in [11], although we make the additional simplifying assumption that the noise variables are Gaussian. We believe that similar identifiability results can be derived in the multivariate case (|V| > 2) and for non-Gaussian noise distributions.
However, proving such results seems to be significantly harder as the calculations become very cumbersome, and we leave this as an open problem for future work.

3.1 The bivariate case

Before we state our identifiability result, we first give a sufficient condition for the existence of a unique equilibrium distribution in the bivariate case.

Lemma 1. Consider the fixed point equations x = f_X(y) + c_X, y = f_Y(x) + c_Y, parameterized by constants (c_X, c_Y). If sup_{x,y} |f_X'(y) f_Y'(x)| = r < 1, then for any (c_X, c_Y), the fixed point equations converge to a unique fixed point that does not depend on the initial conditions.

Proof. Consider the mapping defined by applying the fixed point equations twice. Its Jacobian is diagonal and the absolute values of the entries are bounded from above by r < 1 under the assumption above. According to Banach's fixed point theorem, it is a contraction (e.g., with respect to the Euclidean norm on R²) and therefore has a unique fixed point. Independent of the initial conditions, under repeated application of this mapping, one converges to this fixed point. Lemma 1 in the supplement then shows that the same conclusion must hold for the mapping that applies the fixed point equations only once. □

This lemma provides a sufficient condition for an additive noise model to be well-defined in the bivariate case. Also, the result of any intervention will be well-defined under this condition.
Now suppose we are given the joint distribution p_{X,Y} of two real-valued random variables X, Y which is induced by an additive noise model. The question is whether we can identify the causal graph corresponding to the true model out of the four possibilities (no edges, X → Y, Y → X, X ⇄ Y). Hoyer et al.
[11] have shown that if one excludes the cyclic case X ⇄ Y, then in the generic case, the causal structure is identifiable. Our aim is to prove a stronger identifiability result where the cyclic case is not excluded a priori. As a first step in this direction, we consider here the case of Gaussian noise.

Theorem 1. Let p_{X,Y} be induced by two additive Gaussian-noise models, M and M̃:

    (M):  X = f_X(Y) + E_X,  Y = f_Y(X) + E_Y,  E_X ⊥⊥ E_Y,  E_X ~ N(0, α_X⁻¹),  E_Y ~ N(0, α_Y⁻¹)
    (M̃):  X = f̃_X(Y) + Ẽ_X,  Y = f̃_Y(X) + Ẽ_Y,  Ẽ_X ⊥⊥ Ẽ_Y,  Ẽ_X ~ N(0, α̃_X⁻¹),  Ẽ_Y ~ N(0, α̃_Y⁻¹)

Assume that sup_{x,y} |f_X'(y) f_Y'(x)| < 1 and, similarly, sup_{x,y} |f̃_X'(y) f̃_Y'(x)| < 1. Then either the two corresponding causal graphs coincide, G_M = G_M̃, i.e.,

    f_X is constant ⟺ f̃_X is constant,  and  f_Y is constant ⟺ f̃_Y is constant,

or the models are of the following very special form:

• either: f_X, f̃_X, f_Y, f̃_Y are all affine,
• or: one model (say M̃) is acyclic, the other is cyclic, and the following equations hold:

    f_Y(x) = Cx + D with C ≠ 0,   f_X(y) = (α̃_X/α_X) f̃_X(y) − (α_Y/α_X) C y + (α_Y/α_X) C D,   f̃_Y(x) = D̃,    (5)

and f̃_X satisfies the following differential equation:3

    α̃_X f̃_X(y) f̃_X'(y) − (1/α_X) (α̃_X f̃_X(y) − α_Y C y + α_Y C D) (α̃_X f̃_X'(y) − α_Y C)
        = α_Y (y − D) − α̃_Y (y − D̃) + C f̃_X''(y) / (1 − C f̃_X'(y)).    (6)

3 Or similar equations with the roles of X and Y reversed.

We will only sketch the proof here, and refer to the supplementary material for the details. What the theorem shows is that, apart from a small class of exceptions, bivariate additive Gaussian-noise models induce densities that allow a perfect reconstruction of the causal graph. In a certain sense, the situation can be seen as similar to the well-known "faithfulness assumption" [3]: the latter assumption is often made in order to exclude the highly special cases of causal models which would spoil identifiability of the Markov equivalence class. The usual reasoning is that these cases are so rare that they can be ignored in practice. A similar reasoning can be made in our case.
Although our main identifiability result, Theorem 1, may seem rather restricted as it only considers two variables, it may be possible to use this two-variable identifiability result as a key building block for deriving more general identifiability results for many variables, similar to how [12] generalized the (acyclic) identifiability result of [11] from two to many variables.

3.2 Proof sketch

Writing π···(···) := log p···(···) for logarithms of densities, we re-express (4) for the bivariate case:

    π_{X,Y}(x, y) = π_{E_X}(x − f_X(y)) + π_{E_Y}(y − f_Y(x)) + log |1 − f_X'(y) f_Y'(x)|    (7)

Partial differentiation with respect to x and y yields the following equation, which will be the equation on which we base our identifiability proof:

    ∂²π_{X,Y}/∂x∂y = −π''_{E_X}(x − f_X(y)) f_X'(y) − π''_{E_Y}(y − f_Y(x)) f_Y'(x) − f_X''(y) f_Y''(x) / (1 − f_X'(y) f_Y'(x))²    (8)

We will now
specialize to Gaussian noise and give a sketch of how to prove identifiability of the causal graph. We assume E_X ~ N(0, α_X⁻¹) and E_Y ~ N(0, α_Y⁻¹), where α_X = σ_X⁻² and α_Y = σ_Y⁻² are the precisions (inverse variances) of the Gaussian noise variables. Equation (8) simplifies to:

    ∂²π_{X,Y}/∂x∂y = α_X f_X'(y) + α_Y f_Y'(x) − f_X''(y) f_Y''(x) / (1 − f_X'(y) f_Y'(x))²    (9)

A similar equation holds for the other model:

    ∂²π_{X,Y}/∂x∂y = α̃_X f̃_X'(y) + α̃_Y f̃_Y'(x) − f̃_X''(y) f̃_Y''(x) / (1 − f̃_X'(y) f̃_Y'(x))²    (10)

The general idea of the identifiability proof is as follows. We consider two cases: (i) model M̃ has zero "arrows", i.e., f̃_X' = 0 and f̃_Y' = 0; (ii) model M̃ has one "arrow", say, f̃_X' ≠ 0, f̃_Y' = 0. By equating the r.h.s.'s of (9) and (10), we show in both cases that generically (i.e., except for very special choices of the model parameters), model M must equal model M̃. This then implies that the causal graphs of M and M̃ must be the same in the generic case.
For example, in the first case, because f̃_X' = f̃_Y' = 0, we obtain the following equation:

    0 = (α_X f_X'(y) + α_Y f_Y'(x)) (1 − f_X'(y) f_Y'(x))² − f_X''(y) f_Y''(x)    (11)

This is a nonlinear partial differential equation in φ(x) := f_Y'(x) and ψ(y) := f_X'(y).
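The identity (9) can be checked numerically for any concrete model. The sketch below compares a finite-difference estimate of the mixed partial of the log-density (7) against the right-hand side of (9), for an example model of our own choosing with unit precisions (α_X = α_Y = 1); all names here are illustrative assumptions.

```python
import numpy as np

# Hypothetical example mechanisms satisfying the contraction condition.
f_X   = lambda y: 0.5 * np.tanh(y)
df_X  = lambda y: 0.5 * (1 - np.tanh(y)**2)
d2f_X = lambda y: -np.tanh(y) * (1 - np.tanh(y)**2)  # derivative of df_X
f_Y   = lambda x: 0.5 * np.sin(x)
df_Y  = lambda x: 0.5 * np.cos(x)
d2f_Y = lambda x: -0.5 * np.sin(x)

def log_p(x, y):
    """Eq. (7) with standard-Gaussian noise (alpha = 1), up to a constant."""
    return (-0.5 * (x - f_X(y))**2 - 0.5 * (y - f_Y(x))**2
            + np.log(np.abs(1 - df_X(y) * df_Y(x))))

def mixed_partial(x, y, h=1e-4):
    """Central finite-difference estimate of d^2 log p / dx dy."""
    return (log_p(x + h, y + h) - log_p(x + h, y - h)
            - log_p(x - h, y + h) + log_p(x - h, y - h)) / (4 * h * h)

def rhs_eq9(x, y):
    """Right-hand side of eq. (9) with alpha_X = alpha_Y = 1."""
    return df_X(y) + df_Y(x) - d2f_X(y) * d2f_Y(x) / (1 - df_X(y) * df_Y(x))**2
```

The two quantities agree up to finite-difference error, confirming the differentiation step from (7) to (9).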
Inspired by the identifiability proof in [13], we adopt the solution method from [14, Supplement S.4.3], which gives a general way of solving functional-differential equations of the form

    Φ₁(x)Ψ₁(y) + Φ₂(x)Ψ₂(y) + ··· + Φ_k(x)Ψ_k(y) = 0    (12)

where the functionals Φ_i(x) and Ψ_i(y) depend only on x and y, respectively:

    Φ_i(x) = Φ_i(x, φ, φ'),   Ψ_i(y) = Ψ_i(y, ψ, ψ').

The idea behind the solution method is to repeatedly divide by one of the functionals and differentiate with respect to the corresponding variable. For example, dividing by Φ₁ and differentiating with respect to x, we obtain:

    (∂/∂x)(Φ₂(x)/Φ₁(x)) Ψ₂(y) + ··· + (∂/∂x)(Φ_k(x)/Φ₁(x)) Ψ_k(y) = 0,

which is again of the form (12), but with one fewer term. This process is repeated until an equation of the form (12) with only 2 terms remains. That equation is easily solved, as its general solution can be written as

    C₁Φ₁(x) + C₂Φ₂(x) = 0,   C₂Ψ₁(y) − C₁Ψ₂(y) = 0

for arbitrary constants C₁, C₂ ∈ R; there are also two degenerate solutions, Φ₁ = Φ₂ = 0 (with Ψ₁, Ψ₂ arbitrary) and Ψ₁ = Ψ₂ = 0 (with Φ₁, Φ₂ arbitrary). These equations, which are now ordinary differential equations, can be solved by standard methods.
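The divide-and-differentiate reduction can be demonstrated symbolically. The toy instance below (our own choice, not from the paper) takes k = 3 with Φ_i(x) = (1, x, x²) and undetermined Ψ_i; dividing by Φ₁ = 1 and differentiating in x eliminates one term at each step.

```python
import sympy as sp

x, y = sp.symbols('x y')
psi1, psi2, psi3 = sp.symbols('psi1 psi2 psi3', cls=sp.Function)

# A concrete instance of eq. (12) with k = 3 and Phi_i(x) = (1, x, x**2):
expr = psi1(y) + x * psi2(y) + x**2 * psi3(y)

# Divide by Phi_1 = 1 and differentiate w.r.t. x: the psi1 term drops out.
reduced = sp.diff(expr, x)      # psi2(y) + 2*x*psi3(y)

# Repeat once more: the psi2 term drops out as well.
reduced2 = sp.diff(reduced, x)  # 2*psi3(y)
```

If the original equation is to hold for all x and y, the final one-term equation forces Ψ₃ = 0, and back-substitution then constrains Ψ₂ and Ψ₁, mirroring how the method pins down φ and ψ in the proof.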
The solutions are then substituted into the original equation (12) in order to remove redundant constants of integration. Applying this method to the case at hand, one obtains equations for f_X' and f_Y'. Solving these equations, one finds that either M = M̃, or that f_X' = f̃_X' = f_Y' = f̃_Y' = 0. In the second case (where M̃ has one arrow), the equations show that either M = M̃, or the model parameters must satisfy equations (5) and (6).

4 Learning additive noise models from observational data

In this section, we propose a method to learn an additive noise model from a finite data set D := {x⁽ⁿ⁾}_{n=1}^N. We will only describe the bivariate case in detail, although the method can be extended to more than two variables in a straightforward way.
We first consider how we can learn the causal mechanisms {f_i}_{i∈V} for a fixed causal structure. This can be done efficiently by a MAP estimate with respect to (the parameters of) the causal mechanisms. Using (4), the MAP problem can be written as:

    argmax_{f̂}  p(f̂) ∏_{n=1}^N ( |I − ∇f̂(x⁽ⁿ⁾)| ∏_{i∈V} p_{E_i}(x_i⁽ⁿ⁾ − f̂_i(x_{pa(i)}⁽ⁿ⁾)) )    (13)

where p(f̂) specifies the prior distribution over the causal mechanisms. Note the presence of the determinant; in the acyclic case, this term becomes 1, and the method reduces to standard regression. In the cyclic case, however, the determinant is necessary in order to penalize dependencies between the estimated noise variables. One can consider this a special case of nonlinear independent component analysis, as the MAP estimate (13) can also be interpreted as the minimizer of the mutual information between the noise variables.
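To make the role of the determinant term concrete, here is a minimal sketch of the negative log-likelihood from (13) for a toy parametric family (our own choice: tanh mechanisms with unit-variance Gaussian noise and a flat prior); it is not the paper's nonparametric estimator.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(theta, x, y):
    """Negative log-likelihood for a toy parametric cyclic model
    X = a*tanh(Y) + E_X, Y = b*tanh(X) + E_Y, unit-variance Gaussian noise.
    Includes the log-determinant term of eq. (13); without it, the objective
    would decouple into two independent least-squares regressions."""
    a, b = theta
    ex = x - a * np.tanh(y)                 # estimated noise E_X
    ey = y - b * np.tanh(x)                 # estimated noise E_Y
    dfx = a * (1 - np.tanh(y)**2)           # f_X'(y)
    dfy = b * (1 - np.tanh(x)**2)           # f_Y'(x)
    logdet = np.log(np.abs(1 - dfx * dfy))  # |det(I - grad f)| for 2 variables
    return 0.5 * np.sum(ex**2) + 0.5 * np.sum(ey**2) - np.sum(logdet)
```

Minimizing this objective over (a, b), e.g. with `scipy.optimize.minimize`, recovers the mechanisms of data generated from such a model, illustrating the MAP estimation step for a fixed (cyclic) causal structure.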
If the estimated functions lead to noise estimates Ê_i = X_i − f̂_i(X_{pa(i)}) which are mutually independent according to some independence test, then we accept the model.
One can try all possible causal graph structures and test which ones fit the data. The models that lead to independent estimated noise values are possible causal explanations of the data. If multiple models with different causal graphs lead to independent estimated noise values, we prefer models with fewer arrows in the graph.4 If the number of data points is large enough, Theorem 1 suggests that for two variables with Gaussian noise, in the generic case, a unique causal structure will be identified in this way. For more than two variables, and for other noise distributions, the method can still be applied, but we do not know whether (in general and asymptotically) there will be a unique causal structure that explains the data.
We now work out the bivariate Gaussian case in more detail. The prior for the functions f̂ can be chosen arbitrarily, for example using some parametric approach. Here, we will use a nonparametric approach using Gaussian processes.
The negative log-likelihood L := −ln p(D | f̂_X, f̂_Y) can be written in terms of the observational data D := {(x⁽ⁿ⁾, y⁽ⁿ⁾)}_{n=1}^N as:

    L = −∑_{i=1}^N π_{E_X}(x⁽ⁱ⁾ − f̂_X(y⁽ⁱ⁾)) − ∑_{i=1}^N π_{E_Y}(y⁽ⁱ⁾ − f̂_Y(x⁽ⁱ⁾)) − ∑_{i=1}^N log |1 − f̂_Y'(x⁽ⁱ⁾) f̂_X'(y⁽ⁱ⁾)|.

Assuming Gaussian noise E_X ~ N(0, σ_X²), E_Y ~ N(0, σ_Y²) and using Gaussian Process priors for the causal mechanisms f_X and f_Y, i.e., taking x̂ := f_X(y) ~ N(0, K_X(y)) and ŷ := f_Y(x) ~ N(0, K_Y(x)), where K_X is the Gram matrix with entries K_{X;ij} = k_X(y⁽ⁱ⁾, y⁽ʲ⁾) for some covariance function k_X : R² → R, and similarly for K_Y, we obtain:

    min_{x̂,ŷ} L = N log σ_X + N log σ_Y + ½ log|K_X| + ½ log|K_Y|
        + min_{x̂,ŷ} [ 1/(2σ_X²) ‖x − x̂‖² + 1/(2σ_Y²) ‖y − ŷ‖² + ½ x̂ᵀK_X⁻¹x̂ + ½ ŷᵀK_Y⁻¹ŷ
        − ∑_{i=1}^N log | 1 − ( (∂k_Y/∂x)(x⁽ⁱ⁾, x) K_Y⁻¹ ŷ ) ( (∂k_X/∂y)(y⁽ⁱ⁾, y) K_X⁻¹ x̂ ) | ],

where we used the expected derivatives of the Gaussian Processes to approximate the determinant term. In our experiments, we used Gaussian covariance kernels

    k_X(y, y') = λ_X² exp( −(y − y')² / (2κ_X²) ) + ρ δ_{y,y'},

and likewise for k_Y. Note that we added a small constant (ρ = 10⁻⁴) to the diagonal to allow for small, independent measurement errors or rounding errors (which occur because the Gram matrices are very ill-conditioned). The optimization problem can be solved numerically, e.g., using standard methods such as conjugate gradients or L-BFGS. We optimize simultaneously with respect to the noise values x̂, ŷ and the hyperparameters log σ_X, log κ_X, log λ_X, log σ_Y, log κ_Y, log λ_Y.

4 Note that if a certain model leads to independent noise terms, then adding more arrows will still allow independent noise terms, by setting some functions to 0; see also Figure 1 below.

5 Experiments

We illustrate the method on several synthetic data sets in Figure 1. Each row shows a data set with N = 500 data points. Because of space constraints, we only show the learned cyclic additive noise models, omitting the acyclic ones. In each case, we calculated the p-value for independence of the two noise variables using the HSIC (Hilbert-Schmidt Independence Criterion) test [15]; for p-values substantially above 0 (say, larger than 1%), we do not reject the null hypothesis of independence and hence accept the model as a possible causal explanation of the data. This happens in four out of six cases; the exceptions are the cases displayed in rows 1b and 3b, which are rejected.
Rows 1a and 1b concern the same data, generated from a nonlinear acyclic model. We found two different local minima, one of which is accepted (the one more closely resembling the true model), and one of which is rejected. Even though we learned a causal model with cyclic structure, in the accepted solution, one of the learned causal mechanisms becomes (almost) constant. Rows 3a and 3b show again two different solutions for the same data, now generated from a nonlinear cyclic model.
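An independence check of the kind used above can be sketched as follows. This is a simplified stand-in for the HSIC test of [15], using a biased HSIC estimate with Gaussian kernels and a permutation p-value; the bandwidth and permutation count are our own illustrative choices.

```python
import numpy as np

def hsic(a, b, sigma=1.0):
    """Biased HSIC estimate with Gaussian kernels: trace(K H L H) / n^2,
    where H is the centering matrix. Larger values indicate dependence."""
    n = len(a)
    K = np.exp(-(a[:, None] - a[None, :])**2 / (2 * sigma**2))
    L = np.exp(-(b[:, None] - b[None, :])**2 / (2 * sigma**2))
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n**2

def hsic_pvalue(a, b, n_perm=200, rng=None):
    """Permutation p-value: shuffling b simulates the independent null."""
    rng = np.random.default_rng(rng)
    stat = hsic(a, b)
    null = [hsic(a, rng.permutation(b)) for _ in range(n_perm)]
    return float(np.mean([s >= stat for s in null]))
```

Applied to the estimated noise values (ê_X, ê_Y) of a candidate model, a small p-value leads to rejecting that model as a causal explanation of the data.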
Note that the solution in row 3b could be preferred over that in row 3a based upon its likelihood, but it is actually rejected because its estimated noises are highly dependent. Row 4 shows data from a linear, cyclic model, where the ratio of the noise sources equals the ratio of the slopes of the causal mechanisms. This makes this linear model part of the special class of unidentifiable additive noise models. In this case, the MAP estimates for the causal mechanisms are quite different from the true ones.

6 Discussion and Conclusion

We have studied a particular class of cyclic causal models given by nonlinear SEMs with additive noise. We have discussed how these models can be interpreted as describing the equilibrium distribution of a dynamic system with noise that is constant in time. We have looked in detail at the bivariate Gaussian-noise case and shown generic identifiability of the causal graph. We have also proposed a method to learn such models from observational data and illustrated it on synthetic data.
Even though we have shown that in this "laboratory setting" the method can be made to work on purely observational data when enough data is available, it relies on several assumptions that make it challenging to apply in real-world scenarios. Also, from our experiments, it appears that the method often finds other solutions (local minima of the log-likelihood) which differ from the expected true data-generating model but which have dependent estimated noises.
Thus there is ample opportunity for future work: for example, improving the robustness of the learning method, and generalizing the results to many variables and non-Gaussian noise.

Figure 1: From left to right: observed data pairs (x, y); true (blue) and estimated (red) functions f_Y and f_X, respectively; estimated noise values (e_X, e_Y); and reconstructed data (x, y) based on the estimated noise.
Rows 1a and 1b show two different solutions (local minima of the negative log-likelihood) for the same data, as do rows 3a and 3b. The true models used to generate the data, the p-values for independence of the estimated residuals, and the negative log-likelihoods L are, from top to bottom:

#    Identifiable?  Linear?  Cyclic?  fY(x)          fX(y)          σX    σY    p(EX ⊥⊥ EY)   L
1a   +              −        −        0.9 tanh(2x)   0              1     0.5   0.76          −2.56 × 10³
1b   +              −        −        0.9 tanh(2x)   0              1     0.5   7 × 10⁻³      −2.51 × 10³
2    +              −        −        0              0.9 tanh(2y)   0.5   1     0.74          −2.57 × 10³
3a   +              −        +        0.9 cos(x)     0.9 tanh(y)    1     1     0.78          −2.24 × 10³
3b   +              −        +        0.9 cos(x)     0.9 tanh(y)    1     1     3 × 10⁻⁵⁸     −2.26 × 10³
4    −              +        +        −0.4x          0.8y           0.5   1     0.61          −2.73 × 10³

Acknowledgments

We thank Stefan Maubach and Wieb Bosma for their help with the computer algebra. DJ was supported by DFG, the German Research Foundation (SPP 1395). TH and JM were supported by NWO, the Netherlands Organization for Scientific Research (VICI grant 639.023.604 and VENI grant 639.031.036, respectively).

References

[1] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.
[2] K. A. Bollen. Structural Equations with Latent Variables. John Wiley & Sons, 1989.
[3] P. Spirtes, C. Glymour, and R. Scheines.
Causation, Prediction, and Search. Springer-Verlag, 1993 (2nd ed., MIT Press, 2000).
[4] C. W. J. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37:424–438, 1969.
[5] N. Friedman, K. Murphy, and S. Russell. Learning the structure of dynamic probabilistic networks. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI-98), pages 139–147, 1998.
[6] P. Spirtes. Directed cyclic graphical representations of feedback models. In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence (UAI-95), pages 491–499, 1995.
[7] T. Richardson. A discovery algorithm for directed cyclic graphs. In Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence (UAI-96), 1996.
[8] G. Lacerda, P. Spirtes, J. Ramsey, and P. O. Hoyer. Discovering cyclic causal models by independent components analysis. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI-2008), 2008.
[9] M. Schmidt and K. Murphy. Modeling discrete interventional data using directed cyclic graphical models. In Proceedings of the 25th Annual Conference on Uncertainty in Artificial Intelligence (UAI-09), 2009.
[10] S. Itani, M. Ohannessian, K. Sachs, G. P. Nolan, and M. A. Dahleh. Structure learning in causal cyclic networks. In JMLR Workshop and Conference Proceedings, volume 6, pages 165–176, 2010.
[11] P. O. Hoyer, D. Janzing, J. M. Mooij, J. Peters, and B. Schölkopf. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems 21 (NIPS*2008), pages 689–696, 2009.
[12] J. Peters, J. M. Mooij, D. Janzing, and B. Schölkopf. Identifiability of causal graphs using functional models. In Proceedings of the 27th Annual Conference on Uncertainty in Artificial Intelligence (UAI-11), 2011.
[13] K. Zhang and A. Hyvärinen. On the identifiability of the post-nonlinear causal model. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI-09), Montreal, Canada, 2009.
[14] A. D. Polyanin and V. F. Zaitsev. Handbook of Nonlinear Partial Differential Equations. Chapman & Hall/CRC, 2004.
[15] A. Gretton, R. Herbrich, A. Smola, O. Bousquet, and B. Schölkopf. Kernel methods for measuring independence. Journal of Machine Learning Research, 6:2075–2129, 2005.