{"title": "Nonlinear directed acyclic structure learning with weakly additive noise models", "book": "Advances in Neural Information Processing Systems", "page_first": 1847, "page_last": 1855, "abstract": "The recently proposed \\emph{additive noise model} has advantages over previous structure learning algorithms, when attempting to recover some true data generating mechanism, since it (i) does not assume linearity or Gaussianity and (ii) can recover a unique DAG rather than an equivalence class. However, its original extension to the multivariate case required enumerating all possible DAGs, and for some special distributions, e.g. linear Gaussian, the model is invertible and thus cannot be used for structure learning. We present a new approach which combines a PC style search using recent advances in kernel measures of conditional dependence with local searches for additive noise models in substructures of the equivalence class. This results in a more computationally efficient approach that is useful for arbitrary distributions even when additive noise models are invertible. Experiments with synthetic and real data show that this method is more accurate than previous methods when data are nonlinear and/or non-Gaussian.", "full_text": "Nonlinear directed acyclic structure learning with weakly additive noise models\n\nRobert E. Tillman\nCarnegie Mellon University\nPittsburgh, PA\nrtillman@cmu.edu\n\nArthur Gretton\nCarnegie Mellon University, MPI for Biological Cybernetics\nPittsburgh, PA\narthur.gretton@gmail.com\n\nPeter Spirtes\nCarnegie Mellon University\nPittsburgh, PA\nps7z@andrew.cmu.edu\n\nAbstract\n\nThe recently proposed additive noise model has advantages over previous directed structure learning approaches since it (i) does not assume linearity or Gaussianity and (ii) can discover a unique DAG rather than its Markov equivalence class. However, for certain distributions, e.g. 
linear Gaussians, the additive noise model is invertible and thus not useful for structure learning, and it was originally proposed for the two variable case with a multivariate extension which requires enumerating all possible DAGs. We introduce weakly additive noise models, which extend this framework to cases where the additive noise model is invertible and where additive noise is not present. We then provide an algorithm that learns an equivalence class for such models from data, by combining a PC style search using recent advances in kernel measures of conditional dependence with local searches for additive noise models in substructures of the Markov equivalence class. This results in a more computationally efficient approach that is useful for arbitrary distributions even when additive noise models are invertible.\n\n1 Introduction\n\nLearning probabilistic graphical models from data serves two primary purposes: (i) finding compact representations of probability distributions to make inference efficient and (ii) modeling unknown data generating mechanisms and predicting causal relationships. Until recently, most constraint-based and score-based algorithms for learning directed graphical models from continuous data required assuming that relationships between variables are linear with Gaussian noise. While this assumption may be appropriate in many contexts, there are well known contexts, such as fMRI images, where variables have nonlinear dependencies and data do not tend towards Gaussianity. A second major limitation of the traditional algorithms is that they cannot identify a unique structure; they reduce the set of possible structures to an equivalence class which entails the same Markov properties. 
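To make the equivalence-class limitation concrete (an illustration of ours, not from the paper, assuming NumPy is available): the chain X → Y → Z and the fork X ← Y → Z entail exactly the same independence facts (X and Z dependent, X ⊥⊥ Z | Y), so no method based only on conditional independence can separate them:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def partial_corr(x, z, y):
    # correlate the residuals of x and z after linearly regressing each on y
    rx = x - np.polyval(np.polyfit(y, x, 1), y)
    rz = z - np.polyval(np.polyfit(y, z, 1), y)
    return np.corrcoef(rx, rz)[0, 1]

# chain: X -> Y -> Z
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(size=n)
z = 0.8 * y + rng.normal(size=n)
chain = (np.corrcoef(x, z)[0, 1], partial_corr(x, z, y))

# fork: X <- Y -> Z
y2 = rng.normal(size=n)
x2 = 0.8 * y2 + rng.normal(size=n)
z2 = 0.8 * y2 + rng.normal(size=n)
fork = (np.corrcoef(x2, z2)[0, 1], partial_corr(x2, z2, y2))

# both structures show marginal dependence and conditional independence given Y
print(chain, fork)
```

In both cases the marginal correlation is clearly nonzero while the partial correlation given Y is near zero, which is why linear-Gaussian methods can only return the equivalence class.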
The recently proposed additive noise model [1] for structure learning addresses both limitations; by taking advantage of observed nonlinearity and non-Gaussianity, a unique directed acyclic structure can be identified in many contexts. However, it too suffers from limitations: (i) for certain distributions, e.g. linear Gaussians, the model is invertible and not useful for structure learning; (ii) it was originally proposed for two variables with a multivariate extension that requires enumerating all possible DAGs, which is super-exponential in the number of variables.\n\nIn this paper, we address the limitations of the additive noise model. We introduce weakly additive noise models, which have the advantages of additive noise models, but are still useful when the additive noise model is invertible and in most cases when additive noise is not present. Weakly additive noise models allow us to express greater uncertainty about the data generating mechanism, but can still identify a unique structure or a smaller equivalence class in most cases. We also provide an algorithm for learning an equivalence class for such models from data that is more computationally efficient in the more than two variables case. Section 2 reviews the appropriate background; section 3 introduces weakly additive noise models; section 4 describes our learning algorithm; section 5 discusses some related research; section 6 presents some experimental results; finally, section 7 offers conclusions.\n\n2 Background\n\nLet G = ⟨V, E⟩ be a directed acyclic graph (DAG), where V denotes the set of vertices and Eij ∈ E denotes a directed edge Vi → Vj. Vi is a parent of Vj and Vj is a child of Vi. For Vi ∈ V, Pa^Vi_G denotes the parents of Vi and Ch^Vi_G denotes the children of Vi. The degree of Vi is the number of edges with an endpoint at Vi. A v-structure is a triple ⟨Vi, Vj, Vk⟩ ⊆ V such that {Vi, Vk} ⊆ Pa^Vj_G. A v-structure is immoral, or an immorality, if Eik ∉ E and Eki ∉ E. A joint distribution P over variables corresponding to nodes in V is Markov with respect to G if P_P(V) = ∏_{Vi ∈ V} P_P(Vi | Pa^Vi_G). P is faithful to G if every conditional independence true in P is entailed by the above factorization. A partially directed acyclic graph (PDAG) H for G is a mixed graph, i.e. consisting of directed and undirected edges, representing all DAGs Markov equivalent to G, i.e. DAGs entailing exactly the same conditional independencies. If Vi → Vj is a directed edge in H, then all DAGs Markov equivalent to G have this directed edge; if Vi − Vj is an undirected edge in H, then some DAGs that are Markov equivalent to G have the directed edge Vi → Vj while others have the directed edge Vi ← Vj.\n\nThe PC algorithm is a well known constraint-based, or conditional independence based, structure learning algorithm. 
It is an improved greedy version of the SGS [2] and IC [3] algorithms, shown below.\n\nAlgorithm 1: SGS/IC algorithm\nInput: Observed data for variables in V\nOutput: PDAG G over nodes V\nG ← the complete undirected graph over the variables in V\nFor {Vi, Vj} ⊆ V, if ∃S ⊆ V\\{Vi, Vj} such that Vi ⊥⊥ Vj | S, remove the Vi − Vj edge\nFor {Vi, Vj, Vk} ⊆ V such that Vi − Vj and Vj − Vk remain as edges, but Vi − Vk does not remain, if ∄S ⊆ V\\{Vi, Vj, Vk} such that Vi ⊥⊥ Vk | {S ∪ Vj}, orient Vi → Vj ← Vk\nOrient edges to prevent additional immoralities and cycles using the Meek rules [4]\n\nInstead of searching all subsets of V\\{Vi, Vj} for an S such that Vi ⊥⊥ Vj | S, PC (i) initially sets S = ∅ for all {Vi, Vj} pairs, (ii) checks to see if any edges can be removed based on the results of conditional independence tests with these S sets, and (iii) iteratively increases the cardinality of the S considered until ∄Vk ∈ V with degree greater than |S|. S is only considered if it is a subset of nodes connected to Vi or Vj at the current iteration. PC learns the correct PDAG in the large sample limit when the Markov, faithfulness, and causal sufficiency (that there are no unmeasured common causes of two or more measured variables) assumptions hold [2]. The partial correlation based Fisher Z-transformation test, which assumes linear Gaussian distributions, is used for conditional independence testing with continuous variables. The statistical advantage of PC is that it limits the number of tests performed, particularly those with large conditioning sets. 
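The skeleton phase that PC adds to SGS/IC can be sketched as follows (our illustration, not the authors' code; `indep` is a hypothetical conditional-independence oracle or test supplied by the caller):

```python
from itertools import combinations

def pc_skeleton(nodes, indep):
    """PC-style skeleton search: indep(i, j, S) returns True when i and j
    are (judged) conditionally independent given the set S."""
    adj = {v: set(nodes) - {v} for v in nodes}
    sepset = {}
    s = 0
    # iteratively deepen |S| while some node still has degree greater than s
    while any(len(adj[v]) > s for v in nodes):
        for i in nodes:
            for j in list(adj[i]):
                # condition only on current neighbours of i (minus j)
                for S in combinations(sorted(adj[i] - {j}), s):
                    if indep(i, j, set(S)):
                        adj[i].discard(j)
                        adj[j].discard(i)
                        sepset[frozenset((i, j))] = set(S)
                        break
        s += 1
    return adj, sepset

# d-separation oracle for the chain X -> Y -> Z
def oracle(i, j, S):
    return {i, j} == {"X", "Z"} and "Y" in S

adj, sep = pc_skeleton(["X", "Y", "Z"], oracle)
print(adj)  # the X - Z edge is removed given {Y}; X - Y - Z remains
```

Only subsets of current neighbours are tried, which is what limits both the number and the size of the conditioning sets relative to the SGS/IC search.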
Limiting the number of tests also yields a computational advantage, since the number of possible tests is exponential in |V|.\n\nThe recently proposed additive noise model approach to structure learning [1] assumes only that each variable can be represented as a (possibly nonlinear) function f of its parents plus additive noise ǫ with some arbitrary distribution, and that the noise components are mutually independent, i.e. P(ǫ1, . . . , ǫn) = ∏_{i=1}^{n} P(ǫi). Consider the two variable case where X → Y is the true DAG, X = ǫX, Y = sin(πX) + ǫY, ǫX ∼ Unif(−1, 1), and ǫY ∼ Unif(−1, 1). If we regress Y on X (nonparametrically), the forward model, figure 1a, and regress X on Y, the backward model, figure 1b, we observe the residuals ˆǫY ⊥⊥ X and ˆǫX /⊥⊥ Y. This provides a criterion for distinguishing X → Y from X ← Y in many cases, but there are counterexamples such as the linear Gaussian case, where the forward model is invertible so we find ˆǫY ⊥⊥ X and ˆǫX ⊥⊥ Y. [1, 5] show, however, that whenever f is nonlinear, the forward model is noninvertible, and when f is linear, the forward model is only invertible when ǫ is Gaussian and in a few other special cases.\n\n[Figure 1: Nonparametric regressions with data overlayed for (a) Y regressed on X, (b) X regressed on Y, (c) Z regressed on X, and (d) X regressed on Z]\n\n
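The two-variable criterion above can be sketched numerically for the paper's sin(πX) example (our illustration: the Nadaraya-Watson smoother and all bandwidths are our own simplifying choices, and the dependence statistic is a plain biased HSIC estimate of the kind defined at the end of this section):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 400
x = rng.uniform(-1, 1, m)
y = np.sin(np.pi * x) + rng.uniform(-1, 1, m)

def smooth(t, s, bw=0.2):
    # Nadaraya-Watson kernel regression of s on t
    w = np.exp(-0.5 * ((t[:, None] - t[None, :]) / bw) ** 2)
    return (w @ s) / w.sum(axis=1)

def hsic(a, b, bw=0.5):
    # biased empirical HSIC with Gaussian kernels: (1/m^2) tr(K~ L~)
    gram = lambda v: np.exp(-0.5 * ((v[:, None] - v[None, :]) / bw) ** 2)
    H = np.eye(m) - np.ones((m, m)) / m
    K, L = H @ gram(a) @ H, H @ gram(b) @ H
    return np.trace(K @ L) / m**2

res_fwd = y - smooth(x, y)   # forward model: regress Y on X
res_bwd = x - smooth(y, x)   # backward model: regress X on Y
# forward residuals look independent of X; backward residuals depend on Y
print(hsic(x, res_fwd), hsic(y, res_bwd))
```

The backward-direction statistic comes out markedly larger than the forward one, which is the asymmetry the additive noise model exploits.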
Another limitation of this approach is that it is not closed under marginalization of intermediary variables when f is nonlinear: e.g. for X → Y → Z with X = ǫX, Y = X³ + ǫY, Z = Y³ + ǫZ, ǫX ∼ Unif(−1, 1), ǫY ∼ Unif(−1, 1), and ǫZ ∼ Unif(0, 1), observing only X and Z (figures 1c and 1d) causes us to reject both the forward and backward models. [5] shows this method can be generalized to more variables. To test whether a DAG is compatible with the data, we regress each variable on its parents and test whether the resulting residuals are mutually independent. This procedure is impractical even for a few variables, however, since the number of possible DAGs grows super-exponentially with the number of variables, e.g. there are ≈ 4.2 × 10^18 DAGs with 10 nodes. Since we do not assume linearity or Gaussianity in this framework, a sufficiently powerful nonparametric independence test must be used. Typically, the Hilbert Schmidt Independence Criterion [6] is used, which we now define.\n\nLet X be a random variable with domain X. A Hilbert space H_X of functions from X to R is a reproducing kernel Hilbert space (RKHS) if for some kernel k(·, ·) (the reproducing kernel for H_X), for every f(·) ∈ H_X and x ∈ X, the inner product ⟨f(·), k(x, ·)⟩_{H_X} = f(x). We may treat k(x, ·) as a mapping of x to the feature space H_X. For x, x′ ∈ X, ⟨k(x, ·), k(x′, ·)⟩_{H_X} = k(x, x′), so we can compute inner products efficiently in this high dimensional space. The Moore-Aronszajn theorem shows that all symmetric positive definite kernels (most popular kernels) are reproducing kernels that uniquely define corresponding RKHSs [7]. Let Y be a random variable with domain Y and l(·, ·) the reproducing kernel for H_Y. 
We define the mean map µX and cross covariance CXY as follows, using ⊗ to denote the tensor product:\n\nµX = EX[k(x, ·)]    CXY = EXY([k(x, ·) − µX] ⊗ [l(y, ·) − µY])\n\nIf the kernels are characteristic, e.g. Gaussian and Laplace kernels, the mean map is injective [8, 9, 10], so distinct probability distributions have different mean maps. The Hilbert Schmidt Independence Criterion (HSIC) HXY = ‖CXY‖²_HS measures the dependence of X and Y, where ‖·‖_HS denotes the Hilbert Schmidt norm. [9] shows HXY = 0 if and only if X ⊥⊥ Y for characteristic kernels. For m paired i.i.d. samples, let K and L be Gram matrices for k(·, ·) and l(·, ·), i.e. kij = k(xi, xj). For H = I_m − (1/m)1_m1_m^⊤, let K̃ = HKH and L̃ = HLH be centered Gram matrices. ĤXY = (1/m²) tr(K̃L̃), where tr denotes the trace, is an empirical estimator for HXY [6]. To determine the threshold of a level-α statistical test, we can use the permutation approach (where we compute ĤXY for multiple random assignments of the Y samples to X, and use the 1 − α quantile of the resulting empirical distribution over ĤXY), or a Gamma approximation to the null distribution of mĤXY (see [6] for details).\n\n3 Weakly additive noise models\n\nWe now extend the additive noise model framework to account for cases where additive noise models are invertible and cases where additive noise may not be present.\n\nDefinition 3.1. ψ = ⟨Vi, Pa^Vi_G⟩ is a local additive noise model for a distribution P over V that is Markov to a DAG G = ⟨V, E⟩ if Vi = f(Pa^Vi_G) + ǫ is an additive noise model.\n\nDefinition 3.2. A weakly additive noise model M = ⟨G, Ψ⟩ for a distribution P over V is a DAG G = ⟨V, E⟩ and a set of local additive noise models Ψ, such that P is Markov to G, ψ ∈ Ψ if and only if ψ is a local additive noise model for P, and ∀⟨Vi, Pa^Vi_G⟩ ∈ Ψ, ∄Vj ∈ Pa^Vi_G such that there exists some graph G′ (not necessarily related to P) such that Vi ∈ Pa^Vj_{G′} and ⟨Vj, Pa^Vj_{G′}⟩ is a local additive noise model for P.\n\nWhen we assume a data generating process has a weakly additive noise model representation, we assume only that there are no cases where X → Y can be written X = f(Y) + ǫX, but not Y = f(X) + ǫY. In other words, the data cannot appear as though it admits an additive noise model representation, but only in the incorrect direction. This representation is still appropriate when additive noise models are invertible, and when additive noise is not present: such cases only lead to weakly additive noise models which express greater underdetermination of the true data generating process.\n\nWe now define the notion of distribution-equivalence for weakly additive noise models.\n\nDefinition 3.3. A weakly additive noise model M = ⟨G, Ψ⟩ is distribution-equivalent to N = ⟨G′, Ψ′⟩ if and only if G and G′ are Markov equivalent and ψ ∈ Ψ if and only if ψ ∈ Ψ′.\n\nDistribution-equivalence defines what can be discovered about the true data generating mechanism using observational data. We now define a new structure to partition data generating processes which instantiate distribution-equivalent weakly additive noise models.\n\nDefinition 3.4. A weakly additive noise partially directed acyclic graph (WAN-PDAG) for M = ⟨G, Ψ⟩ is a mixed graph H = ⟨V, E⟩ such that for {Vi, Vj} ⊆ V,\n\n1. 
Vi → Vj is a directed edge in H if and only if Vi → Vj is a directed edge in G and in all G′ such that N = ⟨G′, Ψ′⟩ is distribution-equivalent to M\n\n2. Vi − Vj is an undirected edge in H if and only if Vi → Vj is a directed edge in G and there exists a G′ and N = ⟨G′, Ψ′⟩ distribution-equivalent to M such that Vi ← Vj is a directed edge in G′\n\nWe now get the following results.\n\nLemma 3.1. Let M = ⟨G, Ψ⟩ be a weakly additive noise model, ⟨Vi, Pa^Vi_G⟩ ∈ Ψ, and N = ⟨G′, Ψ′⟩ be distribution-equivalent to M. Then Pa^Vi_G = Pa^Vi_{G′} and Ch^Vi_G = Ch^Vi_{G′}.\n\nProof. Since M and N are distribution-equivalent, Pa^Vi_G = Pa^Vi_{G′}. Thus, Ch^Vi_G = Ch^Vi_{G′}.\n\nTheorem 3.1. The WAN-PDAG for M = ⟨G, Ψ⟩ is constructed by (i) adding all directed and undirected edges in the PDAG instantiated by M, (ii) ∀⟨Vi, Pa^Vi_G⟩ ∈ Ψ, directing all Vj ∈ Pa^Vi_G as Vj → Vi and all Vk ∈ Ch^Vi_G as Vi → Vk, and (iii) applying the extended Meek rules [4], treating orientations made using Ψ as background knowledge.\n\nProof. (i) This is correct because of Markov equivalence [2]. (ii) This is correct by lemma 3.1. (iii) These rules are correct and complete [4].\n\nWAN-PDAGs can be used to identify the same information about the data generating mechanism as additive noise models, when additive noise models are identifiable, but provide a more powerful representation of uncertainty and can be used to discover more information when additive noise models are unidentifiable. 
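Step (iii) of the construction relies on orientation propagation. As a rough illustration (ours, not the authors' implementation), the first two Meek rules can be closed over a mixed graph represented by hypothetical directed/undirected edge sets (R3 and R4 are omitted for brevity):

```python
def meek_closure(directed, undirected):
    """Repeatedly apply Meek rules R1 and R2:
    R1: a -> x, x - y, a and y non-adjacent  =>  x -> y
    R2: x -> z, z -> y, x - y                =>  x -> y"""
    directed = set(directed)
    undirected = {frozenset(e) for e in undirected}

    def adjacent(a, b):
        return ((a, b) in directed or (b, a) in directed
                or frozenset((a, b)) in undirected)

    changed = True
    while changed:
        changed = False
        for e in list(undirected):
            for x, y in (tuple(e), tuple(e)[::-1]):
                r1 = any(b == x and not adjacent(a, y) for a, b in directed)
                r2 = any((x, z) in directed and (z, y) in directed
                         for z in {b for _, b in directed})
                if r1 or r2:
                    undirected.discard(e)
                    directed.add((x, y))
                    changed = True
                    break
    return directed, undirected

# immorality X -> Z <- Y with a remaining undirected edge Z - W:
# R1 forces Z -> W, since W -> Z would create a new immorality
d, u = meek_closure({("X", "Z"), ("Y", "Z")}, {("Z", "W")})
print(d, u)
```

Orientations obtained from local additive noise models would be fed in through the `directed` set as background knowledge before closing.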
The next section describes an efficient algorithm for learning WAN-PDAGs from data.\n\n4 The Kernel PC (kPC) algorithm\n\nWe now describe the Kernel PC (kPC) algorithm¹, which consists of two stages: (i) a constraint-based search using the PC algorithm with a nonparametric conditional independence test (the Fisher Z test is inappropriate since we want to allow nonlinearity and non-Gaussianity) to identify the Markov equivalence class and (ii) a “PC-style” search for noninvertible additive noise models in submodels of the Markov equivalence class.\n\nIn the first stage, we use a kernel-based conditional dependence measure similar to HSIC [9] (see also [11, Section 2.2] for a related quantity with a different normalization). For a conditioning variable Z with centered Gram matrix M̃ for a reproducing kernel m(·, ·), we define the conditional cross covariance CXY|Z = C_ẌŸ − C_ẌZ C_ZZ^{−1} C_ZŸ, where Ẍ = (X, Z) and Ÿ = (Y, Z). Let HXY|Z = ‖CXY|Z‖²_HS. It follows from [9, Theorem 3] that HXY|Z = 0 if and only if X ⊥⊥ Y | Z when kernels are characteristic. [9] provides the empirical estimator:\n\nĤXY|Z = (1/m²) tr(K̃L̃ − 2K̃M̃(M̃ + ǫI_m)^{−2}M̃L̃ + K̃M̃(M̃ + ǫI_m)^{−2}M̃L̃M̃(M̃ + ǫI_m)^{−2}M̃)\n\nThe null distribution of ĤXY|Z is unknown and difficult to derive, so we must use the permutation approach described in section 2. 
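A naive O(m³) implementation of the estimator above can be sketched as follows (our illustration; the Gaussian kernel bandwidth and the regularizer ǫ are arbitrary choices, and K̃, L̃ are taken as centered Gram matrices of the extended variables Ẍ = (X, Z) and Ÿ = (Y, Z)):

```python
import numpy as np

def _gram(v, bw=1.0):
    # centered Gaussian Gram matrix of the rows of v
    d2 = ((v[:, None, :] - v[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * d2 / bw**2)
    m = len(v)
    H = np.eye(m) - np.ones((m, m)) / m
    return H @ K @ H

def cond_hsic(x, y, z, eps=1e-3):
    """Empirical H_{XY|Z} per the trace formula above (naive O(m^3))."""
    m = len(z)
    Kt = _gram(np.column_stack([x, z]))   # Gram of (X, Z)
    Lt = _gram(np.column_stack([y, z]))   # Gram of (Y, Z)
    Mt = _gram(z[:, None])                # Gram of Z
    A = np.linalg.inv(Mt + eps * np.eye(m))
    R = Mt @ A @ A @ Mt                   # M~ (M~ + eps I)^-2 M~
    return np.trace(Kt @ Lt - 2 * Kt @ R @ Lt + Kt @ R @ Lt @ R) / m**2

# sanity check: X indep. of Y given Z vs X dep. on Y given Z
rng = np.random.default_rng(0)
m = 120
z = rng.normal(size=m)
x = z + 0.3 * rng.normal(size=m)
y_ci = z + 0.3 * rng.normal(size=m)      # conditionally independent of x
y_dep = x + 0.1 * rng.normal(size=m)     # conditionally dependent on x
print(cond_hsic(x, y_ci, z), cond_hsic(x, y_dep, z))
```

The statistic is visibly larger in the conditionally dependent case; a permutation test over this quantity is what supplies the threshold.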
Applying the permutation approach is not straightforward, however, since permuting X or Y while leaving Z fixed changes the marginal distribution of X given Z or Y given Z. We thus (making analogy to the discrete case) must cluster Z and then permute elements only within clusters for the permutation test, as in [12].\n\nThis first stage is not computationally efficient, however, since each evaluation of ĤXY|Z is naively O(m³) and we need to evaluate ĤXY|Z approximately 1000 times for each permutation test. Fortunately, we see from [13, Appendix C] that the eigenspectra of Gram matrices for Gaussian kernels decay very rapidly, so low rank approximations of these matrices can be obtained even when using a very conservative threshold. We implemented the incomplete Cholesky factorization [14], which can be used to obtain an m × p matrix G, where p ≪ m, and an m × m permutation matrix P such that K ≈ PGG⊤P⊤, where K is an m × m Gram matrix. A clever implementation after replacing Gram matrices in ĤXY|Z with their incomplete Cholesky factorizations and using an appropriate equivalence to invert G⊤G + ǫI_p (for M̃) instead of GG⊤ + ǫI_m results in a straightforward O(mp³) operation. Unfortunately, this is not numerically stable unless a relatively large regularizer ǫ is chosen or only a small number of columns are used in the incomplete Cholesky factorizations.\n\nA more stable (and faster) approach is to obtain incomplete Cholesky factorizations GX, GY, and GZ with permutation matrices PX, PY, and PZ, and then obtain the thin SVDs for HPXGX, HPYGY, and HPZGZ, e.g. HPG = USV⊤, where U is m × p, S is the p × p diagonal matrix of singular values, and V is p × p. 
Now define matrices S̄X, S̄Y, and S̄Z and ḠX, ḠY, and ḠZ as follows:\n\ns̄X_ii = (sX_ii)²    s̄Y_ii = (sY_ii)²    s̄Z_ii = (sZ_ii)² / ((sZ_ii)² + ǫ)\n\nḠX = UX S̄X UX⊤    ḠY = UY S̄Y UY⊤    ḠZ = UZ S̄Z UZ⊤\n\nWe can compute ĤXY|Z = (1/m²) tr(ḠX ḠY − 2ḠX ḠZ ḠY + ḠX ḠZ ḠY ḠZ) stably and efficiently in O(mp³) by choosing an appropriate associative ordering of the matrix multiplications. Figure 2 shows that this method leads to a significant increase in speed when used with a permutation test for conditional independence without significantly affecting the empirically observed type I error rate for a level-.05 test.\n\n[Figure 2: Runtime and Empirical Type I Error Rate for the naive approach vs. incomplete Cholesky + SVD. Results are over the generation of 20 3-node DAGs for which X ⊥⊥ Y | Z and the generating distribution was Gaussian.]\n\nIn the second stage, we look for additive noise models in submodels of the Markov equivalence class because (i) it may be more efficient to do so and require fewer tests since orientations implied by an additive noise model may imply further orientations and (ii) we may find more orientations by considering submodels, e.g. if all relations are linear and only one variable has a non-Gaussian noise term. The basic strategy used is a “PC-style” greedy search where we look for undirected edges in the current mixed graph (starting with the PDAG resulting from the first stage) adjacent to the fewest other undirected edges. If these edges can be oriented using additive noise models, we make the implied orientations, apply the extended Meek rules, and then iterate until no more edges can be oriented. Algorithm 2 provides pseudocode. Let G = ⟨V, E⟩ be the resulting PDAG and ∀Vi ∈ V, let U^Vi_G denote the nodes connected to Vi in G by an undirected edge. We get the following results.\n\nAlgorithm 2: Second Stage of kPC\nInput: PDAG G = ⟨V, E⟩\nOutput: WAN-PDAG G = ⟨V, E⟩\n1  s ← 1\n2  while max_{Vi∈V} |U^Vi_G| ≥ s do\n3      foreach Vi ∈ V such that |U^Vi_G| = s, or |U^Vi_G| < s and U^Vi_G was updated, do\n4          s′ ← s\n5          while s′ > 0 do\n6              foreach S ⊆ U^Vi_G such that |S| = s′ and, ∀Sk ∈ S, orienting Sk → Vi does not create an immorality do\n7                  Nonparametrically regress Vi on Pa^Vi_G ∪ S and compute the residual ˆǫiS\n8                  if ˆǫiS ⊥⊥ S and ∄Vj ∈ S and S′ ⊆ U^Vj_G such that regressing Vj on Pa^Vj_G ∪ S′ ∪ {Vi} results in the residual ˆǫjS′∪{Vi} ⊥⊥ S′ ∪ {Vi} then\n9                      ∀Sk ∈ S, orient Sk → Vi, and ∀Ul ∈ U^Vi_G\\S orient Vi → Ul\n10                     Apply the extended Meek rules\n11                     ∀Vm ∈ V, update U^Vm_G, set s′ = 1, and break\n12                 end\n13             end\n14             s′ ← s′ − 1\n15         end\n16     end\n17     s ← s + 1\n18 end\n\n¹MATLAB code may be obtained from http://www.andrew.cmu.edu/∼rtillman/kpc\n\nLemma 4.1. 
If an edge is oriented in the second stage of kPC, it is implied by a noninvertible local additive noise model.\n\nProof. If the condition at line 8 is true then ⟨Vi, Pa^Vi_G ∪ S⟩ is a noninvertible local additive noise model. All Ul ∈ U^Vi_G\\S must be children of Vi by lemma 3.1.\n\n[Figure 3: Precision and Recall vs. sample size for kPC, PC, GES, and LiNGAM under linear Gaussian, linear non-Gaussian, and nonlinear non-Gaussian conditions]\n\nLemma 4.2. Suppose ψ = ⟨Vi, W⟩ is a noninvertible local additive noise model. Then kPC will make all orientations implied by ψ.\n\nProof. Let S̃ = W\\Pa^Vi_G, where Pa^Vi_G is taken at the current iteration. kPC must terminate with s > |S̃| since |S̃| ≤ |U^Vi_G|, so S = S̃ at some iteration. Since ⟨Vi, Pa^Vi_G ∪ S̃⟩ is a noninvertible local additive noise model, line 8 is satisfied so all edges connected to Vi are oriented.\n\nTheorem 4.1. Assume data is generated according to some weakly additive noise model M = ⟨G, Ψ⟩. 
Then kPC will return the WAN-PDAG instantiated by M, assuming perfect conditional independence information, Markov, faithfulness, and causal sufficiency.\n\nProof. The PC algorithm is correct and complete with respect to conditional independence [2]. Orientations made with respect to additive noise models are correct by lemma 4.1 and all such orientations that can be made are made by lemma 4.2. The Meek rules, which are correct and complete [4], are invoked after each orientation made with respect to additive noise models so they are invoked after all such orientations are made.\n\n5 Related research\n\nkPC is similar in spirit to the PC-LiNGAM structure learning algorithm [15], which assumes dependencies are linear with either Gaussian or non-Gaussian noise. PC-LiNGAM combines the PC algorithm with LiNGAM to learn structures referred to as ngDAGs. KCL [11] is a heuristic search for a mixed graph that uses the same kernel-based dependence measures as kPC (while not determining significance thresholds via a hypothesis test), but does not take advantage of additive noise models. [16] provides a more efficient algorithm for learning additive noise models, by first finding a causal ordering after doing a series of high dimensional regressions and HSIC independence tests and then pruning the resulting DAG implied by this ordering. 
Finally, [17] proposes a two-stage procedure for learning additive noise models from data that is similar to kPC, but requires the additive noise model assumptions in the first stage where the Markov equivalence class is identified.\n\n6 Experimental results\n\nTo evaluate kPC, we generated 20 random 7-node DAGs using the MCMC algorithm in [18] and sampled 1000 data points from each DAG under three conditions: linear dependencies with Gaussian noise, linear dependencies with non-Gaussian noise, and nonlinear dependencies with non-Gaussian noise. We generated non-Gaussian noise using the same procedure as [19] and used polynomial and trigonometric functions for nonlinear dependencies.\n\n[Figure 4: Structures learned by kPC and iMAGES over the brain regions I, LOCC, LACC, LIFG, LIPL, and LMTG]\n\nWe compared kPC to PC, the score-based GES with the BIC-score [20], and the ICA-based LiNGAM [19], which assumes linear dependencies and non-Gaussian noise. We applied two metrics in measuring performance vs. sample size: precision, i.e. the proportion of directed edges in the resulting graph that are in the true DAG, and recall, i.e. the proportion of directed edges in the true DAG that are in the resulting graph. Figure 3 reports the results. In the linear Gaussian case, we see PC shows slightly better performance than kPC in precision, which is unsurprising since PC assumes linear Gaussian distributions. Only LiNGAM shows better recall, but worse precision. LiNGAM performs significantly better than the other algorithms in the linear non-Gaussian case. kPC performs about the same as PC in precision and recall, which again is unsurprising since previous simulation results have shown that nonlinearity, but not non-Gaussianity, can significantly affect the performance of PC. In the nonlinear non-Gaussian case, kPC performs slightly better than PC in precision. 
We note, however, that in some of these cases the performance of kPC was significantly better.²\n\nWe also ran kPC on data from an fMRI experiment that is analyzed in [21] where nonlinear dependencies can be observed. Figure 4 shows the structure that kPC learned, where each of the nodes corresponds to a particular brain region. This structure is the same as the one learned by the (GES-style) iMAGES algorithm in [21] except for the absence of one edge. However, iMAGES required background knowledge to direct the edges. kPC successfully found the same directed edges without using any background knowledge. Domain experts in neuroscience have confirmed the plausibility of the observed relationships.\n\n7 Conclusion\n\nWe introduced weakly additive noise models, which extend the additive noise model framework to cases such as the linear Gaussian, where the additive noise model is invertible and thus unidentifiable, as well as cases where additive noise is not present. The weakly additive noise framework allows us to identify a unique DAG when the additive noise model assumptions hold, and a structure that is at least as specific as a PDAG (possibly still a unique DAG) when some additive noise assumptions fail. We defined equivalence classes for such models and introduced the kPC algorithm for learning these equivalence classes from data. Finally, we found that the algorithm performed well on both synthetic and real data.\n\nAcknowledgements\n\nWe thank Dominik Janzing and Bernhard Schölkopf for helpful comments. RET was funded by a grant from the James S. McDonnell Foundation. AG was funded by DARPA IPTO FA8750-09-1-0141, ONR MURI N000140710747, and ARO MURI W911NF0810242.\n\n²When simulating nonlinear data, we must be careful to ensure that variances do not blow up and result in data for which no finite sample method can show adequate performance. 
This has the unfortunate side effect that the nonlinear data generated may be well approximated using linear methods. Future research will consider more sophisticated methods for simulating data that are more appropriate when comparing kPC to linear methods.

References

[1] P. O. Hoyer, D. Janzing, J. M. Mooij, J. Peters, and B. Schölkopf. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems 21, 2009.

[2] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. 2nd edition, 2000.

[3] J. Pearl. Causality: Models, Reasoning, and Inference. 2000.

[4] C. Meek. Causal inference and causal explanation with background knowledge. In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, 1995.

[5] K. Zhang and A. Hyvärinen. On the identifiability of the post-nonlinear causal model. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.

[6] A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, 2008.

[7] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3):337–404, 1950.

[8] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems 19, 2007.

[9] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems 20, 2008.

[10] B. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. Injective Hilbert space embeddings of probability measures. In Proceedings of the 21st Annual Conference on Learning Theory, 2008.

[11] X. Sun, D. Janzing, B. Schölkopf, and K.
Fukumizu. A kernel-based causal learning algorithm. In Proceedings of the 24th International Conference on Machine Learning, 2007.

[12] X. Sun. Causal inference from statistical data. PhD thesis, Max Planck Institute for Biological Cybernetics, 2008.

[13] F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.

[14] S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264, 2001.

[15] P. O. Hoyer, A. Hyvärinen, R. Scheines, P. Spirtes, J. Ramsey, G. Lacerda, and S. Shimizu. Causal discovery of linear acyclic models with arbitrary distributions. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence, 2008.

[16] J. M. Mooij, D. Janzing, J. Peters, and B. Schölkopf. Regression by dependence minimization and its application to causal inference in additive noise models. In Proceedings of the 26th International Conference on Machine Learning, 2009.

[17] K. Zhang and A. Hyvärinen. Acyclic causality discovery with additive noise: An information-theoretical perspective. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2009.

[18] G. Melançon, I. Dutour, and M. Bousquet-Mélou. Random generation of DAGs for graph drawing. Technical Report INS-R0005, Centre for Mathematics and Computer Sciences, 2000.

[19] S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003–2030, 2006.

[20] D. M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507–554, 2002.

[21] J. D. Ramsey, S. J. Hanson, C. Hanson, Y. O. Halchenko, R. A. Poldrack, and C. Glymour.
Six problems for causal inference from fMRI. NeuroImage, 2009. In press.