{"title": "Bayesian Sparse Factor Models and DAGs Inference and Comparison", "book": "Advances in Neural Information Processing Systems", "page_first": 736, "page_last": 744, "abstract": "In this paper we present a novel approach to learn directed acyclic graphs (DAGs) and factor models within the same framework while also allowing for model comparison between them. For this purpose, we exploit the connection between factor models and DAGs to propose Bayesian hierarchies based on spike and slab priors to promote sparsity, heavy-tailed priors to ensure identifiability and predictive densities to perform the model comparison. We require identifiability to be able to produce variable orderings leading to valid DAGs and sparsity to learn the structures. The effectiveness of our approach is demonstrated through extensive experiments on artificial and biological data showing that our approach outperforms a number of state-of-the-art methods.", "full_text": "Bayesian Sparse Factor Models and DAGs: Inference and Comparison

Ricardo Henao
DTU Informatics, Technical University of Denmark, 2800 Lyngby, Denmark
Bioinformatics Centre, University of Copenhagen, 2200 Copenhagen, Denmark
rhenao@binf.ku.dk

Ole Winther
DTU Informatics, Technical University of Denmark, 2800 Lyngby, Denmark
Bioinformatics Centre, University of Copenhagen, 2200 Copenhagen, Denmark
owi@imm.dtu.dk

Abstract

In this paper we present a novel approach to learn directed acyclic graphs (DAGs) and factor models within the same framework while also allowing for model comparison between them. For this purpose, we exploit the connection between factor models and DAGs to propose Bayesian hierarchies based on spike and slab priors to promote sparsity, heavy-tailed priors to ensure identifiability and predictive densities to perform the model comparison.
We require identifiability to be able to produce variable orderings leading to valid DAGs and sparsity to learn the structures. The effectiveness of our approach is demonstrated through extensive experiments on artificial and biological data showing that our approach outperforms a number of state-of-the-art methods.

1 Introduction

Sparse factor models have proven to be a very versatile tool for detailed modeling and interpretation of multivariate data, for example in the context of gene expression data analysis [1, 2]. A sparse factor model encodes the prior knowledge that the latent factors only affect a limited number of the observed variables. An alternative way of modeling the data is through linear regression between the measured quantities. This multiple regression model is a well-defined multivariate probabilistic model if the connectivity (non-zero weights) defines a directed acyclic graph (DAG). In practice, one usually considers either factor models or DAG models. Modeling the data with both types of models at the same time and then performing model comparison should provide additional insight, as these models are complementary and often closely related. Unfortunately, existing off-the-shelf models are specified in such a way that direct comparison is difficult. A more principled idea, which can be phrased in Bayesian terms, is for example to find an equivalence between both models, then represent them using a common/comparable hierarchy, and finally use a marginal likelihood or a predictive density to select one of them.
Although a formal connection between factor models and DAGs has already been established in [3], this paper makes important extensions such as explicitly modeling sparsity, stochastic search over the order of the variables, and model comparison.

It is well known that learning the structure of graphical models, in particular DAGs, is a very difficult task because it turns out to be a combinatorial optimization problem known to be NP-hard [4]. A commonly used approach for structure learning is to split the problem into two stages, using the fact that the space of variable orderings is far smaller than the space of all possible structures, e.g. by first attempting to learn a suitable permutation of the variables and then the skeleton of the structure given the already found ordering, or vice versa. Most of the work so far for continuous data assumes linearity and Gaussian variables, hence it can only recover the DAG structure up to Markov equivalence [5, 6, 7, 8], which means that some subset of links can be reversed without changing the likelihood [9]. To break the Markov equivalence, experimental (interventional) data is usually required in addition to the observational (non-interventional) data [10]. In order to obtain identifiability from purely observational data, strong assumptions have to be made [11, 3, 12]. In this work we follow the line of [3] by starting from a linear factor model and ensuring identifiability through non-normal heavy-tailed latent variables. As a byproduct we find a set of candidate orderings compatible with a linear DAG, i.e. a mixing matrix which is "close to" triangular. Finally, we may perform model comparison between the factor and DAG models inferred with fixed orderings taken from the candidate set.

The rest of the paper is organized as follows.
In Sections 2 to 5 we motivate and describe the different ingredients of our method, in Section 6 we discuss existing work, in Section 7 experiments on both artificial and real data are presented, and Section 8 concludes with a discussion and perspectives for future work.

2 From DAGs to factor models

We will assume that an ordered d-dimensional data vector Px can be represented as a directed acyclic graph with only observed nodes, where P is the usually unknown true permutation matrix. We will focus entirely on linear models such that the value of each variable is a linear weighted combination of its parent nodes plus a driving signal z,

x = P⁻¹BPx + z ,  (1)

where B is a strictly lower triangular square matrix. In this setting, each non-zero element of B corresponds to a link in the DAG. Solving for x we can rewrite the problem as

x = P⁻¹APz = P⁻¹(I − B)⁻¹Pz ,  (2)

which corresponds to a noise-free linear factor model with the restriction that P⁻¹AP must have a sparsity pattern that can be permuted to a triangular form, since (I − B)⁻¹ is triangular. This requirement alone is not enough to ensure identifiability (up to scaling and permutation of columns P_f)¹. We further have to use prior knowledge about the distribution of the factors z. A necessary condition is that these must be a set of non-Gaussian independent variables [11]. For heavy-tailed data it is often sufficient in practice to use a model with heavier tails than Gaussian [13]. If the requirements for A and for the distribution of z are met, we can first estimate P⁻¹AP and subsequently find P by searching over the space of all possible orderings. Recently, [3] applied the fastICA algorithm to solve for the inverse mixing matrix P⁻¹A⁻¹P.
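The equivalence between equations (1) and (2) is easy to check numerically. The following NumPy sketch is illustrative only (P = I and random Gaussian weights, not the paper's data generator): it builds a strictly lower triangular B and verifies that A = (I − B)⁻¹ is again lower triangular and that x = Az satisfies the DAG equation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 5, 1000

# Strictly lower triangular B: each non-zero entry is a directed link in the DAG.
B = np.tril(rng.normal(size=(d, d)), k=-1)

# Heavy-tailed (Laplace) driving signals z, as required for identifiability.
Z = rng.laplace(size=(d, N))

# Solving x = Bx + z gives x = (I - B)^{-1} z, i.e. a noise-free factor model
# with mixing matrix A = (I - B)^{-1}, which is again lower triangular.
A = np.linalg.inv(np.eye(d) - B)
X = A @ Z

assert np.allclose(A, np.tril(A))   # (I - B)^{-1} stays lower triangular
assert np.allclose(X, B @ X + Z)    # the same X satisfies the DAG equation (1)
```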
To find a candidate solution for B, P is set such that B, obtained from the direct relation in equation (1) as B = I − A⁻¹ (according to a magnitude-based criterion), is as close as possible to lower triangular. In the final step the Wald statistic is used for pruning B and the chi-square test is used for model selection.

In our work we also exploit the relation between factor models and linear DAGs. We apply a Bayesian approach to learning sparse factor models and DAGs, and the stochastic search for P is performed as an integrated part of inference in the sparse factor model. Inference of the factor model (including the order) and of the DAG parameters is performed as two separate inferences, such that the only input coming from the first part is a set of candidate orders.

3 From factor models to DAGs

Our first goal is to perform model inference in the families of factor and linear DAG models. We specify the joint distribution, or probability of everything, e.g. for the factor model, as

p(X, A, Z, Ψ, P, ·) = p(X|A, Z, P, ·)p(A|·)p(Z|·)p(Ψ|·)p(P|·)p(·) ,

where X = [x_1, . . . , x_N], Z = [z_1, . . . , z_N], N is the number of observations and (·) indicates additional parameters in the hierarchical models. The prior over permutations p(P|·) will always be chosen to be uniform over the d! possible values. The actual sampling based inference for P is discussed in the next section and the standard Gibbs sampling components are provided in the supplementary material.
Model comparison should ideally be performed using the marginal likelihood. This is more difficult to calculate with sampling than obtaining samples from the posterior, so we use predictive densities on a test set as a yardstick.

¹These ambiguities do not affect our ability to find the correct permutation P of the rows.

Factor model  Instead of using the noise-free factor model of equation (2) we allow for additive noise, x = P_r⁻¹AP_c z + ε, where ε is an additional Gaussian noise term with diagonal covariance matrix Ψ, i.e. uncorrelated noise, to account for independent measurement noise, P_r = P is the permutation matrix for the rows of A and P_c = P_f P_r is another permutation for the columns, with P_f accounting for the permutation freedom of the factors. We will not restrict the mixing matrix A to be triangular. Instead we infer P_r and P_c using a stochastic search based upon closeness to triangular as measured by a masked likelihood, see below. Now we can specify a hierarchy for the Bayesian model as follows

X|P_r, A, P_c, Z, Ψ ∼ N(X|P_r⁻¹AP_cZ, Ψ) ,   Z ∼ π(Z|·) ,   A ∼ ρ(A|·) ,
ψ_i⁻¹|s_s, s_r ∼ Gamma(ψ_i⁻¹|s_s, s_r) ,  (3)

where the ψ_i are elements of Ψ. For convenience, to exploit conjugate exponential families, we place a gamma prior on the precision of ε with shape s_s and rate s_r. Given that the data is standardized, the selection of hyperparameters for ψ_i is not very critical as long as both "signal and noise" are supported.
The prior should favor small values of ψ_i as well as provide support for ψ_i = 1, such that certain variables can be explained solely by noise (we set s_s = 2 and s_r = 0.05 in the experiments).

For the factors we use a heavy-tailed prior π(Z|·) in the form of a Laplace distribution, parameterized for convenience as a scale mixture of Gaussians [14]

z_jn|µ, λ ∼ Laplace(z_jn|µ, λ) = ∫₀^∞ N(z_jn|µ, υ_jn) Exponential(υ_jn|λ²) dυ_jn ,  (4)
λ²|ℓ_s, ℓ_r ∼ Gamma(λ²|ℓ_s, ℓ_r) ,  (5)

where z_jn is an element of Z, λ is the rate and υ_jn has an exponential distribution acting as mixing density. Furthermore, we place a gamma distribution on λ² to get conditionals for υ and λ² in standard conjugate families. We let the components of Z have on average unit variance. This is achieved by setting ℓ_s/ℓ_r = 2 (we set ℓ_s = 4 and ℓ_r = 2). Alternatively one may use a t distribution, again as a scale mixture of Gaussians, which can interpolate between very heavy (power law) and very light tails, i.e. becoming Gaussian as the degrees of freedom approach infinity. However, such flexibility comes at the price of making the hyperparameters more difficult to select, because the model could become unidentified for some settings.

Figure 1: Graphical model for the Bayesian hierarchy in equation (3).

The prior ρ(A|·) for the mixing matrix should be biased towards sparsity because we want to infer something close to a triangular matrix. Here we adopt a two-layer discrete spike and slab prior for the elements a_ij of A, similar to the one in [2].
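The scale-mixture representation in equations (4) and (5) can be checked by simulation. A minimal sketch follows, assuming the exponential mixing density is parameterized so that E[υ] = 2/λ², which matches the unit-variance argument in the text (the exact rate convention is our assumption, not stated explicitly in the source).

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 2.0        # Laplace rate parameter lambda
n = 200_000

# Scale mixture of Gaussians: draw v with mean 2/lam^2, then z|v ~ N(0, v).
# Marginally z should be Laplace with rate lam, hence variance 2/lam^2.
v = rng.exponential(scale=2.0 / lam**2, size=n)
z = rng.normal(0.0, np.sqrt(v))

# Direct Laplace draws for comparison (NumPy uses the scale b = 1/lam).
z_direct = rng.laplace(0.0, 1.0 / lam, size=n)

# Both routes should agree in variance (2/lam^2 = 0.5 here).
assert abs(z.var() - 2.0 / lam**2) < 0.02
assert abs(z.var() - z_direct.var()) < 0.02
```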
The first layer in the prior controls the sparsity of each element a_ij individually, whereas the second layer imposes a per-factor sparsity level to allow elements within the same factor to share information. The hierarchy can be written as

a_ij|r_ij, ψ_i, τ_ij ∼ (1 − r_ij)δ(a_ij) + r_ij N(a_ij|0, ψ_i τ_ij) ,   τ_ij⁻¹|t_s, t_r ∼ Gamma(τ_ij⁻¹|t_s, t_r) ,
r_ij|η_ij ∼ Bernoulli(r_ij|η_ij) ,   η_ij|q_ij, α_p, α_m ∼ (1 − q_ij)δ(η_ij) + q_ij Beta(η_ij|α_p α_m, α_p(1 − α_m)) ,  (6)
q_ij|ν_j ∼ Bernoulli(q_ij|ν_j) ,   ν_j|β_m, β_p ∼ Beta(ν_j|β_p β_m, β_p(1 − β_m)) ,

where δ(·) is a Dirac δ-function. The prior above specifies a point mass mixture over a_ij with mask r_ij. The expected probability of a_ij being non-zero is η_ij and is controlled through a beta hyperprior with mean α_m and precision α_p. Besides, each factor has a common sparsity rate ν_j that lets the elements η_ij be exactly zero with probability 1 − ν_j through a beta distribution with mean β_m and precision β_p, turning the distribution of η_ij bimodal over the unit interval. The magnitude of the non-zero elements in A is specified through the slab distribution depending on τ_ij. The parameters for τ_ij should be specified in the same fashion as those for ψ_i but putting more probability mass around a_ij = 1, for instance t_s = 4 and t_r = 10. Note that we scale the variances with ψ_i since this makes the model easier to specify and tends to give better mixing properties [15]. The masking matrix r_ij with parameters η_ij should be somewhat diffuse while favoring relatively large masking probabilities, e.g. α_p = 10 and α_m = 0.9.
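To make the two-layer hierarchy in (6) concrete, here is a hedged ancestral-sampling sketch. The ψ_i scaling of the slab is set to 1 for simplicity, and the β values used are milder than the text's suggestions purely so that a small draw is not entirely empty; these choices are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
a_p, a_m = 10.0, 0.9    # element-level beta hyperprior (as suggested in the text)
b_p, b_m = 20.0, 0.3    # per-factor beta hyperprior (milder than the text, for display)
ts, tr = 4.0, 10.0      # shape and rate for the slab scale tau_ij

nu = rng.beta(b_p * b_m, b_p * (1 - b_m), size=d)            # per-factor rate nu_j
q = rng.binomial(1, nu, size=(d, d))                         # factor-level mask q_ij
eta = q * rng.beta(a_p * a_m, a_p * (1 - a_m), size=(d, d))  # eta_ij = 0 when q_ij = 0
r = rng.binomial(1, eta)                                     # element-level mask r_ij
tau = 1.0 / rng.gamma(ts, 1.0 / tr, size=(d, d))             # tau_ij^{-1} ~ Gamma(ts, tr)
A = r * rng.normal(0.0, np.sqrt(tau))                        # spike at 0, slab N(0, tau_ij)

assert np.all(A[r == 0] == 0)    # masked entries are exactly zero
assert np.all(eta[q == 0] == 0)  # whole-factor sparsity propagates down the hierarchy
```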
Additionally, ν_j should favor very small values with low variance, for example β_p = 1000 and β_m = 0.005. The graphical model for the entire hierarchy in (3), omitting parameters, is shown in Figure 1.

DAG  We make the following Bayesian specification of the linear DAG model of equation (1),

X|P_r, B, · ∼ π(X − P_r⁻¹BP_rX|·) ,   B ∼ ρ(B|·) ,  (7)

where π and ρ are given by equations (4) and (6). The Bayesian specification for the DAG has a graphical model similar to the one in Figure 1 but without the noise variances Ψ. The factor model needs only the shared variance parameter λ for the Laplace distributed z_jn, because a change of scale in A is equivalent to a change of variance in z_jn. The DAG, on the other hand, needs individual variance parameters because it has no scaling freedom. Given that we know that B is strictly lower triangular, it should in general be less sparse than A; thus we use a different setting for the sparsity prior, i.e. β_p = 100 and β_m = 0.01.

4 Sampling based inference

For a given permutation P, Gibbs sampling can be used for inference of the remaining parameters. Details of the Gibbs sampler are given in the supplementary material; here we focus on the non-standard inference corresponding to the sampling over permutations. There are basically two approaches to find P: one is to perform the inference for the parameters and P jointly, with B restricted to be triangular. The other is to let the factor model be unrestricted and search for P according to a criterion that does not affect parameter inference. Here we prefer the latter for two reasons. First, joint combinatorial and parameter inference in this model will probably have poor mixing with slow convergence. Second, we are also interested in comparing the factor model against the DAG in cases where we cannot really assume that the data is well approximated by a DAG.
In our approach the proposal P⋆ corresponds to picking two of the elements in the order vector at random and exchanging them. Other approaches, such as restricting the proposal to pick two adjacent elements, have been suggested as well [16, 7]. For the linear DAG model we do not perform joint inference of P and the model parameters. Rather, we use a set of Ps found for the factor model to be good candidates for the DAG.

The stochastic search for P = P_c goes as follows: we make inference for the unrestricted factor model and propose P⋆_r and P⋆_c independently according to q(P⋆_r|P_r)q(P⋆_c|P_c), which is the uniform two-variable random exchange. With this proposal and the flat prior over P, the Metropolis-Hastings acceptance probability is simply the ratio of likelihoods with A masked to have zeros above its diagonal (through a masking matrix M),

ξ_→⋆ = N(X|(P⋆_r)⁻¹(M ⊙ P⋆_r A(P⋆_c)⁻¹)P⋆_c Z, Ψ) / N(X|P_r⁻¹(M ⊙ P_r A P_c⁻¹)P_c Z, Ψ) .

The procedure can be seen as a simple approach for generating hypotheses about good, close to triangular A, orderings in a model where the spike and slab prior provides a bias towards sparsity.

To learn DAGs we first perform inference on the factor model specified by the hierarchy in (3) to obtain a set of ordering candidates sorted according to their usage during sampling, after the burn-in period. It is possible that the estimate of A contains errors, e.g. a false zero entry in A allowing several orderings that lead to several lower triangular versions of A, only one of which is actually correct. Thus, we propose to use not only the best candidate but a set of top candidates of size mtop = 10. Then we perform inference on the DAG model corresponding to the structure search hierarchy in (7) for each one of the permutation candidates being considered, P_r^(1), .
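The order search above can be sketched as a Metropolis-Hastings step over permutation vectors. This is a simplified illustration, not the paper's implementation: Z, A and Ψ are held fixed, index-array permutations stand in for the matrices P_r and P_c, and np.tril plays the role of the mask M.

```python
import numpy as np

def masked_loglik(X, A, Z, psi, perm_r, perm_c):
    """Gaussian log-likelihood, up to an additive constant, with the permuted
    mixing matrix masked to be lower triangular (zeros above the diagonal)."""
    Ap = np.tril(A[perm_r][:, perm_c])     # mask M applied to the permuted A
    R = X[perm_r] - Ap @ Z[perm_c]         # residuals under the masked model
    return -0.5 * np.sum(R**2 / psi[perm_r, None])

def propose(perm, rng):
    """Uniform two-element random exchange of the order vector."""
    new = perm.copy()
    i, j = rng.choice(len(perm), size=2, replace=False)
    new[i], new[j] = new[j], new[i]
    return new

def mh_step(X, A, Z, psi, perm_r, perm_c, rng):
    """One Metropolis-Hastings step: flat prior over permutations, so the
    acceptance probability is just the masked-likelihood ratio."""
    pr, pc = propose(perm_r, rng), propose(perm_c, rng)
    log_ratio = (masked_loglik(X, A, Z, psi, pr, pc)
                 - masked_loglik(X, A, Z, psi, perm_r, perm_c))
    if np.log(rng.uniform()) < log_ratio:
        return pr, pc
    return perm_r, perm_c
```

Because the proposal is symmetric and the prior over orderings is uniform, the ratio of masked likelihoods is the only quantity needed.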
. . , P_r^(mtop). Finally, we select the DAG model among the candidates using the predictive distribution for the DAG when a test set is available, or just the likelihood if not.

5 Predictive distributions and model comparison

Given that our model produces both DAG and factor model estimates at the same time, it is also interesting to estimate whether one option is better than the other given the observed data, for example in exploratory analysis when the DAG assumption is just one reasonable option. In order to perform the model comparison, we use predictive densities p(X⋆|X, M) with M = {M_FA, M_DAG}, instead of marginal likelihoods, because the latter are difficult and expensive to compute by sampling, requiring for example thermodynamic integration. With Gibbs sampling, we draw samples from the posterior distributions p(A, Ψ, λ|X, ·) and p(B, λ_1, . . . , λ_m|X, ·). The average over the extensive variables associated with the test points, p(Z⋆|·), is a bit more complicated, because naively drawing samples from p(Z⋆|·) gives an estimator with high variance when ψ_i ≪ υ_jn. In the following we describe how to do it for each model, omitting the permutation matrices for clarity.

Factor model  We can compute the predictive distribution by taking the likelihood in equation (3) and marginalizing Z. Since the integral has no closed form, we approximate it using the Gaussian distribution from the scale mixture representation as

p(X⋆|A, Ψ, ·) = ∫ p(X⋆|A, Z, Ψ)p(Z|·)dZ ≈ (1/rep) Σ_rep Π_n N(x⋆_n|0, A⊤U_nA + Ψ) ,

where U_n = diag(υ_1n, . . . , υ_dn), the υ_jn are sampled from the prior, and rep is the number of samples generated to approximate the intractable integral (rep = 500 in the experiments).
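A possible Monte Carlo implementation of this approximation for a single test point is sketched below. We write the per-sample covariance as AU_nA⊤ + Ψ, which follows from x = Az + ε (the text's A⊤U_nA ordering would correspond to a transposed convention for A; the form used here is our assumption). A log-sum-exp is used for numerical stability.

```python
import numpy as np

def predictive_logdensity(x_star, A, Psi, lam, rep=500, rng=None):
    """Monte Carlo approximation of log p(x*|A, Psi, lam): average the Gaussian
    densities obtained by conditioning on mixing variances drawn from the
    exponential prior of the Laplace scale mixture."""
    if rng is None:
        rng = np.random.default_rng()
    d = len(x_star)
    logs = np.empty(rep)
    for r in range(rep):
        # upsilon_j ~ Exponential with mean 2/lam^2 (scale-mixture prior).
        U = np.diag(rng.exponential(scale=2.0 / lam**2, size=d))
        C = A @ U @ A.T + np.diag(Psi)     # Gaussian covariance for this draw
        sign, logdet = np.linalg.slogdet(C)
        quad = x_star @ np.linalg.solve(C, x_star)
        logs[r] = -0.5 * (d * np.log(2 * np.pi) + logdet + quad)
    # log of the average of the rep Gaussian densities (log-sum-exp trick)
    m = logs.max()
    return m + np.log(np.mean(np.exp(logs - m)))
```

In the paper this quantity would additionally be averaged over posterior draws of A, Ψ and λ; here those are simply passed in as fixed arguments.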
Then we can average over p(A, Ψ, λ|X, ·) to obtain p(X⋆|X, M_FA).

DAG  In this case the predictive distribution is rather easy, because the marginal over Z in equation (4) is just a Laplace distribution with mean BX,

p(X⋆|B, ·) = ∫ p(X⋆|B, X, Z)p(Z|·)dZ = Π_{i,n} Laplace(x⋆_in|[BX]_in, λ_i) ,

where [BX]_in is the element indexed by the i-th row and n-th column of BX. In practice we compute the predictive densities for a particular X⋆ during sampling and then select the model based on their ratio. Note that both predictive distributions depend directly on λ, the rate of the Laplace distribution, making the estimates highly dependent on its value. This is why it is important to have the hyperprior on λ of equation (5) instead of just fixing its value.

6 Existing work

Among the existing approaches to DAG learning, our work is most closely related to LiNGAM (Linear Non-Gaussian Acyclic Model for causal discovery) [3], with several important differences. Since LiNGAM relies on fastICA to learn the mixing matrix, it is not inherently sparse; hence a pruning procedure based on the Wald statistic and second order model-fit information has to be applied after obtaining an ordering for the variables. The order search in LiNGAM assumes that there are no estimation errors during fastICA model inference, so a single ordering candidate is produced. LiNGAM produces and selects a final model among several candidates, but in contrast to our method such candidates are not DAGs with different variable orderings but DAGs with different sparsity levels. The factor model inference in LiNGAM, namely fastICA, is very efficient; however, its structure search involves repeated inversions of matrices of size d² × d², which can make it prohibitive for large problems.
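Unlike the factor model case, the DAG predictive density is available in closed form. A small sketch, assuming λ_i parameterizes the Laplace rate so that the density of each residual is (λ_i/2)exp(−λ_i|r|):

```python
import numpy as np

def dag_predictive_loglik(X_star, B, lam):
    """Exact log predictive density of the linear DAG: under the Laplace prior
    on z, each residual x*_in - [B X*]_in is Laplace with rate lam_i."""
    R = X_star - B @ X_star              # residuals z = (I - B) x*
    lam = np.asarray(lam, dtype=float)[:, None]
    return np.sum(np.log(lam / 2.0) - lam * np.abs(R))
```

The model selection ratio then compares this quantity against the (averaged) factor model predictive density on the same held-out X⋆.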
More explicitly, the computational complexity of LiNGAM is roughly O(N_fit d⁶), where N_fit is the number of model-fit evaluations. In contrast, the complexity in our case is O(N_ite d² N), where N_ite is the total number of samples, including the burn-in periods of both the factor model and the DAG inferences. Finally, our model is more principled in the sense that the whole approach lies within the same Bayesian framework; as a result it can be extended to, for example, binary data or time series by selecting suitable prior distributions.

Much work on Bayesian models for DAG learning already exists. For example, the approach presented in [16] is a Gaussian Bayesian network and therefore suffers from lack of identifiability. Besides, order search is performed directly for the DAG model, making it necessary to use longer sampler runs with a number of computational tricks when the problem is large (d > 10), i.e. when exhaustive order enumeration is not an option.

7 Experiments

We consider four sets of experiments in the following. The first two consist of extensive experiments using artificial data, the third addresses the model comparison scenario and the last one uses real data previously published in [17]. In every case we ran 2000 samples after a burn-in period of 4000 iterations and three independent chains for the factor model, and a single chain with 1000 samples and 2000 as burn-in for the DAG². Hyperparameter settings are discussed in Section 3.

LiNGAM suite  We evaluate the performance of our model against LiNGAM³ using the artificial model generator presented in [3]. The generator produces both dense and sparse networks with different degrees of sparsity; Z is generated from a non-Gaussian heavy-tailed distribution, X is generated using equation (1) and then randomly permuted to hide the correct order, P.
For the experiment we generated 1000 different datasets/models using d = {5, 10} and N = {200, 500, 1000, 2000}, and the DAG was selected using the (training set) likelihood in equation (7). Results are summarized in Figure 2 using several performance measures. For the particular case of the area under the ROC curve (AUC), we use the conditional posterior of the masking matrix, i.e. p(R|X, ·), where R is a matrix with elements r_ij. AUC is an important measure because it quantifies how the model accounts for the uncertainty of the presence or absence of links in the DAG. Such uncertainty assessment is not possible in LiNGAM, where the probability of having a link is simply zero or one; however, the AUC can still be computed.

Figure 2: Performance measures for the LiNGAM suite. Symbols are: square for 5 variables, star for 10 variables, solid line for sFA and dashed line for LiNGAM. (a) True positive rate. (b) True negative rate. (c) Frequency of AUC being greater than 0.9. (d) Number of estimated correct orderings.

In terms of true negative rates, AUC and ordering error rate, our approach is significantly better than LiNGAM. The true positive rate results in Figure 2(a) show that LiNGAM outperforms our approach only for N = 2000.
However, comparing this to the true negative rate, it seems that LiNGAM prefers denser models, which could be an indication of overfitting. Looking at the ordering errors, our model is clearly superior. It is important to mention that being able to compute a probability for a link in the DAG to be zero, p(b_ij ≠ 0|X, ·), turns out to be very useful in practice, for example to reject links with high uncertainty or to rank them. To give an idea of running times on a regular two-core 2.5GHz machine, for d = 10 and N = 500: LiNGAM took on average 10 seconds and our method 170 seconds. However, when doubling the number of variables the times were 730 and 550 seconds for LiNGAM and our method respectively, which is in agreement with our complexity estimates.

²Source code available upon request (C with Matlab interface).
³Matlab package available at http://www.cs.helsinki.fi/group/neuroinf/lingam/.

Bayesian networks repository  Next we compare some of the state-of-the-art (Gaussian) approaches to DAG learning on 7 well known structures⁴, namely alarm, barley, carpo, hailfinder, insurance, mildew and water (d = 37, 48, 61, 56, 27, 35, 32, respectively). A single dataset of size 1000 per structure was generated using a procedure similar to the one used before. Apart from ours (sFA), we considered the following methods⁵: standard DAG search (DS), order-search (OS), sparse candidate pruning then DAG-search (DSC) [6], L1MB then DAG-search (DSL) [8], and sparse candidate pruning then order-search (OSC) [7].
Results are shown in Figure 3, including the number of reversed links found due to ordering errors.

Figure 3: Performance measures for the Bayesian networks repository experiments (structures alarm, barley, carpo, hailfinder, insurance, mildew and water; methods DS, OS, OSC, DSC, DSL and sFA). (a) False positive rate. (b) False negative rate. (c) AUC. (d) Reversed links.

In this case, our approach obtained slightly better results in terms of the false positive rate, Figure 3(a). The true negative rate is comparable to that of the other methods, suggesting that our model is in some cases sparser than the others. The AUC estimates are significantly better because we have continuous probabilities for links being zero (in the other methods we had to use a binary value). From Figure 3(d), the number of reversed links in the other methods is quite high, as expected due to the lack of identifiability. Our model produced a small number of reversed links because it was not able to find any of the true orderings, but it found something quite close. These results could be improved by running the sampler for a longer time or by considering more candidates. We also tried to run the other approaches with data generated from Gaussian distributions, but the results were approximately equal to those shown in Figure 3. On the other hand, our approach performs similarly but the number of reversed links increases significantly since the model is no longer identified. The most important advantage of the (Gaussian) methods used in this experiment is their speed. In all cases they are considerably faster than sampling based methods.
Their speed makes them very suitable for large scale problems, regardless of their identifiability issues.

Model comparison  For this experiment we generated 1000 different datasets/models with d = 5 and N = {500, 1000}, in a similar way to the first experiment, but this time we selected the true model to be a factor model or a DAG uniformly at random. In order to generate a factor model we basically just need to be sure that A cannot be permuted to a triangular form. We kept 20% of the data to compute the predictive densities used to select between all estimated DAG candidates and the factor model. We found that for N = 500 our approach was able to select true DAGs 91.5% of the time and true factor models 89.2% of the time, corresponding to an overall error of 9.6%. For N = 1000 the true DAG and true factor model rates increased to 98.5% and 94.6%, respectively. These results demonstrate that our approach is very effective at selecting, between the two proposed hypotheses, the true underlying structure in the data.

Protein-signaling network  The dataset introduced in [17] consists of flow cytometry measurements of 11 phosphorylated proteins and phospholipids (Raf, Erk, p38, Jnk, Akt, Mek, PKA, PKC, PIP2, PIP3, PLCγ). Each observation is a vector of quantitative amounts measured from single cells, generated from a series of stimulatory cues and inhibitory interventions. The dataset contains both observational and experimental data.
Here we use only the 1755 samples corresponding to pure observational data, and we randomly selected 20% of the data to compute the predictive densities. Using the entire set would produce a richer model; however, interventions are out of the scope of this paper.

⁴http://compbio.cs.huji.ac.il/Repository/.
⁵Parameters: 10000 iterations, 5 candidates (SC, DSC), max fan-in of 5 (OS, OSC), and Or strategy and MDL penalty (DSL).

Figure 4: Results for the protein-signaling network. (a) Textbook signaling network as reported in [17]. (b) Estimated structure using Bayesian networks [17]. (c) Estimated structure using our model. (d) Test likelihoods for the best ordering DAG (dashed) and the factor model (solid). (e) Likelihood ratios (solid) and structure errors (dashed) for all candidates considered by our method and their usage. The Bayesian network is not able to identify the direction of the links with only observational data.

The textbook ground truth and results are presented in Figure 4. Of the 21 possible links in Figure 4(a), the model from [17] was able to find 9, but also one falsely added link.
In figure 4(b) a marginal-likelihood-equivalent prior is used, so the method cannot make any inferences about directionality from observational data alone. Our model in figure 4(c) was able to find 10 true links, one falsely added link and only two reversed links: one of them, PIP2 → PIP3, is bidirectional according to the ground truth, and the other, PLCγ → PIP3, was also found reversed when using experimental data in [17]. Note from figure 4(e) that the predictive density ratios correlate quite well with the structural accuracy. The predictive densities for the best candidate (the sixth in figure 4(e)) are shown in figure 4(d) and suggest that the factor model is the better option, which makes sense considering that the estimated DAG in figure 4(c) is a substructure of the ground truth. We also examined the estimated factor model and found that three factors could correspond to unmeasured proteins (PI3K, MKK and IP3); see figure 2 and table 3 in [17]. We also tried the methods above. Their results were very similar to ours in terms of true positives (≈ 9) and true negatives (≈ 32); however, none of them was able to produce fewer than 6 reversed links, which corresponds to approximately two-thirds of the total true positives.

8 Discussion

We have proposed a novel approach to perform inference and model comparison of sparse factor models and DAGs within the same framework. The key ingredients for both Bayesian models are spike and slab priors to promote sparsity, heavy-tailed priors to ensure identifiability and predictive densities to perform the comparison. A set of candidate orderings is produced by the factor model. Subsequently, a linear DAG is learned for each of the candidates. To the authors' knowledge this is the first time that a method for comparing such closely related linear models has been proposed.
This setting can be very beneficial in situations where the prior evidence suggests a DAG structure, unmeasured variables, or both. For example, in the protein-signaling network [17] the textbook ground truth suggests both a DAG structure and a number of unmeasured proteins. The previous approach [17] only performed structure learning in DAGs, but our results suggest that the data is better explained by the factor model. For further exploration of this dataset we would need to modify our approach to handle hybrid models, i.e. graphs with directed/undirected links and observed/latent nodes, as well as to use experimental data. Our Bayesian hierarchical approach is very flexible; we are currently investigating extensions to other source distributions (non-parametric Dirichlet processes, temporal Gaussian processes and discrete distributions).

References

[1] M. West. Bayesian factor regression models in the "large p, small n" paradigm. In J. Bernardo, M. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith, and M. West, editors, Bayesian Statistics 7, pages 723–732. Oxford University Press, 2003.

[2] J. Lucas, C. Carvalho, Q. Wang, A. Bild, J. R. Nevins, and M. West. Bayesian Inference for Gene Expression and Proteomics, chapter Sparse Statistical Modeling in Gene Expression Genomics, pages 155–176. Cambridge University Press, 2006.

[3] S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003–2030, October 2006.

[4] D. M. Chickering. Learning Bayesian networks is NP-complete. In D. Fisher and H.-J. Lenz, editors, Learning from Data: AI and Statistics, pages 121–130. Springer-Verlag, 1996.

[5] I. Tsamardinos, L. E. Brown, and C. F. Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm.
Machine Learning, 65(1):31–78, October 2006.

[6] N. Friedman, I. Nachman, and D. Pe'er. Learning Bayesian network structure from massive datasets: The "sparse candidate" algorithm. In K. B. Laskey and H. Prade, editors, UAI, pages 206–215, 1999.

[7] M. Teyssier and D. Koller. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. In UAI, pages 548–549, 2005.

[8] M. W. Schmidt, A. Niculescu-Mizil, and K. P. Murphy. Learning graphical model structure using L1-regularization paths. In AAAI, pages 1278–1283, 2007.

[9] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197–243, January 1995.

[10] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, March 2000.

[11] P. Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287–314, December 1994.

[12] C. M. Carvalho, J. Chang, J. E. Lucas, J. R. Nevins, Q. Wang, and M. West. High-dimensional sparse factor modeling: Applications in gene expression genomics. Journal of the American Statistical Association, 103(484):1438–1456, December 2008.

[13] A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley-Interscience, May 2001.

[14] D. F. Andrews and C. L. Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical Society: Series B (Methodology), 36(1):99–102, 1974.

[15] T. Park and G. Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–686, June 2008.

[16] N. Friedman and D. Koller. Being Bayesian about network structure: A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50(1–2):95–125, January 2003.

[17] K. Sachs, O. Perez, D. Pe'er, D. A. Lauffenburger, and G. P. Nolan.
Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523–529, April 2005.