{"title": "A mixture model for the evolution of gene expression in non-homogeneous datasets", "book": "Advances in Neural Information Processing Systems", "page_first": 1297, "page_last": 1304, "abstract": "We address the challenge of assessing conservation of gene expression in complex, non-homogeneous datasets. Recent studies have demonstrated the success of probabilistic models in studying the evolution of gene expression in simple eukaryotic organisms such as yeast, for which measurements are typically scalar and independent. Models capable of studying expression evolution in much more complex organisms such as vertebrates are particularly important given the medical and scientific interest in species such as human and mouse. We present a statistical model that makes a number of significant extensions to previous models to enable characterization of changes in expression among highly complex organisms. We demonstrate the efficacy of our method on a microarray dataset containing diverse tissues from multiple vertebrate species. We anticipate that the model will be invaluable in the study of gene expression patterns in other diverse organisms as well, such as worms and insects.", "full_text": "A mixture model for the evolution of gene expression\n\nin non-homogeneous datasets\n\nGerald Quon1, Yee Whye Teh2, Esther Chan3, Timothy Hughes3, Michael Brudno1,3,\n\n1Department of Computer Science, and 3Banting and Best Department of Medical Research,\n\nQuaid Morris3\n\nUniversity of Toronto, Canada,\n\n2Gatsby Computational Neuroscience Unit, University College London, United Kingdom\n\n{gerald.quon,quaid.morris}@utoronto.ca\n\nAbstract\n\nWe address the challenge of assessing conservation of gene expression in com-\nplex, non-homogeneous datasets. Recent studies have demonstrated the success\nof probabilistic models in studying the evolution of gene expression in simple\neukaryotic organisms such as yeast, for which measurements are typically scalar\nand independent. 
Models capable of studying expression evolution in much more\ncomplex organisms such as vertebrates are particularly important given the medi-\ncal and scienti\ufb01c interest in species such as human and mouse. We present Brow-\nnian Factor Phylogenetic Analysis, a statistical model that makes a number of\nsigni\ufb01cant extensions to previous models to enable characterization of changes\nin expression among highly complex organisms. We demonstrate the ef\ufb01cacy of\nour method on a microarray dataset pro\ufb01ling diverse tissues from multiple verte-\nbrate species. We anticipate that the model will be invaluable in the study of gene\nexpression patterns in other diverse organisms as well, such as worms and insects.\n\n1 Introduction\n\nHigh-throughput functional data is emerging as an indispensable resource for generating a complete\npicture of genome-wide gene and protein function. Currently, gene function is often inferred through\nsequence comparisons with genes of known function in other species, though sequence similarity\nis no guarantee of shared biological function. Gene duplication, one of the primary forces of ge-\nnomic evolution, often gives rise to genes with high sequence similarity but distinct biological roles\n[1]. Differences in temporal and spatial gene expression patterns have also been posited to explain\nphenotypic differences among animals despite a surprisingly large degree of gene sequence simi-\nlarity [2]. 
This observation and the increasingly wide availability of genome-wide gene expression\npro\ufb01les from related organisms have motivated us to develop statistical models to study the evolu-\ntion of gene expression along phylogenies, in order to identify lineages where gene expression and\ntherefore gene function is likely to be conserved or diverged.\n\nComparing gene expression patterns between distantly related multi-cellular organisms is challeng-\ning because it is dif\ufb01cult to collect a wide range of functionally matching tissue samples. In some\ncases, matching samples simply may not exist because some organismal functions have been redis-\ntributed among otherwise homologous organs. For example, processes such as B-cell development\nare performed by both distinct and overlapping sets of tissues: primarily bone marrow in mammals;\nbursa of Fabricius and bone marrow in birds; and likely kidney, spleen, and/or thymus in teleost \ufb01sh\n(which lack bone marrow) [3]. Matching samples can also be hard to collect because anatomical ar-\nrangements of some of the queried organisms make isolation of speci\ufb01c tissues virtually impossible.\nFor example, in frog, the kidneys are immediately adjacent to the ovaries and are typically covered in\noocytes. By allowing tissue samples to be mixed and heterogeneous, though functionally related, it\nbecomes possible to compare expression patterns describing a much larger range of functions across\na much larger range of organisms.\n\nCurrent detailed statistical models of expression data assume measurements from matched samples\nin each organism. 
As such, comparative studies of gene expression to date have either resorted to\nsimple, non-phylogenetic measures to compare expression patterns [4], or restricted their compar-\nisons to single-cellular organisms [5] or clearly homologous tissues in mammals [6].\n\nHere, we present Brownian Factor Phylogenetic Analysis (BFPA), a new model of gene expres-\nsion evolution that removes the earlier limitations of matched samples, therefore allowing detailed\ncomparisons of expression patterns from the widely diverged multi-cellular organisms. Our model\ntakes as input expression pro\ufb01les of orthologous genes in multiple present-day organisms and a phy-\nlogenetic tree connecting those organisms, and simultaneously reconstructs the expression pro\ufb01les\nfor the ancestral nodes in the phylogenetic tree while detecting links in the phylogeny where rapid\nchange of the expression pro\ufb01le has occurred.\n\nWe model the expression data from related organisms using a mixture of Gaussians model related\nto a mixture of constrained factor analyzers [7]. In our model, each mixture component represents a\ndifferent pattern of conservation and divergence of gene expression along each link of the phyloge-\nnetic tree. We assume a constrained linear mapping between the heterogeneous samples in different\norganisms and \ufb01t this mapping using maximum likelihood. We show that by expanding the amount\nof expression data that can be compared between species, our model generates more useful infor-\nmation for predicting gene function and is also better able to reconstruct the evolutionary history of\ngene expression as evidenced by its increased accuracy in reconstructing gene expression levels.\n\n2 Previous work\n\nRecent evolutionary models of gene expression treat it as a quantitative (i.e. real-valued) trait and\nmodel evolutionary change in expression levels as a Brownian motion process [8, 9]. 
Assuming\nBrownian motion, a given gene\u2019s expression level $x_s$ in a child species $s$ after a divergence time $t_s$\nfrom an ancestral species $\\pi(s)$ is predicted to be Gaussian distributed with a mean $x_{\\pi(s)}$ equal to the\ngene\u2019s expression level in the ancestor and variance $\\sigma^2 t_s$:\n\n$$x_s \\sim N(x_{\\pi(s)}, \\sigma^2 t_s) \\qquad (1)$$\n\nwhere $\\sigma^2$ represents the expected rate of change per unit time. The ancestor-child relationships are\nspeci\ufb01ed using a phylogeny, such as that shown in Figure 1a for the vertebrates. The leaves of the\nphylogeny are associated with present-day species and the internal branch points with shared\nancestors. The exact position of the root of the phylogeny (not shown in the \ufb01gure, but somewhere along\nbranch \u201cT\u201d) cannot be established without additional information, and the outgroup species \u201cT\u201d is\noften used in place of the root of the tree. Nonetheless, the rooted phylogeny can be interpreted as a\ndirected Gaussian graphical model, e.g. Figure 1b, whose nodes are variables representing expres-\nsion levels in the corresponding species and whose directed edges point from immediate ancestors\nto their children species. The conditional probability distribution (CPD) at each node is given by\nEquation 1.\n\nTypical uses of these evolutionary models are to compare different hypotheses about divergence\ntimes [8] or the structure of the phylogeny [9] by calculating the likelihood of the present-day ex-\npression levels under various hypotheses. 
To avoid assigning a prior over the root node and\nthus introducing bias [10], Felsenstein developed a method called restricted maximum likelihood\n(REML) [11], which speci\ufb01es a distribution over the observed differences between present-day ex-\npression levels rather than the expression levels themselves.\n\n3 Brownian Factor Phylogenetic Analysis: A model of expression evolution\n\nIn the following section, we propose changes to the Brownian motion model that not only allow\nfor unmatched tissue samples, but also leverage the change observed in expression levels across\nmultiple genes in order to classify genes into different patterns of expression evolution. We use $x^i_s$\nto indicate the hidden expression pro\ufb01le of the i-th gene (out of N ortholog groups) in species s.\n\nFigure 1: Our statistical model and associated species phylogenies. (a) The phylogeny of the species\nmeasured in our dataset of human (H), mouse (M), chicken (C), frog (F), and tetraodon (T), as well\nas an example phylogeny of three hypothetical species x1, x2, and x3 used to illustrate our model.\n(b) Our statistical model showing how the outgroup species x3 and its corresponding observed ex-\npression levels \u02c6x3 is used as a gene expression prior. Edge weights on the graph depict scaling\nfactors applied to the variance terms \u03a3, which are speci\ufb01ed by each conservation pattern c. 1 de-\nnotes no scaling on that branch, whereas \u03c1 > 1 depicts a longer, and thus unconserved, branch. This\nparticular conservation pattern represents a phylogeny where all species have conserved expression.\nThe scale on the bottom shows hypothetical values for x1, x2, and x3, as well as the inferred value\nfor x12. 
(c) The same model except applied to a conservation pattern where species x3 is determined\nto exhibit signi\ufb01cantly different expression levels (rapid change).\n\nThe inputs to our model are vectors of tissue-speci\ufb01c expression levels $\\{\\hat{x}^i_s\\}_{i=1}^N$ for $N$ genes over\npresent-day species $s \\in \\{P \\cup o\\}$; we distinguish the chosen outgroup species $o$ from the rest of\nthe present-day species $P$. Each $\\hat{x}^i_s \\in \\mathbb{R}^{d_s}$, where $d_s$ is the number of tissues in species $s$. The goal of\nour model is to infer each gene\u2019s corresponding pattern of gene expression evolution (conservation\npattern) $\\{c^i\\}_{i=1}^N$ and latent expression levels $\\{x^i_s\\}_{i=1}^N$ for all species $s \\in \\{P \\cup o \\cup A\\}$, where $A$\nrepresents the internal ancestral species in the phylogenetic tree (Figure 1). The likelihood function\n$L = P(\\{\\hat{x}^i_P, x^i_{P \\cup o \\cup A}, c^i\\}_{i=1}^N \\mid \\{\\hat{x}^i_o\\}_{i=1}^N, \\theta)$ is shown below, where $\\pi(s)$ refers to the parent species\nof $s$, $\\theta = (\\Lambda, \\Sigma, \\beta, \\rho, \\gamma)$ are the model parameters, and $N(x; \\mu, \\Sigma)$ is the density of $x$ under a\nmultivariate normal distribution with mean $\\mu$ and covariance $\\Sigma$:\n\n$$L = \\prod_i \\Big[ \\Big( \\prod_{s \\in P \\cup A} P(x^i_s \\mid x^i_{\\pi(s)}, c^i, \\theta) \\Big) \\times \\Big( \\prod_{s \\in P} P(\\hat{x}^i_s \\mid x^i_s, \\beta) \\Big) P(x^i_o \\mid \\hat{x}^i_o, \\beta) \\, P(c^i \\mid \\gamma) \\Big]$$\n\n$$P(x^i_s \\mid x^i_{\\pi(s)}, c^i = K_j, \\theta) = N(x^i_s; \\Lambda_s x^i_{\\pi(s)}, \\rho_s^{K_{j,s}} \\Sigma_s) \\qquad (2)$$\n\n$$P(\\hat{x}^i_s \\mid x^i_s, \\beta) = N(\\hat{x}^i_s; x^i_s, \\beta_s) \\qquad (3)$$\n\n$$P(c^i = K_j \\mid \\gamma) = \\gamma_j \\qquad (4)$$\n\nModeling branch lengths. Equation 2 re\ufb02ects the central assumption of Brownian motion models [8,\n9, 10] described in Equation 1. 
BFPA extends this concept in two directions.\nFirst, we constrain all variances $\\Sigma_s$ to be diagonal in order to estimate tissue-speci\ufb01c drift rates,\nas tissues are known to vary widely in expression divergence rates [12]. Secondly, we note that in\nstudying a diverse lineage such as vertebrates, we expect to see large changes in expression for genes\nthat have diverged in function, as compared to genes of conserved function. We therefore model the\ndrift of a gene\u2019s expression levels along each branch of the tree as following one of two rates: a\nslow rate, re\ufb02ecting a functional constraint, and a fast rate, re\ufb02ecting neutral or selected change.\nCorrespondingly, for each branch of the phylogenetic tree above the species $s$, we de\ufb01ne two rate\nparameters, $\\rho_s^{2}$ or $\\rho_s^{1}$, termed a short and long branch respectively ($\\rho_s^{2} < \\rho_s^{1}$). 
We \ufb01x $\\rho_s^{2} = 1.0$\nand initialize $\\rho_s^{1}$ to a much larger value to maintain this relationship during learning, thus modeling\nfast-moving genes as outliers. Our method of modelling constrained and unconstrained change as\nscalar multiples of a common variance is similar to the discrete gamma method [13].\n\nLinear relationship between ancestral and child tissues. We model tissues of child species as linear\ncombinations of ancestral tissues. The matrix of coef\ufb01cients $\\Lambda_s$ that relates expression levels in\nthe child species\u2019 tissues to those of its parent species is heavily constrained to leverage our prior\nunderstanding of the relationships of speci\ufb01c tissues [14]. To construct $\\Lambda_s$, pairs of tissues that were\nclearly homologous (e.g. the heart) had their corresponding entry in $\\Lambda_s$ \ufb01xed at 1, and all other\nentries in the same row set to zero. 
For the remaining tissues, literature searches were conducted\nto determine which groups of tissues had broadly related function (e.g. immune tissues), and those\nentries were allowed to vary from zero. All other entries were constrained to be zero.\n\nDistinguishing intra- and inter-species variation. Equation 3 relates the observed expression levels\nof present-day species to the noiseless, inferred expression levels of the corresponding hidden nodes\nof each observed species. The variance factor $\\beta_s$ is an estimate of the variation expected due to\nnoise in the array measurements, and is estimated via maximum likelihood using multiple identical\nprobes present on each microarray.\n\nConservation pattern estimation. Our goal is to identify different types of expression evolution,\nincluding punctuated evolution, fully conserved expression, or rapid change along all branches of\nthe phylogeny. We model the problem as a mixture model of conservation patterns, in which each\nconservation pattern speci\ufb01es either constrained or fast change along each branch of the tree. Each\nconservation pattern $K_j \\in \\{1, 2\\}^{|P \\cup A|}$ speci\ufb01es a con\ufb01guration of $\\rho_s^{1}$ or $\\rho_s^{2}$ for each species $s$\n($K_{j,s} \\in \\{1, 2\\}$ speci\ufb01es $\\rho_s^{K_{j,s}}$). However, not all $2^{|P \\cup A|}$ possible patterns of short and long branches\ncan be uniquely considered. In particular, a tree containing at least one ancestor incident to two long\nbranches and one short is ambiguous because this tree cannot be distinguished from the same tree\nwith that ancestor incident to three long branches. As a post-processing step, we consider short\nbranches in those cases to be long, and sum over such ambiguous trees, leaving a total of $J$ possible\nconservation patterns. 
Each pattern $K_j$ is assigned a prior probability $P(K_j) = \\gamma_j$ that is learned,\nas re\ufb02ected in Equation 4.\n\n4 Inference\n\nBecause our graphical model contains no cycles, we can apply belief propagation to perform exact\ninference and obtain the posterior distributions $P(c^i = K_j \\mid \\hat{x}^i, \\theta)$, $\\forall i, j$:\n\n$$\\delta_{ij} = P(c^i = K_j \\mid \\hat{x}^i, \\theta) \\propto \\int P(x^i_{P \\cup o \\cup A}, \\hat{x}^i_P, c^i = K_j \\mid \\hat{x}^i_o, \\theta) \\, \\partial x^i_{P \\cup o \\cup A} \\qquad (5)$$\n\nWe can also estimate the distributions over expression levels of a species $s'$ as\n\n$$P(x^i_{s'} \\mid \\hat{x}^i, \\theta) \\propto \\sum_j \\int P(x^i_{P \\cup o \\cup A}, \\hat{x}^i_P, c^i = K_j \\mid \\hat{x}^i_o, \\theta) \\, \\partial x^i_{P \\cup o \\cup A \\setminus s'} \\qquad (6)$$\n\n5 Learning\n\nApplying the expectation maximization (EM) algorithm yields the following maximum likelihood\nestimates of the model parameters, where $E_{s,s|K_j} = E[x^i_s x^{iT}_s \\mid \\hat{x}^i_s, c^i = K_j]$, $E_{s,\\pi(s)|K_j} = E[x^i_s x^{iT}_{\\pi(s)} \\mid \\hat{x}^i_s, c^i = K_j]$, and $E_{\\pi(s),\\pi(s)|K_j} = E[x^i_{\\pi(s)} x^{iT}_{\\pi(s)} \\mid \\hat{x}^i_s, c^i = K_j]$:\n\n$$\\hat{\\Lambda}_s = \\Big( \\sum_{i=1}^N \\sum_{j=1}^J \\frac{\\delta_{ij}}{\\rho_s^{K_{j,s}}} E_{s,\\pi(s)|K_j} \\Big) \\Big( \\sum_{i=1}^N \\sum_{j=1}^J \\frac{\\delta_{ij}}{\\rho_s^{K_{j,s}}} E_{\\pi(s),\\pi(s)|K_j} \\Big)^{-1}$$\n\n$$\\hat{\\Sigma}_s = \\frac{1}{N} \\mathrm{diag} \\Big\\{ \\sum_{i=1}^N \\sum_{j=1}^J \\frac{\\delta_{ij}}{\\rho_s^{K_{j,s}}} \\Big( E_{s,s|K_j} - 2 \\Lambda_s E^T_{s,\\pi(s)|K_j} + \\Lambda_s E_{\\pi(s),\\pi(s)|K_j} \\Lambda_s^T \\Big) \\Big\\}$$\n\n$$\\hat{\\rho}_s^{k} = \\Big( \\sum_i \\sum_j [K_{j,s} = k] \\, \\delta_{ij} \\, \\dim(x^i_s) \\Big)^{-1} \\Big( \\sum_i \\sum_j [K_{j,s} = k] \\, \\delta_{ij} \\times \\big( \\mathrm{tr}[E_{s,s|K_j} \\Sigma_s^{-1}] + \\mathrm{tr}\\big[ \\Lambda_s^T \\Sigma_s^{-1} ( -2 E_{s,\\pi(s)|K_j} + \\Lambda_s E_{\\pi(s),\\pi(s)|K_j} ) \\big] \\big) \\Big) \\qquad (7)$$\n\n$$\\hat{\\gamma}_j = \\frac{\\sum_{i=1}^N \\delta_{ij}}{N} \\qquad (8)$$\n\nAlthough we have rooted the phylogeny using a present-day species rather than 
placing a hypothetical\nroot as has been done in previous Brownian motion models, these two models are related because\nthey are equivalent under the condition that all samples are matched. First, note that in traditional\nBrownian motion models, the location of the root is arbitrary if one assumes a constant, improper\nprior over the root expression levels, since any choice of root would give rise to the same probability\ndistribution over the expression levels. By using a present-day species with observed expression\nlevels as the root node, we avoid integrating over this improper prior. Because the root node prior\nis constant, the likelihood of the other present-day species conditioned on this present-day root\nexpression level is a constant times the likelihood of all present-day species expression levels. Our\nconditional model therefore assigns the same likelihoods and marginals as REML.\n\n6 Results\n\nWe present the results of applying our model to a novel dataset consisting of gene expression mea-\nsurements of 4770 genes with unique, unambiguous orthology, i.e., each of the 4770 genes is present\nin only a single copy, across the following \ufb01ve present-day organisms: human, mouse, chicken, frog,\nand tetraodon. The phylogeny relating these species is shown in Figure 1 with nodes labelled by the\n\ufb01rst letter of the species name. We set tetraodon as the root, so o = T and P = {H, M, C, F},\nand we label the internal ancestors by concatenating the labels of their present-day descendants, so\nA = {HM, HMC, HMCF}.\n\nReplicate microarray probe intensity measurements were taken for the 4770 genes across a total of\n161 tissues (i.e., 322 microarrays in total) in the \ufb01ve organisms: 46 tissues from human, 55 from\nmouse, and 20 from each of the other three organisms. 
We applied a standard pre-processing pipeline\nto the array set to remove experimental artifacts and to transform the probe intensity measurements\non each array to a common, variance-stabilized scale. Each array was \ufb01rst spatially detrended as\ndescribed in [15]. Within a species, all arrays share the same probe set, so we applied VSN [16]\nto the arrays from each species to estimate an array-speci\ufb01c af\ufb01ne transform to transform the probe\nintensities to species-speci\ufb01c units. We next applied an arcsinh transform to the probe intensities\nto make the variance of the noise independent of the intensity measurement. For the \ufb01nal two pre-\nprocessing steps, we placed the transformed intensity measurements into a matrix for each species.\nThe rows of this matrix correspond to genes and the columns are the measured tissues. First, to re-\nmove probe bias in the transformed intensities, we subtracted the row median from each element and\nthen to attempt to transform measurements from different species to a common scale, we subtracted\nthe column means from each element and divided by the column length.\n\nFirst, we investigate the stability of our conservation pattern estimates by using parameters trained on\ndifferent random subsamples of our genes. We then evaluate the predictive value of our algorithm\nBFPA using two tasks: a) predicting gene expression pro\ufb01les in a new species given expression\npro\ufb01les in other species, and b) predicting Gene Ontology annotation using the conservation pattern\ninferred by our model.\n\nTo perform the stability experiments, we \ufb01rst randomly split the dataset into \ufb01ve subsets, and used\neach subset individually to train the model using 100 iterations of EM. We then estimated P (ci|\u02c6xi\ns, \u03b8)\nfor the four other subsets of genes, and classi\ufb01ed each gene into its most likely conservation pattern.\nHence, each gene is classi\ufb01ed four times by non-overlapping training sets. 
Figure 2 shows that\nthe classi\ufb01cations are quite stable and that most genes are classi\ufb01ed into few conservation patterns.\nMost genes that were uniquely classi\ufb01ed into a single conservation pattern either were classi\ufb01ed as\nfully (all) conserved or completely unconserved, resulting in relatively few high-con\ufb01dence lineage-\nspeci\ufb01c genes.\n\nFigure 2: Stability of conservation pattern assignments to genes. (left) Each gene was placed into\none of four bins, denoting the number of unique patterns it was classi\ufb01ed into. Most genes were\nconsistently classi\ufb01ed into one conservation pattern for all four of its independent classi\ufb01cations.\n(right) For all genes uniquely classi\ufb01ed into a single conservation pattern, the number of present-\nday species adjacent to conserved links was computed. Most genes were either classi\ufb01ed as fully\n(all) conserved or completely unconserved.\n\n6.1 Functional associations of co-transcriptionally evolving genes\n\nPairs of genes exhibiting correlated expression also tend to perform similar functions. This guilt-\nby-association principle is often used to initially assign putative functions to genes. For example,\na popular method for analyzing gene expression datasets is to cluster genes based on the pairwise\nPearson correlation coef\ufb01cient (PCC), then measure the enrichment of these clusters in Gene On-\ntology (GO) function and process annotations [17]. In this section, we introduce the evolutionary\ncorrelation coef\ufb01cient (ECC), a simple modi\ufb01cation of PCC to integrate model predictions, and ex-\namine whether genes with the same annotated function are more similar in rank according to the ECC\nor PCC measures. 
ECC scales the positively-transformed PCC by the marginal probability of the\ngenes following the same expression evolution, assuming independent evolution.\n\n$$\\mathrm{ECC}(\\hat{x}^i, \\hat{x}^k) = \\big(1 + \\mathrm{PCC}(\\hat{x}^i, \\hat{x}^k)\\big) \\sum_j P(c^i = j \\mid \\hat{x}^i, \\theta) \\, P(c^k = j \\mid \\hat{x}^k, \\theta)$$\n\nECC can be applied using the output of either BFPA or the Brownian model. For the Brownian\nmodel, we trained and made predictions using only those matched samples in all \ufb01ve species. Those\nten samples are the central nervous system (CNS), intestine, heart, kidney, liver, eye, muscle, spleen,\nstomach, and testis. We also introduce ECC-sequence, designed to measure the value of evolutionary\ninformation derived from sequence. First, the protein sequences of each gene were aligned using\ndefault parameters of MUSCLE [18]. These alignments were then passed to PAML [19] together\nwith the species tree shown in Figure 1 to estimate branch lengths. The PCC measure for each pair\nof genes was then scaled by the Pearson correlation coef\ufb01cient of the branch lengths estimated by\nPAML to produce ECC-sequence.\n\nFor all models, we \ufb01rst used the ECC/PCC similarity metric for each gene to rank all other genes\nin order of expression similarity. We then applied the Wilcoxon Rank Sum test to evaluate whether\ngenes with the same GO annotations, as annotated for the mouse ortholog, are signi\ufb01cantly higher in\nrank than all other genes. For this analysis, we only considered GO Process categories which have\nat least one of the 4770 genes annotated in that category. We also removed all genes which were not\nannotated in any category, resulting in a total of 3319 genes and 4246 categories.\n\nFigure 3 illustrates the distribution of smallest p-values achieved by each gene over all of their anno-\ntated functions. PCC is used as a baseline performance measure as it does not consider evolutionary\ninformation. 
We see that all evolutionary-based models outperform PCC in ranking genes with sim-\nilar function much closer on average. ECC-sequence performs worse than the expression-based evolutionary models, suggesting that\nexpression-based evolutionary metrics may provide additional information compared to those based\non sequence. The relative performance of BFPA versus Brownian re\ufb02ects an overall signi\ufb01cant per-\nformance gap between our models and the existing ones. A control measure ECC-random is shown,\nwhich is computed by randomizing the gene labels of the data in each of the \ufb01ve organisms before\nlearning. Finally, Brown+prior measures the performance of the Brownian model when the conser-\nvation pattern priors are allowed to be estimated, and performs better than the Brownian model but\nworse than BFPA, as expected. All differences between the distributions are statistically signi\ufb01cant,\nas all pairwise p-values computed by the Kolmogorov-Smirnov test are less than $10^{-6}$.\n\n[Figure 2 axes: (left) # of conservation patterns vs. # genes; (right) # conserved species vs. # genes]\n\nFigure 3: Model performance. (left) A reverse cumulative distribution plot of p-values obtained\nfrom applying the Wilcoxon Rank Sum test using either a PCC or ECC-based similarity metric. The\nsmallest p-value achieved for each gene across all its annotated functions is used in the distribution.\nPosition (x, y) indicates that for y genes, their p-value was less than $10^{-x}$. Higher lines on the graph\ntranslate into stronger associations between expression levels and gene function, which we interpret\nas better performance. 
(right) This graph shows the difference in the total number of expression\nvalues for which a particular method achieves the lowest error, sorted by species.\n\n6.2 Reconstruction of gene expression levels\n\nHere we report the performance of our model in predicting the expression level of a gene in each\nof human, mouse, chicken, and frog, given its expression levels in the other species. Tetraodon is\nnot predicted because it acts as an outgroup in our model. The model was trained using 100 EM\niterations on half of the dataset, which was then used to predict the expression levels for each gene\nin each species in the other half of the dataset, and vice versa. To create a baseline performance\nmeasure, we computed the error when using an average of the four other species to predict the\nexpression level of a gene in the \ufb01fth species. We only compute predictions for the ten matched\nsamples across all species so that we can compare errors made by our model against those of Brow-\nnian and the baseline, which require matched samples. Figure 3 shows that with the exception of\nthe comparison against Brownian in chicken, BFPA achieves lower error than both Brownian and\nbaseline in predicting expression measurements.\n\n7 Discussion\n\nWe have presented a new model for the simultaneous evolution of gene expression levels across mul-\ntiple tissues and organs. Given expression data from present-day species, our model can be used to\nsimultaneously infer the ancestral expression levels of orthologous genes as well as determine where\nin the phylogeny the gene expression levels underwent substantial change. BFPA extends previous\nBrownian models [8, 9] by introducing a constrained factor analysis framework to account for com-\nplex tissue relationships between different species and by adapting the discrete gamma method [13]\nto model quantitative gene expression data. 
Our model performs better than other Brownian models\nin functional association and expression prediction experiments, demonstrating that the evolution-\nary history we infer better recovers the function of the gene. We have shown that this is in large\npart due to our ability to consider species-speci\ufb01c tissue measurements, a feature not implemented\nin any existing model to the best of our knowledge. We also showed that gene expression-based\nphylogenetic data may provide information not contained in sequence-based phylogenetic data in\nterms of helping predict the functional association of genes.\n\nOur model has a number of other applications outside of using it to study the evolutionary history of\ngene expression. Our ability to identify genes with conserved expression across multiple species will\nhelp in the inference of gene function from annotated to non-annotated species because unconserved\nexpression patterns indicate a likely change in the biological function of a gene. We also expect\nthat by identifying species that share a conserved expression pattern, our model will aid in the\nidenti\ufb01cation of transcriptional cis-regulatory elements by focusing the search for cis-elements on\nthose species identi\ufb01ed as conserved in expression.\n\n[Figure 3 axes: (left) \u2212log10(p-value) vs. # genes, with curves for BFPA, Brown+prior, Brown, PCC, ECC-sequence, and ECC-random; (right) species (H, M, C, F) vs. difference in wins, for [BFPA]wins \u2212 [Brown]wins and [BFPA]wins \u2212 [baseline]wins]\n\nWhile we have taken different pro\ufb01led samples as representing different tissues, our methodology\ncan be easily expanded to study evolutionary change in gene expression in response to different\ngrowth conditions or environmental stresses, as with those studied in [5]. Our methodology is\nalso easily extensible to other model organisms for which there are genomes and expression data\nfor multiple closely related species (e.g. yeast, worm, \ufb02y, plants). 
We anticipate that the results\nobtained will be invaluable in the study of genome evolution and identi\ufb01cation of cis-regulatory\nelements, whose phylogeny should re\ufb02ect that of the gene expression patterns.\n\nAll data used in this publication can be obtained by a request to the authors.\n\nReferences\n\n[1] Li, W., Yang, J., Gu, X. (2005) Expression divergence between duplicate genes. Trends Genet., 21, 602-607.\n\n[2] Chen, K., Rajewsky, N. (2007) The evolution of gene regulation by transcription factors and microRNAs.\nNature Rev. Genet., 8, 93-103.\n\n[3] Yergeau, D.A. et al. (2005) bloodthirsty, an RBCC/TRIM gene required for erythropoiesis in zebra\ufb01sh.\nDev. Biol., 283, 97-112.\n\n[4] Stuart, J.M., Segal, E., Koller, D., Kim, S.K. (2003) A gene-coexpression network for global discovery of\nconserved genetic modules. Science, 302, 249-255.\n\n[5] Tirosh, I., Weinberger, A., Carmi, M., Barkai, N. (2006) A genetic signature of interspecies variations in\ngene expression. Nat. Genet., 38, 830-834.\n\n[6] Khaitovich, P. et al. (2005) A neutral model of transcriptome evolution. PLoS. Biol., 2, 682-689.\n\n[7] Ghahramani, Z., & Hinton, G.E. (1996) The EM algorithm for mixtures of factor analyzers. Technical\nReport CRG-TR-96-2, University of Toronto.\n\n[8] Gu, X. (2004) Statistical framework for phylogenomic analysis of gene family expression pro\ufb01les. Genetics,\n167, 531-542.\n\n[9] Oakley, T.H. et al. (2005) Comparative methods for the analysis of gene-expression evolution: an example\nusing yeast functional genomic data. Mol. Biol. Evol., 22, 40-50.\n\n[10] Felsenstein, J. (2004) Inferring phylogenies. Sunderland (Massachusetts): Sinauer Associates. 664 p.\n\n[11] Felsenstein, J. (1981) Evolutionary trees from gene-frequencies and quantitative characters - \ufb01nding max-\nimum likelihood estimates. Evolution, 35, 1229-1242.\n\n[12] Khaitovich et al. (2006) Evolution of primate gene expression. Nat. Rev. Genet., 7, 693-702.\n\n[13] Yang, Z. 
(1994) Maximum likelihood phylogenetic estimation from DNA sequences with variable rates\nover sites: approximate methods. J. Mol. Evol., 39, 306-314.\n\n[14] Kardong, K.V. (2006) Vertebrates: comparative anatomy, function, evolution. McGraw-Hill. 782 p.\n\n[15] Zhang, W., Morris, Q.D. et al. (2004) The functional landscape of mouse gene expression. J. Biol., 3, 21.\n\n[16] Huber, W. et al. (2002) Variance stabilization applied to microarray data calibration and to the quanti\ufb01ca-\ntion of differential expression. Bioinformatics, 18, S96-104.\n\n[17] The Gene Ontology Consortium. (2000) Gene Ontology: tool for the uni\ufb01cation of biology. Nature Genet.,\n25, 25-29.\n\n[18] Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput.\nNucleic Acids Res., 32, 1792-1797.\n\n[19] Yang, Z. (2007) PAML 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol., 24, 1586-1591.\n\n8\n\n\f", "award": [], "sourceid": 517, "authors": [{"given_name": "Gerald", "family_name": "Quon", "institution": null}, {"given_name": "Yee", "family_name": "Teh", "institution": null}, {"given_name": "Esther", "family_name": "Chan", "institution": null}, {"given_name": "Timothy", "family_name": "Hughes", "institution": null}, {"given_name": "Michael", "family_name": "Brudno", "institution": null}, {"given_name": "Quaid", "family_name": "Morris", "institution": null}]}