{"title": "Inferring Network Structure from Co-Occurrences", "book": "Advances in Neural Information Processing Systems", "page_first": 1105, "page_last": 1112, "abstract": "", "full_text": "Inferring Network Structure from Co-Occurrences\r\n\r\nMichael G. Rabbat Electrical and Computer Eng. University of Wisconsin Madison, WI 53706 rabbat@cae.wisc.edu\r\n\r\nMario A.T. Figueiredo Instituto de Telecomunicações, Instituto Superior Técnico, Lisboa, Portugal mtf@lx.it.pt\r\n\r\nRobert D. Nowak Electrical and Computer Eng. University of Wisconsin Madison, WI 53706 nowak@ece.wisc.edu\r\n\r\nAbstract\r\nWe consider the problem of inferring the structure of a network from co-occurrence data: observations that indicate which nodes occur in a signaling pathway but do not directly reveal node order within the pathway. This problem is motivated by network inference problems arising in computational biology and communication systems, in which it is difficult or impossible to obtain precise time ordering information. Without order information, every permutation of the activated nodes leads to a different feasible solution, resulting in a combinatorial explosion of the feasible set. However, physical principles underlying most networked systems suggest that not all feasible solutions are equally likely. Intuitively, nodes that co-occur more frequently are probably more closely connected. Building on this intuition, we model path co-occurrences as randomly shuffled samples of a random walk on the network. 
We derive a computationally efficient network inference algorithm and, via novel concentration inequalities for importance sampling estimators, prove that a polynomial complexity Monte Carlo version of the algorithm converges with high probability.\r\n\r\n1 Introduction\r\nThe study of complex networked systems is an emerging field impacting nearly every area of engineering and science, including the important domains of biology, cognitive science, sociology, and telecommunications. Inferring the structure of signalling networks from experimental data precedes any such analysis and is thus a basic and fundamental task. Measurements which directly reveal network structure are often beyond experimental capabilities or are excessively expensive. This paper addresses the problem of inferring the structure of a network from co-occurrence data: observations which indicate nodes that are activated in each of a set of signaling pathways but do not directly reveal the order of nodes within each pathway. Co-occurrence observations arise naturally in a number of interesting contexts, including biological and communication networks, and networks of neuronal colonies. Biological signal transduction networks describe fundamental cell functions and responses to environmental stress [1]. Although it is possible to test for individual, localized interactions between gene pairs, this approach (called genetic epistatic analysis) is expensive and time-consuming. High-throughput measurement techniques such as microarrays have successfully been used to identify the components of different signal transduction pathways [2]. However, microarray data reflect order information only at a very coarse, unreliable level. Developing computational techniques for inferring pathway orders is a largely unexplored research area [3]. A similar problem has been studied in telecommunication networks [4]. In this context, each path corresponds to a transmission between an origin and destination. 
The origin and destination are observed, in addition to the activated switches/routers carrying the transmission through the network. However, due to the geographically distributed nature of the measurement infrastructure and the rapidity at which transmissions are completed, it is not possible to obtain precise ordering information. Another exciting potential application arises in neuroimaging [5, 6]. Functional magnetic resonance imaging provides images of brain activity with high spatial resolution but has relatively poor temporal resolution. Treating distinct brain regions as nodes in a functional brain network that co-activate when a subject performs different tasks may lead to a similar network inference problem.\r\n\r\nGiven a collection of co-occurrences, a feasible network (consistent with the observations) is easily obtained by assigning an order to the elements of each co-occurrence, thereby specifying a path through the hypothesized network. Since any arbitrary order of each co-occurrence leads to a feasible network, the number of feasible solutions is proportional to the number of permutations of all the co-occurrence observations. Consequently we are faced with combinatorial explosion of the feasible set, and without additional assumptions or side information there is no reason to prefer one particular feasible network over the others. See the supplementary document [7] for further discussion.\r\n\r\nDespite the apparent intractability of the problem, physical principles governing most networks suggest that not all feasible solutions are equally plausible. Intuitively, nodes that co-occur more frequently are more likely to be connected in the underlying network. This intuition has been used as a stepping stone by recent approaches proposed in the context of telecommunications [4], and in learning networks of collaborators [8]. 
However, because of their heuristic nature, these approaches do not produce easily interpreted results and do not readily lend themselves to analysis or to the incorporation of side information. In this paper, we model co-occurrences as randomly permuted samples of a random walk on the underlying network. The random permutation accounts for lack of observed order. We refer to this process as the shuffled Markov model. In this framework, network inference amounts to maximum likelihood estimation of the parameters governing the random walk (initial state distribution and transition matrix). Direct maximization is intractable due to the highly non-convex log-likelihood function and exponential feasible set arising from simultaneously considering all permutations of all co-occurrences. Instead, we derive a computationally efficient EM algorithm, treating the random permutations as hidden variables. In this framework the likelihood factorizes with respect to each pathway/observation, so that the computational complexity of the EM algorithm is determined by the E-step which is only exponential in the longest path. In order to handle networks with long paths, we propose a Monte Carlo E-step based on a simple, linear complexity importance sampling scheme. Whereas the exact E-step has computational complexity which is exponential in path length, we prove that a polynomial number of importance samples suffices to retain desirable convergence properties of the EM algorithm with high probability. In this sense, our Monte Carlo EM algorithm breaks the curse of dimensionality using randomness. It is worth noting that the approach described here differs considerably from that of learning the structure of a directed graphical model or Bayesian network [9, 10]. The aim of graphical modelling is to find a graph corresponding to a factorization of a high-dimensional distribution which predicts the observations well. 
These probabilistic models do not directly reflect physical structures, and applying such an approach to co-occurrences would ignore physical constraints inherent to the observations: co-occurring vertices must lie along a path in the network.\r\n\r\n2 Model Formulation and EM Algorithm\r\n2.1 The Shuffled Markov Model\r\n\r\nWe model a network as a directed graph $G = (V, E)$, where $V = \\{1, \\ldots, |V|\\}$ is the vertex (node) set and $E \\subseteq V^2$ is the set of edges (direct connections between vertices). An observation, $y \\subset V$, is a subset of vertices co-activated when a particular stimulus is applied to the network (e.g., collection of signaling proteins activated in response to an environmental stress). Given a set of $T$ observations, $Y = \\{y^{(1)}, \\ldots, y^{(T)}\\}$, each corresponding to a path, where $y^{(m)} = \\{y^{(m)}_1, \\ldots, y^{(m)}_{N_m}\\}$, we say that a graph $(V, E)$ is feasible w.r.t. $Y$ if for each $y^{(m)} \\in Y$ there is an ordered path $z^{(m)} = (z^{(m)}_1, \\ldots, z^{(m)}_{N_m})$ and a permutation $\\tau^{(m)} = (\\tau^{(m)}_1, \\ldots, \\tau^{(m)}_{N_m})$ such that $z^{(m)}_t = y^{(m)}_{\\tau^{(m)}_t}$ and $(z^{(m)}_{t-1}, z^{(m)}_t) \\in E$, for $t = 2, \\ldots, N_m$.\r\n\r\nThe (unobserved) ordered paths, $Z = \\{z^{(1)}, \\ldots, z^{(T)}\\}$, are modelled as $T$ independent samples of a first-order Markov chain with state set $V$. The Markov chain is parameterized by the initial state distribution $\\pi$ and the (stochastic) transition matrix $A$. We assume that the support of the transition matrix is determined by the adjacency structure of the graph; i.e., $A_{i,j} > 0 \\Leftrightarrow (i, j) \\in E$. Each observation $y^{(m)}$ results from shuffling the elements of $z^{(m)}$ via an unobserved permutation $\\tau^{(m)}$, drawn uniformly from $S_{N_m}$ (the set of all permutations of $N_m$ objects); i.e., $z^{(m)}_t = y^{(m)}_{\\tau^{(m)}_t}$, for $t = 1, \\ldots, N_m$. All the $\\tau^{(m)}$ are assumed mutually independent and independent of all the $z^{(m)}$. Under this model, the log-likelihood of the set of observations $Y$ is\r\n\r\n$$\\log P[Y | A, \\pi] = \\sum_{m=1}^{T} \\left( \\log \\sum_{\\tau \\in S_{N_m}} P[y^{(m)} | \\tau, A, \\pi] - \\log(N_m!) \\right), \\tag{1}$$\r\n\r\nwhere $P[y | \\tau, A, \\pi] = \\pi_{y_{\\tau_1}} \\prod_{t=2}^{N} A_{y_{\\tau_{t-1}}, y_{\\tau_t}}$, and network inference consists in computing the maximum likelihood (ML) estimates $(A_{ML}, \\pi_{ML}) = \\arg\\max_{A, \\pi} \\log P[Y | A, \\pi]$. With the ML estimates in hand, we may determine the most likely permutation for each $y^{(m)}$ and obtain a feasible reconstruction from the ordered paths. In general, $\\log P[Y | A, \\pi]$ is a non-concave function of $(A, \\pi)$, so finding $(A_{ML}, \\pi_{ML})$ is not easy. Next, we derive an EM algorithm for this purpose, by treating the permutations as missing data.\r\n\r\n2.2 EM Algorithm\r\n\r\nLet $w^{(m)} = (w^{(m)}_1, \\ldots, w^{(m)}_{N_m})$ be a binary representation of $z^{(m)}$, defined by $w^{(m)}_t = (w^{(m)}_{t,1}, \\ldots, w^{(m)}_{t,|V|}) \\in \\{0, 1\\}^{|V|}$, with $(w^{(m)}_{t,i} = 1) \\Leftrightarrow (z^{(m)}_t = i)$; let $W = \\{w^{(1)}, \\ldots, w^{(T)}\\}$. Let $X = \\{x^{(1)}, \\ldots, x^{(T)}\\}$ be the binary representation for $Y$, defined in a similar way: $x^{(m)} = (x^{(m)}_1, \\ldots, x^{(m)}_{N_m})$, where $x^{(m)}_t = (x^{(m)}_{t,1}, \\ldots, x^{(m)}_{t,|V|}) \\in \\{0, 1\\}^{|V|}$, with $(x^{(m)}_{t,i} = 1) \\Leftrightarrow (y^{(m)}_t = i)$. Finally, let $R = \\{r^{(1)}, \\ldots, r^{(T)}\\}$ be the collection of permutation matrices corresponding to $\\mathcal{T} = \\{\\tau^{(1)}, \\ldots, \\tau^{(T)}\\}$; i.e., $(r^{(m)}_{t,t'} = 1) \\Leftrightarrow (\\tau^{(m)}_t = t')$. With this notation in place, the complete log-likelihood can be written as $\\log P[X, R | A, \\pi] = \\log P[X | R, A, \\pi] + \\log P[R]$, where\r\n\r\n$$\\log P[X | R, A, \\pi] = \\sum_{m=1}^{T} \\log P[x^{(m)} | r^{(m)}, A, \\pi] = \\sum_{m=1}^{T} \\sum_{i,j=1}^{|V|} \\sum_{t',t''=1}^{N_m} \\left( \\sum_{t=2}^{N_m} r^{(m)}_{t,t''} r^{(m)}_{t-1,t'} \\right) x^{(m)}_{t',i} x^{(m)}_{t'',j} \\log A_{i,j} + \\sum_{m=1}^{T} \\sum_{i=1}^{|V|} \\sum_{t'=1}^{N_m} r^{(m)}_{1,t'} x^{(m)}_{t',i} \\log \\pi_i, \\tag{2}$$\r\n\r\nand $P[R]$ is the probability of the set of permutations, which is constant and thus dropped, since the permutations are independent and equiprobable. The EM algorithm proceeds by computing (the E-step) $Q(A, \\pi; A^k, \\pi^k) = E[\\log P[X, R | A, \\pi] \\mid X, A^k, \\pi^k]$, the expected value of $\\log P[X, R | A, \\pi]$ w.r.t. 
the missing $R$, conditioned on the observations and on the current model estimate $(A^k, \\pi^k)$. Examining $\\log P[X, R | A, \\pi]$ reveals that it is linear w.r.t. simple functions of $R$: (a) the first row of each $r^{(m)}$, i.e., $r^{(m)}_{1,t'}$; (b) sums of transition indicators, i.e., $\\alpha^{(m)}_{t',t''} \\equiv \\sum_{t=2}^{N_m} r^{(m)}_{t,t''} r^{(m)}_{t-1,t'}$. Consequently, the E-step reduces to computing the conditional expectations of $r^{(m)}_{1,t'}$ and $\\alpha^{(m)}_{t',t''}$, denoted $\\bar{r}^{(m)}_{1,t'}$ and $\\bar{\\alpha}^{(m)}_{t',t''}$, respectively, and plugging them into the complete log-likelihood (2), which yields $Q(A, \\pi; A^k, \\pi^k)$. Since the permutations are (a priori) equiprobable, we have $P[r^{(m)}] = (N_m!)^{-1}$, $P[r^{(m)}_{1,t'} = 1] = (N_m - 1)!/N_m! = 1/N_m$, and $P[r^{(m)} | r^{(m)}_{1,t'} = 1] = 1/(N_m - 1)!$. Using these facts, the mutual independence among different observations, and Bayes law, it is not hard to show that\r\n\r\n$$\\bar{r}^{(m)}_{1,t'} = \\frac{\\gamma^{(m)}_{t'}}{\\sum_{s=1}^{N_m} \\gamma^{(m)}_{s}}, \\quad \\text{with} \\quad \\gamma^{(m)}_{t'} = \\sum_{r:\\, r_{1,t'} = 1} P[x^{(m)} | r, A^k, \\pi^k], \\tag{3}$$\r\n\r\nwhere each term $P[x^{(m)} | r, A^k, \\pi^k]$ is easily computed after using $r$ to \"unshuffle\" $x^{(m)}$:\r\n\r\n$$P[x^{(m)} | r, A^k, \\pi^k] = P[y^{(m)} | \\tau, A^k, \\pi^k] = \\pi^k_{y^{(m)}_{\\tau_1}} \\prod_{t=2}^{N_m} A^k_{y^{(m)}_{\\tau_{t-1}}, y^{(m)}_{\\tau_t}}.$$\r\n\r\nThe computation of $\\bar{\\alpha}^{(m)}_{t',t''}$ is similar to that of $\\bar{r}^{(m)}_{1,t'}$; the key observations are that $P[r^{(m)}_{t,t''} r^{(m)}_{t-1,t'} = 1] = (N_m - 2)!/N_m!$ and $P[r^{(m)} | r^{(m)}_{t,t''} r^{(m)}_{t-1,t'} = 1] = 1/(N_m - 2)!$, leading to\r\n\r\n$$\\bar{\\alpha}^{(m)}_{t',t''} = \\frac{\\gamma^{(m)}_{t',t''}}{\\sum_{s=1}^{N_m} \\gamma^{(m)}_{s}}, \\quad \\text{with} \\quad \\gamma^{(m)}_{t',t''} = \\sum_{r} P[x^{(m)} | r, A^k, \\pi^k] \\sum_{t=2}^{N_m} r_{t,t''} r_{t-1,t'}. \\tag{4}$$\r\n\r\nComputing $\\{\\bar{r}^{(m)}_{1,t'}\\}$ and $\\{\\bar{\\alpha}^{(m)}_{t',t''}\\}$ requires $O(N_m!)$ operations. 
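As an illustration of the exact E-step for a single observation, the following minimal Python sketch enumerates every ordering, which makes the factorial cost explicit. The function name exact_e_step, the 0-based indexing, and the use of nested lists for A and pi are assumptions made for this sketch and are not taken from the paper. Here r_bar[s] is the posterior probability that y[s] comes first on the path, and alpha_bar[s][t] is the expected number of times y[s] is immediately followed by y[t].

```python
from itertools import permutations


def exact_e_step(y, A, pi):
    '''Brute-force E-step statistics for one observation (hypothetical helper).

    y  : list of distinct vertex ids (0-based), in observed (shuffled) order
    A  : transition matrix as a list of rows, A[i][j] = P(i -> j)
    pi : initial state distribution as a list
    '''
    n = len(y)
    first_mass = [0.0] * n                      # mass of orderings starting at y[s]
    pair_mass = [[0.0] * n for _ in range(n)]   # mass of orderings with y[s] -> y[t]
    total = 0.0
    for tau in permutations(range(n)):          # all n! orderings of the positions
        p = pi[y[tau[0]]]
        for prev, cur in zip(tau, tau[1:]):
            p *= A[y[prev]][y[cur]]
        total += p
        first_mass[tau[0]] += p
        for prev, cur in zip(tau, tau[1:]):
            pair_mass[prev][cur] += p
    if total == 0.0:
        raise ValueError('observation has zero likelihood under (A, pi)')
    r_bar = [g / total for g in first_mass]
    alpha_bar = [[g / total for g in row] for row in pair_mass]
    return r_bar, alpha_bar


# Deterministic cycle 0 -> 1 -> 2 that always starts at vertex 0,
# observed in the shuffled order [2, 0, 1].
A = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
pi = [1.0, 0.0, 0.0]
r_bar, alpha_bar = exact_e_step([2, 0, 1], A, pi)
print(r_bar)  # [0.0, 1.0, 0.0]: position 1 (vertex 0) must come first
```

The loop visits all N! permutations, so this sketch is only practical for short observations.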
For large $N_m$, this is a heavy load; in Section 3, we describe a sampling approach for computing approximations to $\\bar{r}^{(m)}_{1,t'}$ and $\\bar{\\alpha}^{(m)}_{t',t''}$. Maximization of $Q(A, \\pi; A^k, \\pi^k)$ w.r.t. $A$ and $\\pi$, under the normalization constraints, leads to the M-step:\r\n\r\n$$A^{k+1}_{i,j} = \\frac{\\sum_{m=1}^{T} \\sum_{t',t''=1}^{N_m} \\bar{\\alpha}^{(m)}_{t',t''} x^{(m)}_{t',i} x^{(m)}_{t'',j}}{\\sum_{j=1}^{|V|} \\sum_{m=1}^{T} \\sum_{t',t''=1}^{N_m} \\bar{\\alpha}^{(m)}_{t',t''} x^{(m)}_{t',i} x^{(m)}_{t'',j}} \\quad \\text{and} \\quad \\pi^{k+1}_i = \\frac{\\sum_{m=1}^{T} \\sum_{t'=1}^{N_m} \\bar{r}^{(m)}_{1,t'} x^{(m)}_{t',i}}{\\sum_{i=1}^{|V|} \\sum_{m=1}^{T} \\sum_{t'=1}^{N_m} \\bar{r}^{(m)}_{1,t'} x^{(m)}_{t',i}}. \\tag{5}$$\r\n\r\nStandard convergence results for the EM algorithm due to Boyles and Wu [11, 12] guarantee that the sequence $\\{(A^k, \\pi^k)\\}$ converges monotonically to a local maximum of the likelihood.\r\n\r\n2.3 Handling Known Endpoints\r\n\r\nIn some applications, (one or both of) the endpoints of each path are known and only the internal nodes are shuffled. For example, in telecommunications problems, the origin and destination of each transmission are known, but not the network connectivity. In estimating biological signal transduction pathways, a physical stimulus (e.g., hypotonic shock) causes a sequence of protein interactions, resulting in another observable physical response (e.g., a change in cell wall structure); in this case, the stimulus and response act as fixed endpoints, and the goal is to infer the order of the sequence of protein interactions. Knowledge of the endpoints of each path imposes the constraints $r^{(m)}_{1,1} = 1$ and $r^{(m)}_{N_m,N_m} = 1$. Under the first constraint, estimates of the initial state probabilities are simply given by $\\pi_i = \\frac{1}{T} \\sum_{m=1}^{T} x^{(m)}_{1,i}$. Thus, EM only needs to be used to estimate $A$. In this setup, the E-step has a similar form as (4) but with sums over $r$ replaced by sums over permutation matrices satisfying $r_{1,1} = 1$ and $r_{N,N} = 1$. The M-step update for $A^{k+1}$ remains unchanged.\r\n\r\n3 Large Scale Inference via Importance Sampling\r\nFor long paths, the combinatorial nature of the exact E-step, summing over all permutations of each sequence in (3) and (4), may render exact computation intractable. 
This section presents a Monte Carlo importance sampling (see, e.g., [13]) version of the E-step, along with finite sample bounds guaranteeing that a polynomial complexity Monte Carlo EM algorithm retains desirable convergence properties of the EM algorithm; i.e., monotonic convergence to a local maximum.\r\n\r\n3.1 Monte Carlo E-Step by Importance Sampling\r\n\r\nTo lighten notation in this section we drop the superscripts from $(A^k, \\pi^k)$, using simply $(A, \\pi)$ for the current parameter estimates. Moreover, since the statistics $\\bar{\\alpha}^{(m)}_{t',t''}$ and $\\bar{r}^{(m)}_{1,t'}$ depend only on the mth co-activation observation, $y^{(m)}$, we focus on a particular length-$N$ path observation $y = (y_1, y_2, \\ldots, y_N)$ and drop the superscript $(m)$. A naive Monte Carlo approximation would be based on random permutations sampled from the uniform distribution on $S_N$. However, the reason we resort to approximation techniques in the first place is that $S_N$ is large, but typically only a small fraction of its elements have non-negligible posterior probability, $P[\\tau | y, A, \\pi]$. Although we would ideally sample directly from the posterior, this would require determining its value for all $N!$ permutations. Instead, we propose the following sequential scheme for sampling a permutation using the current parameter estimates, $(A, \\pi)$. To ensure the same element is not sampled twice we introduce a vector of binary flags, $f = (f_1, f_2, \\ldots, f_{|V|}) \\in \\{0, 1\\}^{|V|}$. Given a probability distribution $p = (p_1, p_2, \\ldots, p_{|V|})$ on the vertex set, $V$, denote by $p|_f$ the restriction of $p$ to those elements $i \\in V$ for which $f_i = 1$; i.e.,\r\n\r\n$$(p|_f)_i = \\frac{p_i f_i}{\\sum_{j=1}^{|V|} p_j f_j}, \\quad \\text{for } i = 1, 2, \\ldots, |V|. \\tag{6}$$\r\n\r\nOur sampling scheme proceeds as follows:\r\nStep 1: Initialize $f$ so that $f_i = 1$ if $y_t = i$ for some $t = 1, \\ldots, N$, and $f_i = 0$ otherwise. Sample an element $v$ from $V$ according to the distribution $\\pi|_f$ on $V$. Find $t$ such that $y_t = v$. Set $\\tau_1 = t$. Set $f_v = 0$ to prevent $y_t$ from being sampled again (ensure $\\tau$ is a permutation). Set $i = 2$.\r\nStep 2: Let $A_v$ denote the $v$th row of the transition matrix. Sample an element $v'$ from $V$ according to the distribution $A_v|_f$ on $V$. Find $t$ such that $y_t = v'$. Set $\\tau_i = t$. Set $f_{v'} = 0$.\r\nStep 3: While $i < N$, update $v \\leftarrow v'$ and $i \\leftarrow i + 1$ and repeat Step 2; otherwise, stop.\r\n\r\nRepeating this sampling procedure $L$ times yields a collection of iid permutations $\\tau^1, \\tau^2, \\ldots, \\tau^L$, where the superscript now identifies the sample number; the corresponding permutation matrices are $r^1, r^2, \\ldots, r^L$. Samples generated according to the scheme described above are drawn from a distribution $R[\\tau | x, A, \\pi]$ on $S_N$ which is different from the posterior $P[\\tau | x, A, \\pi]$. Importance sample estimates correct for this disparity and are given by the expressions\r\n\r\n$$\\hat{r}_{1,t'} = \\frac{\\sum_{\\ell=1}^{L} u_\\ell\\, r^{\\ell}_{1,t'}}{\\sum_{\\ell=1}^{L} u_\\ell} \\quad \\text{and} \\quad \\hat{\\alpha}_{t',t''} = \\frac{\\sum_{\\ell=1}^{L} u_\\ell \\sum_{t=2}^{N} r^{\\ell}_{t,t''} r^{\\ell}_{t-1,t'}}{\\sum_{\\ell=1}^{L} u_\\ell}, \\tag{7}$$\r\n\r\nwhere the correction factor (or weight) for sample $r^{\\ell}$ is given by\r\n\r\n$$u_\\ell = \\frac{P[r^{\\ell} | x, A, \\pi]}{R[r^{\\ell} | x, A, \\pi]} = \\frac{P[\\tau^{\\ell} | y, A, \\pi]}{R[\\tau^{\\ell} | y, A, \\pi]} \\propto \\prod_{t=2}^{N} \\sum_{t'=t}^{N} A_{y_{\\tau^{\\ell}_{t-1}}, y_{\\tau^{\\ell}_{t'}}}, \\tag{8}$$\r\n\r\nwhere the omitted proportionality factor is the same for every sample and therefore cancels in (7). A detailed derivation of the exact form of the induced distribution, $R$, and the correction factor, $u_\\ell$, based on the sequential nature of the sampling scheme, along with further discussion and comparison with alternative sampling schemes can be found in the supplementary document [7]. In fact, terms in the product (8) are readily available as a byproduct of Step 2 (denominator of $A_v|_f$).\r\n\r\n3.2 Monotonicity and Convergence\r\n\r\nStandard EM convergence results directly apply when the exact E-step is used [11, 12]. Let $\\theta^k = (A^k, \\pi^k)$. By choosing $\\theta^{k+1}$ according to (5) we have $\\theta^{k+1} = \\arg\\max_{\\theta} Q(\\theta; \\theta^k)$, and the monotonicity property, $Q(\\theta^{k+1}; \\theta^k) \\ge Q(\\theta^k; \\theta^k)$, is satisfied. 
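Returning briefly to the sequential sampling scheme of Section 3.1, Steps 1 to 3 and the self-normalized estimates (7) can be sketched in Python as follows. This is a hedged sketch rather than the authors' implementation: the names sample_permutation and importance_e_step, the seed argument, and the 0-based nested-list representation of A and pi are all assumptions, and it presumes that the restricted distributions always have positive total mass. The weight is accumulated from the Step 2 normalizers, as noted after (8), and the factor contributed by Step 1 is skipped because it is identical for every sample.

```python
import random


def sample_permutation(y, A, pi, rng):
    '''Draw one ordering of the positions of y; return (tau, weight).'''
    remaining = list(range(len(y)))     # positions of y still flagged as available
    # Step 1: draw the first element from pi restricted to the observed vertices.
    probs = [pi[y[s]] for s in remaining]
    t = rng.choices(remaining, weights=probs)[0]
    tau = [t]
    remaining.remove(t)
    weight = 1.0                        # Step 1 normalizer is constant, so omitted
    # Steps 2 and 3: extend the path using the row of A for the previous vertex.
    while remaining:
        prev = y[tau[-1]]
        probs = [A[prev][y[s]] for s in remaining]
        weight *= sum(probs)            # denominator of the restricted row
        t = rng.choices(remaining, weights=probs)[0]
        tau.append(t)
        remaining.remove(t)
    return tau, weight


def importance_e_step(y, A, pi, num_samples, seed=0):
    '''Self-normalized importance sampling estimates of the E-step statistics.'''
    rng = random.Random(seed)
    n = len(y)
    r_hat = [0.0] * n
    alpha_hat = [[0.0] * n for _ in range(n)]
    total = 0.0
    for _ in range(num_samples):
        tau, u = sample_permutation(y, A, pi, rng)
        total += u
        r_hat[tau[0]] += u
        for prev, cur in zip(tau, tau[1:]):
            alpha_hat[prev][cur] += u
    r_hat = [v / total for v in r_hat]
    alpha_hat = [[v / total for v in row] for row in alpha_hat]
    return r_hat, alpha_hat
```

Each sample costs a number of operations that grows polynomially with the path length, in contrast to the N! orderings visited by exact enumeration.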
Together with the fact that the marginal log-likelihood (1) is continuous in $\\theta$ and bounded above, the monotonicity property guarantees that the exact EM iterates converge monotonically to a local maximum of $\\log P[Y | \\theta]$. When the Monte Carlo E-step is used, we no longer have monotonicity since now the M-step solves $\\hat{\\theta}^{k+1} = \\arg\\max_{\\theta} \\hat{Q}(\\theta; \\hat{\\theta}^k)$, where $\\hat{Q}$ is defined analogously to $Q$ but with $\\bar{\\alpha}^{(m)}_{t',t''}$ and $\\bar{r}^{(m)}_{1,t'}$ replaced by $\\hat{\\alpha}^{(m)}_{t',t''}$ and $\\hat{r}^{(m)}_{1,t'}$; for monotonicity we need $Q(\\hat{\\theta}^{k+1}; \\hat{\\theta}^k) \\ge Q(\\hat{\\theta}^k; \\hat{\\theta}^k)$. To assure the Monte Carlo EM algorithm (MCEM) converges, the number of importance samples, $L$, must be chosen carefully so that $\\hat{Q}$ approximates $Q$ well enough; otherwise the MCEM may be swamped with error. Recently, Caffo et al. [14] have proposed a method, based on central limit theorem-like arguments, for automatically adapting the number of Monte Carlo samples used at each EM iteration. They guarantee what we refer to as an $(\\epsilon, \\delta)$-probably approximately monotonic (PAM) update, stating that $Q(\\hat{\\theta}^{k+1}; \\hat{\\theta}^k) \\ge Q(\\hat{\\theta}^k; \\hat{\\theta}^k) - \\epsilon$, with probability at least $1 - \\delta$.\r\n\r\nRather than resorting to asymptotic approximations, we take advantage of the specific form of $Q$ in our problem to obtain the finite-sample PAM result below. Because $Q(\\hat{\\theta}^{k+1}; \\hat{\\theta}^k)$ involves terms $\\log A^k_{i,j}$ and $\\log \\pi^k_i$, in practice we bound $A^k_{i,j}$ and $\\pi^k_i$ away from zero to ensure that $Q$ does not blow up. Specifically, we assume a small positive constant $\\theta_{\\min}$ so that $A^k_{i,j} \\ge \\theta_{\\min}$ and $\\pi^k_i \\ge \\theta_{\\min}$.\r\n\r\nTheorem 1 Let $\\epsilon, \\delta > 0$ be given. There exist finite constants $b_m > 0$, independent of $N_m$, so that if\r\n\r\n$$L_m = \\frac{2 b_m^2 T^2 N_m^4 |\\log \\theta_{\\min}|^2}{\\epsilon^2} \\log \\frac{2 N_m^2}{1 - (1 - \\delta)^{1/T}} \\tag{9}$$\r\n\r\nimportance samples are used for the mth observation, then $Q(\\hat{\\theta}^{k+1}; \\hat{\\theta}^k) \\ge Q(\\hat{\\theta}^k; \\hat{\\theta}^k) - \\epsilon$, with probability greater than $1 - \\delta$.\r\n\r\nThe proof involves two key steps. 
First, we derive finite sample concentration-style bounds for the importance sample estimates showing, e.g., that $\\hat{\\alpha}^{(m)}_{t',t''}$ converges to $\\bar{\\alpha}^{(m)}_{t',t''}$ at a rate which is exponential in the number of importance samples used. These bounds are based on novel concentration inequalities for importance sampling estimators, which may be of interest in their own right (see the supplementary document [7] for details). Then, accounting for the explicit form of $Q$ in our problem, the result follows from application of the union bound and the assumptions that $A^k_{i,j}, \\pi^k_i \\ge \\theta_{\\min}$. In fact, by making a slightly stronger assumption it can be shown that the MCEM update is probably monotonic (i.e., $(0, \\delta)$-PAM, not approximately monotonic) if $L_m$ importance samples are used for the mth observation, where $L_m$ also depends polynomially on $N_m$ and $T$. See the supplementary document [7] for further discussion and for the full proof of Theorem 1.\r\n\r\nRecall that exact E-step computation requires $N_m!$ operations for the mth observation (enumerating all permutations). The bound above stipulates that the number of importance samples required for a PAM update is on the order of $N_m^4 \\log N_m^2$. Generating one importance sample using the sequential procedure described above requires $N_m$ operations. In contrast to the (exponential complexity) exact EM algorithm, this shows that the MCEM converges with high probability while having only polynomial computational complexity, and, in this sense, the MCEM breaks the curse of dimensionality by using randomness to preserve the monotonic convergence property.\r\n\r\n4 Experimental Results\r\nThe performance of our algorithm for network inference from co-occurrences (NICO, pronounced \"nee-koh\") has been evaluated on both simulated data and on a biological data set. In these experiments, network structure is inferred by first executing the EM algorithm to infer the parameters $(A, \\pi)$ of a Markov chain. 
Then, inserting edges in the inferred graph based on the most likely order of each path according to $(A, \\pi)$ ensures the resulting graph is feasible with respect to the observations. Because the EM algorithm is only guaranteed to converge to a local maximum, we rerun the algorithm from multiple random initializations and choose the most likely of these solutions. To gauge the performance of our algorithm we use the edge symmetric difference error: the total number of false positives (edges in the inferred network which do not exist in the true network) plus the number of false negatives (edges in the true network not appearing in the inferred network).\r\n\r\nWe simulate co-occurrence observations in the following fashion. A random graph on 50 vertices is sampled. Disjoint sets of vertices are randomly chosen as path origins and destinations, paths are generated between each origin-destination pair using the shortest path algorithm with either unit weight per edge (\"shortest path\") or a random weight on each edge (\"random routing\"), and then co-occurrence observations are formed from each path. We keep the number of origins fixed at 5 and vary the number of destinations between 5 and 40 to see how the number of observations affects performance. NICO performance is compared against the frequency method (FM) described in [4]. Figure 1 plots the edge error for synthetic data generated using (a) shortest path routing, and (b) random routing. Each curve is the average performance over 100 different network and path realizations.\r\n\r\n[Figure 1: two panels, (a) Shortest path routes and (b) Random routes, each plotting Edge Symmetric Difference (0 to 7) against Num. Destinations (5 to 40) for Freq. Method (Sparsest), Freq. Method (Best), and NICO (ML).]\r\n\r\nFigure 1: Edge symmetric differences between inferred networks and the network one would obtain using co-occurrence measurements arranged in the correct order. Performance is averaged over 100 different network realizations. For each configuration 10 NICO and FM solutions are obtained via different initializations. We then choose the NICO solution yielding the largest likelihood, and compare with both the sparsest (fewest edges) and clairvoyant best (lowest error) FM solution.\r\n\r\nFor each network/path realization, the EM algorithm is executed with 10 random initializations. Exact E-step calculation is used for observations with $N_m \\le 12$, and importance sampling is used for longer paths. The longest observation in our data has $N_m = 19$. The FM uses simple pairwise frequencies of co-occurrence to assign an order independently to each path observation. Of the 10 NICO solutions (different random initializations), we use the one based on parameter estimates yielding the highest likelihood score, which also always gives the best performance. Because it is a heuristic, the FM does not provide a similar mechanism for ranking solutions from different initializations. We plot FM performance for two schemes: one based on choosing the sparsest FM solution (the one with the fewest edges), and one based on clairvoyantly choosing the FM solution with lowest error. NICO consistently outperforms even the clairvoyant best FM solution.\r\n\r\nOur method has also been applied to infer the stress-activated protein kinase (SAPK)/Jun N-terminal kinase (JNK) and NFκB signal transduction pathways¹ (biological networks). The clustering procedure described in [2] is applied to microarray data in order to identify 18 co-occurrences arising from different environmental stresses or growth factors (path source) and terminating in the production of SAPK/JNK or NFκB proteins. 
The reconstructed network (combined SAPK/JNK and NFκB signal transduction pathways) is depicted in Figure 2. This structure agrees with the signalling pathways identified using traditional experimental techniques which test individually for each possible edge (e.g., \"MAPK\" and \"NF-κB Signaling\" on http://www.cellsignal.com).\r\n\r\n5 Conclusion\r\nThis paper describes a probabilistic model and statistical inference procedure for inferring network structure from incomplete \"co-occurrence\" measurements. Co-occurrences are modelled as samples of a first-order Markov chain subjected to a random permutation. We describe exact and Monte Carlo EM algorithms for calculating maximum likelihood estimates of the Markov chain parameters (initial state distribution and transition matrix), treating the random permutations as hidden variables. Standard results for the EM algorithm guarantee convergence to a local maximum. Although our exact EM algorithm has exponential computational complexity, we provide finite-sample bounds guaranteeing convergence of the Monte Carlo EM variation to a local maximum with high probability and with only polynomial complexity. Our algorithm is easily extended to compute maximum a posteriori estimates, applying a Dirichlet prior to the initial state distribution and to each row of the Markov transition matrix.\r\n\r\n¹ NFκB proteins control genes regulating a broad range of biological processes including innate and adaptive immunity, inflammation and B cell development. 
The NFκB pathway is a collection of paths activated by various environmental stresses and growth factors, and terminating in the production of NFκB.\r\n\r\n[Figure 2: diagram of the inferred network.]\r\n\r\nFigure 2: Inferred topology of the combined SAPK/JNK and NFκB signal transduction pathways. Co-occurrences are obtained from gene expression data via the clustering algorithm described in [2], and then the network is inferred using NICO.\r\n\r\nAcknowledgments\r\nThe authors of this paper would like to thank D. Zhu and A.O. Hero for providing the data and collaborating on the biological network experiment reported in Section 4. This work was supported in part by the Portuguese Foundation for Science and Technology grant POSC/EEA-SRI/61924/2004, the Directorate of National Intelligence, and National Science Foundation grants CCF-0353079 and CCR-0350213.\r\n\r\nReferences\r\n[1] E. Klipp, R. Herwig, A. Kowald, C. Wierling, and H. Lehrach. Systems Biology in Practice: Concepts, Implementation and Application. John Wiley & Sons, 2005.\r\n[2] D. Zhu, A. O. Hero, H. Cheng, R. Khanna, and A. Swaroop. Network constrained clustering for gene microarray data. Bioinformatics, 21(21):4014-4020, 2005.\r\n[3] Y. Liu and H. Zhao. A computational approach for ordering signal transduction pathway components from genomics and proteomics data. BMC Bioinformatics, 5(158), October 2004.\r\n[4] M. G. Rabbat, J. R. Treichler, S. L. Wood, and M. G. Larimore. Understanding the topology of a telephone network via internally-sensed network tomography. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005.\r\n[5] O. Sporns and G. Tononi. Classes of network connectivity and dynamics. Complexity, 7(1):28-38, 2002.\r\n[6] O. Sporns, D. R. Chialvo, M. Kaiser, and C. C. Hilgetag. Organization, development and function of complex brain networks. Trends in Cognitive Science, 8(9), 2004.\r\n[7] M. G. Rabbat, M. A. T. Figueiredo, and R. D. Nowak. Supplement to inferring network structure from co-occurrences. Technical report, University of Wisconsin-Madison, October 2006.\r\n[8] J. Kubica, A. Moore, D. Cohn, and J. Schneider. cGraph: A fast graph-based method for link analysis and queries. In Proc. IJCAI Text-Mining and Link-Analysis Workshop, Acapulco, Mexico, August 2003.\r\n[9] D. Heckerman, D. Geiger, and D. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197-243, 1995.\r\n[10] N. Friedman and D. Koller. Being Bayesian about Bayesian network structure: A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50(1-2):95-125, 2003.\r\n[11] R. A. Boyles. On the convergence of the EM algorithm. J. Royal Statistical Society B, 45(1):47-50, 1983.\r\n[12] C. F. J. Wu. On the convergence properties of the EM algorithm. Ann. of Statistics, 11(1):95-103, 1983.\r\n[13] C. Robert and G. Casella. Monte Carlo Statistical Methods. Springer Verlag, New York, 1999.\r\n[14] B. S. Caffo, W. Jank, and G. L. Jones. Ascent-based Monte Carlo EM. J. Royal Statistical Society B, 67(2):235-252, 2005.", "award": [], "sourceid": 3057, "authors": [{"given_name": "Michael", "family_name": "Rabbat", "institution": null}, {"given_name": "M\u00e1rio", "family_name": "Figueiredo", "institution": null}, {"given_name": "Robert", "family_name": "Nowak", "institution": null}]}