#### Authors

Haidong Wang, Eran Segal, Asa Ben-Hur, Daphne Koller, Douglas Brutlag

#### Abstract

Protein interactions typically arise from a physical interaction of one or more small sites on the surface of the two proteins. Identifying these sites is very important for drug and protein design. In this paper, we propose a computational method based on probabilistic relational model that at- tempts to address this task using high-throughput protein interaction data and a set of short sequence motifs. We learn the model using the EM algorithm, with a branch-and-bound algorithm as an approximate infer- ence for the E-step. Our method searches for motifs whose presence in a pair of interacting proteins can explain their observed interaction. It also tries to determine which motif pairs have high affinity, and can therefore lead to an interaction. We show that our method is more accurate than others at predicting new protein-protein interactions. More importantly, by examining solved structures of protein complexes, we find that 2/3 of the predicted active motifs correspond to actual interaction sites.

1 Introduction

Many cellular functions are carried out through physical interactions between proteins. Discovering the protein interaction map can therefore help to better understand the work- ings of the cell. Indeed, there has been much work recently on developing high-throughput methods to produce a more complete map of protein-protein interactions [1, 2, 3]. Interactions between two proteins arise from physical interactions between small re- gions on the surface of the proteins [4] (see Fig. 2(b)). Finding interaction sites is an important task, which is of particular relevance to drug design. There is currently no high- throughput experimental method to achieve this goal, so computational methods are re- quired. Existing methods either require solving a protein's 3D structure (e.g., [5]), and therefore are computationally very costly and not applicable on a genome-wide scale, or use known interaction sites as training data (e.g., [6]), which are relatively scarce and hence have poor coverage. Other work focuses on refining the highly noisy high-throughput inter- action maps [7, 8, 9], or on assessing the confidence levels of the observed interactions [10]. In this paper, we propose a computational method for predicting protein interactions

                                                                                     P1                          P2                          P5                                                     d                      a          d                                             A                A               A           A                     A                                                          P                          a          d                b          d                          d                                                               5                           P1

A                     A     A                A               A                                                                                    ab               db         ad               dd          bd

B          S B              B          S Bab                 B               S                                                                                                                ab                               ab                                                b                                   ab               db                                                                          c       d                                                  b                    T                           T                           T            P                                                       P4              12                          15                          25                 2                    P                                                   O                           O                           O                                           3       b

(a)                                                                            (b)


Figure 1: (a) Simple illustration of our assumptions for protein-protein interactions. The small elements denote motif occurrences on proteins, with red denoting active and gray denoting inactive motifs. (b) A fragment of our probabilistic model, for the proteins P1, P2, P5. We use yellow to denote an assignment of the value true, and black to denote the value false; full circles denote an assignment observed in the data, and patterned circles an assignment hypothesized by our algorithm. The dependencies involving inactive motif pairs were removed from the graph because they do not affect the rest of the model.

and the sites at which the interactions take place, which uses as input only high-throughput protein-protein interaction data and the protein sequences. In particular, our method as- sumes no knowledge of the 3D protein structure, or of the sites at which binding occurs. Our approach is based on the assumption that interaction sites can be described using a limited repertoire of conserved sequence motifs [11]. This is a reasonable assumption since interaction sites are significantly more conserved than the rest of the protein surface [12]. Given a protein interaction map, our method tries to explain the observed interactions by identifying a set of sites of motif occurrence on every pair of interacting proteins through which the interaction is mediated. To understand the intuition behind our approach, con- sider the example of Fig. 1(a). Here, the interaction pattern of the protein P1 can best be explained using the motif pair a, b, where a appears in P1 and b in the proteins P2, P3, P4 but not in P5. By contrast, the motif pair d, b is not as good an explanation, because d also appears in P5, which has a different interaction pattern. In general, our method aims to identify motif pairs that have high affinity, potentially leading to interaction between protein pairs that contain them. However, a sequence motif might be used for a different purpose, and not give rise to an active binding site; it might also be buried inside the protein, and thus be inaccessible for interaction. Thus, the appearance of an appropriate motif does not always imply interaction. A key feature of our approach is that we allow each motif occurrence in a protein to be either active or inactive. Interactions are then induced only by the interactions of high- affinity active motifs in the two proteins. Thus, in our example, the motif d in p2 is inactive, and hence does not lead to an interaction between p2 and p4, despite the affinity between the motif pair c, d. We note that Deng et al. [8] proposed a somewhat related method for genome-wide analysis of protein interaction data, based on protein domains. However, their method is focused on predicting protein-protein interactions and not on revealing the site of interaction, and they do not allow for the possibility that some domains are inactive. Our goal is thus to identify two components: the affinities between pairs of motifs, and the activity of the occurrences of motifs in different proteins. Our algorithm addresses this problem by using the framework of Bayesian networks [13] and probabilistic relational models [14], which allows us to handle the inherent noise in the protein interaction data and the uncertain relationship between interactions and motif pairs. We construct a model encoding our assumption that protein interactions are induced by the interactions of active motif pairs. We then use the EM algorithm [15], to fill in the details of the model, learning both the motif affinities and activities from the observed data of protein-protein interactions and protein motif occurrences. We address the computational complexity of the E-step in

these large, densely connected models by using an approximate inference procedure based on branch-and-bound. We evaluated our model on protein-protein interactions in yeast and Prosite motifs [11]. As a basic performance measure, we evaluated the ability of our method to predict new protein-protein interactions, showing that it achieves better performance than several other models. In particular, our results validate our assumption that we can explain interactions via the interactions of active sequence motifs. More importantly, we analyze the ability of our method to discover the mechanism by which the interaction occurs. Finally, we examined co-crystallized protein pairs where the 3D structure of the interaction is known, so that we can determine the sites at which the interaction took place. We show that our active motifs are more likely to participate in interactions.

2 The Probabilistic Model

The basic entities in our probabilistic model are the proteins and the set of sequence motifs that can mediate protein interactions. Our model therefore contains a set of protein entities P = {P1, . . . , Pn}, with the motifs that occur in them. Each protein P is associated with the set of motifs that occur in it, denoted by P.M . As we discussed, a key premise of our approach is that a specific occurrence of a sequence motif may or may not be active. Thus, each motif occurrence a P.M is associated with a binary-value variable P.Aa, which takes the value true if Aa is active in protein P and false otherwise. We structure the prior probability P (P.Aa = true) = min{0.8, 3+0.1|P.M| }, to capture our intuition that |P.M | the number of active motifs in a protein is roughly a constant fraction of the total number of motifs in the protein, but that even proteins with few motifs tend to have at least some number of active motifs. A pair of active motifs in two proteins can potentially bind and induce an interaction between the corresponding proteins. Thus, in our model, a pair of proteins interact if each contains an active motif, and this pair of motifs bind to each other. The probability with which two motifs bind to each other is called their affinity. We encode this assumption by including in our model entities Tij corresponding to a pair of proteins Pi, Pj. For each pair of motifs a Pi.M and b Pj.M , we introduce a variable Tij.Aab, which is a deterministic AND of the activity of these two motifs. Intuitively, this variable represents whether the pair of motifs can potentially interact. The probability with which two active motif occurrences bind is their affinity. We model the binding event between two motif occurrences using a variable Tij.Bab, and define: P (Tij.Bab = true | Tij.Aab = true) = ab and P (Tij.Bab = true | Tij.Aab = false) = 0, where ab is the affinity between motifs a and b. This model reflects our assumption that two motif occurrences can bind only if they are both active, but their actual binding probability depends on their affinity. Note that this affinity is a feature of the motif pair and does not depend on the proteins in which they appear. We must also account for interactions that are not explained by our set of motifs, whether because of false positives in the data, or because of inadequacies of our model or of our motif set. Thus, we add a spurious binding variable Tij.S, for cases where an interaction between Pi and Pj exists, but cannot be explained well by our set of active motifs. The probability that a spurious binding occurs is given by P (Tij.S = true) = S. Finally, we observe an interaction between two proteins if and only if some form of binding occurs, whether by a motif pair or a spurious binding. Thus, we define a variable Tij.O, which represents whether protein i was observed to interact with protein j, to be a deterministic OR of all the binding variables Tij.S and Tij.Bab. Overall, Tij.O is a noisy-OR [13] of all motif pair variables Tij.Aab. Note that our model accounts for both types of errors in the protein interaction data. False negatives (missing interactions) in the data are addressed through the fact that the presence of an active motif pair only implies that binding takes place with some probability. False positives (wrong interactions) in the data are addressed through the introduction of

the spurious interaction variables. The full model defines a joint probability distribution over the entire set of attributes:

  P (P.A, T.A, T.B, T.S, T.O) =                                         P (P                                                         i    aP                    i.Aa)                                                                     i .M                                            P (T                     aP                            ij .Aab | Pi.Aa, Pj .Ab)P (Tij .Bab | Tij .Aab)                            i .M,bPj .M             ij    P (Tij.S)P (Tij.O | Tij.B, Tij.S)


where each of these conditional probability distributions is as specified above. We use to denote the entire set of model parameters {a,b}a,b {S}. An instantiation of our probabilistic model is illustrated in Fig. 1(b).

3 Learning the Model

We now turn to the task of learning the model from the data. In a typical setting, we are given as input a protein interaction data set, specifying a set of proteins P and a set of observed interacting pairs T.O. We are also given a set of potentially relevant motifs, and the occurrences of these motifs in the different proteins in P. Thus, all the variables except for the O variables are hidden. Our learning task is thus twofold: we need to infer the values of the hidden variables, both the activity variables P.A, T.A, and the binding variables T.B, T.S; we also need to find a setting of the model parameters , which specify the motif affinities. We use a variant of the EM algorithm [15] to find both an assignment to the parameters , and an assignment to the motif variables P.A, which is a local maximum of the likelihood function P (T.O, P.A | ). Note that, to maximize this objective, we search for a MAP assignment to the motif activity variables, but sum out over the other hidden variables. This design decision is reasonable in our setting, where determining motif activities is an important goal; it is a key assumption for our computational procedure. As in most applications of EM, our main difficulty arises in the E-step, where we need to compute the distribution over the hidden variables given the settings of the observed variables and the current parameter settings. In our model, any two motif variables (both within the same protein and across different proteins) are correlated, as there exists a path of influence between them in the underlying Bayesian network (see Fig. 1(c)). These cor- relations make the task of computing the posterior distribution over the hidden variables intractable, and we must resort to an approximate computation. While we could apply a general purpose approximate inference algorithm such as loopy belief propagation [16], such methods may not converge in densely connected model such as this one, and there are few guarantees on the quality of the results even if they do converge. Fortunately, our model turns out to have additional structure that we can exploit. We now describe an ap- proximate inference algorithm that is tailored to our model, and is guaranteed to converge to a (strong) local maximum. Our first observation is that the only variables that correlate the different protein pairs Tij are the motif variables P.A. Given an assignment to these activity variables, the net- work decomposes into a set of independent subnetworks, one for each protein pair. Based on this observation, we divide our computation of the E-step into two parts. In the first, we find an assignment to the motif variables in each protein, P.A; in the second, we com- pute the posterior probability over the binding motif pair variables T.B, T.S, given the assignment to the motif variables. We begin by describing the second phase. We observe that, as all the motif pair vari- ables, T.A, are fully determined by the motif variables, the only variables left to reason about are the binding variables T.B and T.S. The variables for any pair Tij are inde- pendent of the rest of the model given the instantiation to T.A and the interaction evi- dence. That fact, combined with the noisy-OR form of the interaction, allows us to com- pute the posterior probability required in the E-step exactly and efficiently. Specifically, the computation for the variables associated with a particular protein pair Tij is as fol- lows, where we omit the common prefix Tij to simplify notation. If Aab = false, then

P (Bab = true | Aab = false, O, ) = 0. Otherwise, if Aab = true, then

                                                               P (B           P (B                                                             ab | A, )P (O | Bab = true, A, )                   ab = true | A, O, )                    =                                                                          .                                                                                        P (O | A, )


The first term in the numerator is simply the motif affinity ab; the second term is 1 if O = true and 0 otherwise. The numerator can easily be computed as P (O | A, ) = 1 - (1 - S) (1 - A ab). The computation for P (S) is very similar. a,b =true

We now turn to the first phase, of finding a setting to all of the motif variables. Un- fortunately, as we discussed, the model is highly interconnected, and a finding an optimal joint setting to all of these variables P.A is intractable. We thus approximate finding this joint assignment using a method that exploits our specific structure. Our method iterates over proteins, finding in each iteration the optimal assignment to the motif variables of each protein given the current assignment to the motif activities in the remaining proteins. The process repeats, iterating over proteins, until convergence. As we discussed, the likelihood of each assignment to Pi.A can be easily computed using the method described above. However, the computation for each protein is still ex- ponential in the number of motifs it contains, which can be large (e.g., 15). However, in our specific model, we can apply the following branch-and-bound algorithm (similar to an approach proposed by Henrion [17] for BN2O networks) to find the globally optimal as- signment to the motif variables of each protein. The idea is that we search over the space of possible assignments Pi.A for one that maximizes the objective we wish to maximize. We can show that if making a motif active relative to one assignment does not improve the objective, it will also not improve the objective relative to a large set of other assignments. More precisely, let f (Pi.A) = P (Pi.A, P-i.A|O, ) denote the objective we wish to maximize, where P-i.A is the fixed assignment to motif variables in all proteins except Pi. Let Pi.A-a denote the assignment to all the motif variables in Pi except for Aa. We compute the ratio of f after we switch Pi.Aa from false to true. Let ha(Pj) = (1 - P ab) denote the probability that motif a does not bind with j .Ab =true any active motif in Pj. We can now compute:

                                f (P                                               g                                            i.Aa = true, Pi.A-a)           a(Pi.A-a) =                                                          =                                     f (Pi.Aa = false, Pi.A-a)                        1 - g                                                                    1 - (1 - S)ha(Pj)                                        hb(Pj)                                   h                                                             a=b,Pi.Ab=true                                         a(Pj )                                                                                           (1)                                                                       1 - (1 - S)                                    h                                                                                               a=b,P                        b(Pj )                    1jn                              1jn                                            i .Ab =true                  Tij .O=false                       Tij .O=true


where g is the prior probability for a motif in protein Pi to be active. Now, consider a different point in the search, where our current motif activity assign- ment is Pi.A-a, which has all the active motifs in Pi.A-a and some additional ones. The first two terms in the product of Eq. (1) are the same for a(Pi.A-a) and a(Pi.A-a). For the final term (the large fraction), one can show using some algebraic manipulation that this term in a(Pi.A-a) is lower than that for a(Pi.A-a). We conclude that a(Pi.A-a) a(Pi.A-a), and hence that:

        f (Pi.Aa = true, Pi.A-a)                                         f (P                                                                 1                  i.Aa = true, Pi.A-a)  1.             f (Pi.Aa = false, Pi.A-a)                                        f (Pi.Aa = false, Pi.A- )                                                                                                                 a


It follows that, if switching motif a from inactive to active relative to Pi.A decreases f , it will also decrease f if we have some additional active motifs. We can exploit this property in a branch-and-bound algorithm in order to find the glob- ally optimal assignment Pi.A. Our algorithm keeps a set V of viable candidates for motif assignments. For presentation, we encode assignments via the set of active motifs they contain. Initially, V contains only the empty assignment {}. We start out by considering

motif assignments with a single active motif. We put such an assignment {a} in V if its f -score is higher than f ({}). Now, we consider assignments {a, b} that have two active motifs. We consider {a, b} only if both {a} and {b} are in V . If so, we evaluate its f -score, and add it to V if this score is greater than that of {a} and {b}. Otherwise, we throw it away. We continue this process for all assignments of size k: For each assignment with active motif set S, we test whether S - {a} V for all a S; if we compare f (S) to each f (S - {a}), and add it if it dominates all of them. The algorithm terminates when, from some k, no assignment of size k is saved. To understand the intuition behind this pruning procedure, consider a candidate assign- ment {a, b, c, d}, and assume that {a, b, c} V , but {b, c, d} V . In this case, we must have that {b, c} V , but adding d to that assignment reduces the f -score. In this case, as shown by our analysis, adding d to the superset {a, b, c} would also reduce the f -score. This algorithm is still exponential in worst case. However, in our setting, a protein with many motifs has a low prior probability that each of them is active. Hence, adding new motifs is less likely to increase the f -score, and the algorithm tends to terminate quickly. As we show in Section 4, this algorithm significantly reduces the cost of our procedure. Our E-step finds an assignment to P.A which is a strong local optimum of the ob- jective function max P (P.A | T.O, ): The assignment has higher probability than any assignment that changes any of the motif variables for any single protein. For that assign- ment, our algorithm also computes the distribution over all of the binding variables, as described above. Using this completion, we can now easily compute the (expected) suffi- cient statistics for the different parameters in the model. As each of these parameters is a simple binomial distribution, the maximum likelihood estimation in the M-step is entirely standard; we omit details.