{"title": "Approximate Inference and Protein-Folding", "book": "Advances in Neural Information Processing Systems", "page_first": 1481, "page_last": 1488, "abstract": null, "full_text": "Approximate Inference and Protein-Folding\n\nChen Yanover and Yair Weiss\n\nSchool of Computer Science and Engineering\n\nThe Hebrew University of Jerusalem\n\n91904 Jerusalem, Israel\n\n{cheny,yweiss}@cs.huji.ac.il\n\nAbstract\n\nSide-chain prediction is an important subtask in the protein-folding problem. We show that finding a minimal energy side-chain configuration is equivalent to performing inference in an undirected graphical model. The graphical model is relatively sparse yet has many cycles. We used this equivalence to assess the performance of approximate inference algorithms in a real-world setting. Specifically, we compared belief propagation (BP), generalized BP (GBP) and naive mean field (MF).\n\nIn cases where exact inference was possible, max-product BP always found the global minimum of the energy (except in a few cases where it failed to converge), while other approximation algorithms of similar complexity did not. In the full protein data set, max-product BP always found a lower energy configuration than the other algorithms, including a widely used protein-folding software (SCWRL).\n\n1 Introduction\n\nIn general, exact inference in graphical models scales exponentially with the number of variables. Since many real-world applications involve hundreds of variables, it has been impossible to utilize the powerful mechanism of probabilistic inference in these applications. Despite the significant progress achieved in approximate inference, some practical questions still remain open: it is not yet known which algorithm to use for a given problem, nor is it understood what the advantages and disadvantages of each technique are. 
We address these questions in the context of a real-world protein-folding application: the side-chain prediction problem.\n\nPredicting side-chain conformation given the backbone structure is a central problem in protein-folding and molecular design. It arises both in ab-initio protein-folding (which can be divided into two sequential tasks: the generation of native-like backbone folds and the positioning of the side-chains upon these backbones [6]) and in homology modeling schemes (where the backbone and some side-chains are assumed to be conserved among the homologs but the configuration of the rest of the side-chains needs to be found).\n\nFigure 1: Cow actin binding protein (PDB code 1pne, top) and closer view of its 6 carboxyl-terminal residues (bottom-left). Given the protein backbone (black) and amino acid sequence, the native side-chain conformation (gray) is searched for. The problem representation as a graphical model for those carboxyl-terminal residues is shown in the bottom-right figure (nodes located at C\u03b1 atom positions, edges drawn in black).\n\nIn this paper, we show the equivalence between side-chain prediction and inference in an undirected graphical model. We compare the performance of BP, generalized BP and naive mean field on this problem, and also compare them to a widely used protein-folding program called SCWRL.\n\n2 The side-chain prediction problem\n\nProteins are chains of simpler molecules called amino acids. All amino acids have a common structure: a central carbon atom (C\u03b1) to which a hydrogen atom, an amino group (NH2) and a carboxyl group (COOH) are bonded. In addition, each amino acid has a chemical group called the side-chain, bound to C\u03b1. This group distinguishes one amino acid from another and gives it its distinctive properties. Amino acids are joined end to end during protein synthesis by the formation of peptide bonds. 
An amino acid unit in a protein is called a residue. The formation of a succession of peptide bonds generates the backbone (consisting of C\u03b1 and its adjacent atoms, N and CO, of each residue), upon which the side-chains are hung (Figure 1).\n\nWe seek to predict the configuration of all the side-chains relative to the backbone. The standard approach to this problem is to define an energy function and use the configuration that achieves the global minimum of the energy as the prediction.\n\n2.1 The energy function\n\nWe adopted the van der Waals energy function, used by SCWRL [3], which approximates the repulsive portion of the Lennard-Jones 12-6 potential. For a pair of atoms, a and b, the energy of interaction is given by:\n\n$$E(a, b) = \\begin{cases} 0 & d > R_0 \\\\ -k_2 \\frac{d}{R_0} + k_2 & R_0 \\ge d \\ge k_1 R_0 \\\\ E_{max} & k_1 R_0 > d \\end{cases} \\qquad (1)$$\n\nwhere $E_{max} = 10$, $k_1 = 0.8254$ and $k_2 = \\frac{E_{max}}{1 - k_1}$; d denotes the distance between a and b, and $R_0$ is the sum of their radii. Constant radii were used for the protein's atoms (Carbon: 1.6 \u00c5, Nitrogen and Oxygen: 1.3 \u00c5, Sulfur: 1.7 \u00c5). For two sets of atoms, the interaction energy is the sum of the pairwise atom interactions. The energy surface of a typical protein in the data set has dozens to thousands of local minima.\n\n2.2 Rotamers\n\nThe configuration of a single side-chain is represented by at most 4 dihedral angles (denoted $\\chi_1$, $\\chi_2$, $\\chi_3$ and $\\chi_4$). Any assignment of $\\chi$ angles for all the residues defines a protein configuration. Thus the energy minimization problem is a highly nonlinear continuous optimization problem.\n\nIt turns out, however, that side-chains have a small repertoire of energetically preferred conformations, called rotamers. Statistical analysis of those conformations in well-determined protein structures produces a rotamer library. We used a backbone-dependent rotamer library (by Dunbrack and Karplus, July 2001 version). 
Given the coordinates of the backbone atoms, its dihedral angles $\\phi$ (defined, for the $i$th residue, by $C_{i-1} - N_i - C^\\alpha_i - C_i$) and $\\psi$ (defined by $N_i - C^\\alpha_i - C_i - N_{i+1}$) were calculated. The library then gives the typical rotamers for each side-chain and their prior probabilities.\n\nBy using the library we convert the continuous optimization problem into a discrete one. The number of discrete variables is equal to the number of residues, and the number of possible values each variable can take lies between 2 and 81.\n\n2.3 Graphical model\n\nSince we have a discrete optimization problem and the energy function is a sum of pairwise interactions, we can transform the problem into a graphical model with pairwise potentials. Each node corresponds to a residue, and the state of each node represents the configuration of the side chain of that residue. Denoting by $\\{r_i\\}$ an assignment of rotamers for all the residues, we have:\n\n$$P(\\{r_i\\}) = \\frac{1}{Z} e^{-\\frac{1}{T} E(\\{r_i\\})} = \\frac{1}{Z} e^{-\\frac{1}{T} \\sum_{ij} E(r_i) + E(r_i, r_j)} = \\frac{1}{Z} \\prod_i \\Psi_i(r_i) \\prod_{i,j} \\Psi_{ij}(r_i, r_j) \\qquad (2)$$\n\nwhere Z is an explicit normalization factor and T is the system \"temperature\" (used as a free parameter). The local potential $\\Psi_i(r_i)$ takes into account the prior probability of the rotamer $P_i(r_i)$ (taken from the rotamer library) and the energy of the interactions between that rotamer and the backbone:\n\n$$\\Psi_i(r_i) = P_i(r_i) e^{-\\frac{1}{T} E(r_i, backbone)} \\qquad (3)$$\n\nEquation 2 requires multiplying $\\Psi_{ij}$ for all pairs of residues i, j, but note that equation 1 gives zero energy for atoms that are sufficiently far away. Thus we only need to calculate the pairwise interactions for nearby residues. To define the topology of the undirected graph, we examine all pairs of residues i, j and check whether there exists an assignment $r_i$, $r_j$ for which the energy is nonzero. If it exists, we connect nodes i and j in the graph and set the potential to be:\n\n$$\\Psi_{ij}(r_i, r_j) = e^{-\\frac{1}{T} E(r_i, r_j)} \\qquad (4)$$\n\nFigure 1 shows a subgraph of the undirected graph. 
The graph is relatively sparse (each node is connected to nodes that are close in 3D space) but contains many small loops. A typical protein in the data set gives rise to a model with hundreds of loops of size 3.\n\n3 Experiments\n\nWhen the protein was small enough we used the max-junction tree algorithm [1] to find the most likely configuration of the variables (and hence the global minimum of the energy function). Murphy's implementation of the JT algorithm in his BN toolbox for Matlab was used [10].\n\nThe approximate inference algorithms we tested were loopy belief propagation (BP), generalized BP (GBP) and naive mean field (MF).\n\nBP is an exact and efficient local message passing algorithm for inference in singly connected graphs [15]. Its essential idea is replacing the exponential enumeration (either summation or maximization) over the unobserved nodes with a series of local enumerations (a process called \"elimination\" or \"peeling\"). Loopy BP, that is, applying BP to multiply connected graphical models, may not converge due to circulation of messages through the loops [12]. However, many groups have recently reported excellent results using loopy BP as an approximate inference algorithm [4, 11, 5]. We used an asynchronous update schedule and ran for 50 iterations or until numerical convergence.\n\nGBP is a class of approximate inference algorithms that trade complexity for accuracy [15]. A subset of GBP algorithms is equivalent to forming a graph from clusters of nodes and edges in the original graph and then running ordinary BP on the cluster graph. We used two large clusters. Both clusters contained all nodes in the graph but each cluster contained only a subset of the edges. The first cluster contained all edges between pairs of residues whose indices differ by less than a constant k (typically, 6). 
All other edges were included in the second cluster. It can be shown that the cluster graph BP messages can be computed efficiently using the JT algorithm. Thus this approximation tries to capture dependencies between a large number of nodes in the original graph while maintaining computational feasibility.\n\nThe naive MF approximation tries to approximate the joint distribution in equation 2 as a product of independent marginals $q_i(r_i)$. The marginals $q_i(r_i)$ can be found by iterating:\n\n$$q_i(r_i) \\leftarrow \\alpha \\Psi_i(r_i) \\exp \\left( \\sum_{j \\in N_i} \\sum_{r_j} q_j(r_j) \\log \\Psi_{ij}(r_i, r_j) \\right) \\qquad (5)$$\n\nwhere $\\alpha$ denotes a normalization constant and $N_i$ denotes the set of all nodes neighboring i. We initialized $q_i(r_i)$ to $\\Psi_i(r_i)$ and chose a random update ordering for the nodes. For each protein we repeated this minimization 10 times (each time with a different update order) and chose the local minimum that gave the lowest energy.\n\nIn addition to the approximate inference algorithms described above, we also compared the results to two approaches in use in side-chain prediction: the SCWRL and DEE algorithms. The Side-Chain placement With a Rotamer Library (SCWRL) algorithm is considered one of the leading algorithms for predicting side-chain conformations [3]. It uses the energy function described above (equation 1) and a heuristic search strategy to find a minimal energy conformation in a discrete conformational space (defined using a rotamer library).\n\nDead end elimination (DEE) is a search algorithm that tries to reduce the search space until it becomes suitable for an exhaustive search. It is based on a simple condition that identifies rotamers that cannot be members of the global minimum energy conformation [2]. If enough rotamers can be eliminated, the global minimum energy conformation can be found by an exhaustive search of the remaining rotamers. 
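The dead-end elimination idea described above can be illustrated with a minimal Python sketch. It assumes the classic elimination criterion (a rotamer is discarded when its best-case energy is still higher than the worst-case energy of some alternative rotamer at the same residue); the text does not spell out which condition from [2] was used, so this choice, together with the function names dead_end_elimination and pair_block and the dictionary-based energy tables, is an assumption made for illustration only.

```python
import numpy as np


def pair_block(pair_energy, i, j, n_i, n_j):
    # Interaction table of shape (n_i, n_j) for residues i and j.
    # Pairs missing from the dictionary are treated as non-interacting.
    if (i, j) in pair_energy:
        return np.asarray(pair_energy[(i, j)], dtype=float)
    if (j, i) in pair_energy:
        return np.asarray(pair_energy[(j, i)], dtype=float).T
    return np.zeros((n_i, n_j))


def dead_end_elimination(self_energy, pair_energy):
    # self_energy[i][r]: energy of rotamer r of residue i (e.g. with the backbone).
    # pair_energy[(i, j)][r, s]: energy between rotamer r of i and rotamer s of j.
    # Returns one boolean mask per residue marking the surviving rotamers.
    self_energy = [np.asarray(e, dtype=float) for e in self_energy]
    alive = [np.ones(len(e), dtype=bool) for e in self_energy]
    changed = True
    while changed:
        changed = False
        for i, e_i in enumerate(self_energy):
            best = e_i.copy()   # lowest energy each rotamer of i can contribute
            worst = e_i.copy()  # highest energy each rotamer of i can contribute
            for j, e_j in enumerate(self_energy):
                if j == i:
                    continue
                block = pair_block(pair_energy, i, j, len(e_i), len(e_j))
                block = block[:, alive[j]]  # only surviving rotamers of j
                best += block.min(axis=1)
                worst += block.max(axis=1)
            # A rotamer is dead-ending if even its best case exceeds the worst
            # case of some surviving alternative rotamer at the same residue.
            threshold = worst[alive[i]].min()
            dead = alive[i] & (best > threshold)
            if dead.any():
                alive[i] &= ~dead
                changed = True
    return alive
```

The surviving rotamers define the reduced search space; an exhaustive search over them is practical only when enough rotamers have been removed. At least one rotamer per residue always survives, because the rotamer attaining the threshold can never exceed it.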
\n\nThe various inference algorithms were tested on a set of 325 X-ray crystal structures with resolution better than or equal to 2 \u00c5, R factor below 20% and length up to 300 residues. One representative structure was selected from each cluster of homologous structures (50% homology or more). Protein structures were acquired from the Protein Data Bank site (http://www.rcsb.org/pdb).\n\nMany proteins contain Cysteine residues which tend to form strong disulfide bonds with each other. A standard technique in side-chain prediction (used e.g. in SCWRL) is to first search for possible disulfide bonds and, if they exist, to freeze these residues in that configuration. This essentially reduces the search space. We repeated our experiments with and without freezing the Cysteine residues.\n\nSide-chain to backbone interaction seems to be much more severe than side-chain to side-chain interaction: the backbone is more rigid than the side-chains and its structure is assumed to be known. Therefore, the parameter R was introduced into the pairwise potential equation, as follows:\n\n$$\\Psi_{ij}(r_i, r_j) = \\left( e^{-\\frac{1}{T} E(r_i, r_j)} \\right)^{\\frac{1}{R}} \\qquad (6)$$\n\nUsing R > 1 assigns an increased weight for side-chain to backbone interactions over side-chain to side-chain interactions. We repeated our experiments both with R = 1 and R > 1. It is worth mentioning that SCWRL implicitly adopts a weighting assumption that assigns an increased weight to side-chain to backbone interactions.\n\n4 Results\n\nIn our first set of experiments we wanted to compare approximate inference to exact inference. In order to make exact inference possible we restricted the possible rotamers of each residue. Out of the 81 possible states we chose a subset that accounted for 90% of the local probability. We constrained the size of the subset to be at least 2. 
The resulting graphical model retains only a small fraction of the loops occurring in the full graphical model (about 7% of the loops of size 3). However, it still contains many small loops, and in particular, dozens of loops of size 3.\n\nOn these graphs we found that ordinary max-product BP always found the global minimum of the energy function (except in a few cases where it failed to converge).\n\n[Figure: two plot panels with vertical axes ticked from 0 to 80; one horizontal axis is labeled E(SCWRL) - E(Max-product BP).]\n\n