Predicting Protein Folding Pathways

Predicting Protein Folding Pathways     Deb Bardhan , Jingjing Hu , Vinay Nadimpally , Ganesh Ramesh ,   Chris Bystroff , Mohammed J. Zaki ...
Author: Alfred Black
1 downloads 0 Views 422KB Size
Predicting Protein Folding Pathways 







Deb Bardhan , Jingjing Hu , Vinay Nadimpally , Ganesh Ramesh ,   Chris Bystroff , Mohammed J. Zaki





Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY  Department of Biology, Rensselaer Polytechnic Institute, Troy, NY  Department of Computer Science, University at Albany-SUNY, Albany, NY zaki, deb, huj5, nadimv  @cs.rpi.edu, [email protected], [email protected]

Abstract

Keywords PRIMARY: protein structure comparison SECONDARY: protein structure prediction OTHER: protein folding pathway

 This work was supported in part by NSF Career Award IIS-0092978, DOE Career Award DE-FG02-02ER25538, and NSF grant EIA-0103708.

1

1 Introduction Proteins fold spontaneously and reproducibly (on a time scale of milliseconds) into complex three-dimensional (3D) globules when placed in an aqueous solution, and, the sequence of amino acids making up a protein appears to completely determine its three dimensional structure [3, 13]. At least two distinct though interrelated tasks can be stated. 1. Structure Prediction Problem: Given a protein amino acid sequence (i.e., linear structure), determine its three dimensional folded shape (i.e., tertiary structure). 2. Pathway Prediction Problem: Given a protein amino acid sequence and its three dimensional structure, determine the time ordered sequence of folding events, called the folding pathway, that leads from the linear structure to the tertiary structure. The structure prediction problem is widely acknowledged as an open problem, and a lot of research in the past has focussed on it. The pathway prediction problem, on the other hand, has received almost no attention. It is clear that the ability to predict folding pathways can greatly enhance structure prediction methods. In this paper we focus on the pathway prediction problem. Note that while there has been considerable work to understand folding intermediates via molecluar dynamics and experimental techniques, to the best of our knowledge ours is one of the first works to predicts folding pathways. Traditional approaches to protein structure prediction have focused on detection of evolutionary homology [2], fold recognition [4, 18], and where those fail, ab initio simulations [19] that generally perFigure 1: Folding Pathway form a conformational search for the lowest energy state [17]. However, the conformational search space is huge, and, if nature approached the problem using a complete search, a protein would take millions of years to fold, whereas proteins are observed to fold in milliseconds. Thus, a structured folding pathway, i.e., a time ordered sequence of folding events, must play an important role in this conformational search [3]. The nature of these events, whether they are restricted to “native contacts,” i.e., contacts that are retained in the final structure, or whether they might include nonspecific interactions, such as a general collapse in size at the very beginning, were left unanswered. Over time, the two main theories for how proteins fold became known as the “molten globule/hydrophobic collapse” (invoking non-specific interactions) and the “framework/nucleation-condensation” model (restricting pathways to native contacts only). Strong experimental evidence for pathway-based models of protein folding has emerged over the years, for example, experiments revealing the structure of the “unfolded” state in water [14], burst-phase folding intermediates [7], and the kinetic effects of point mutations (“phi-values” [16]). These pathway models indicate that certain events always occur early in the folding process and certain others always occur later (see Figure 1). Currently, there is no strong evidence that specific non-native contacts are required for the folding of any protein [5]. Many simplified models for folding, such as lattice simulations, tacitly assume that non-native contacts are “off pathway” and are not essential to the folding process [11]. Therefore, we choose to encode the assumption of a “native pathway” into our algorithmic approaches. This simplifying assumption allows us to define potential folding pathways based on a known three-dimensional structure. We may further assume that native contacts are formed only once in any given pathway. 2

1.1 “Unfolding” Protein Pathways

Figure 2: Folding Pathway for protein Dihydrofolate Reductase (PDB:4dfr), 159 residues) Knowledge of pathways for proteins can give important insight into the structure of proteins. To make pathway based approaches to structure prediction a reality, plausible protein folding pathways need to be predicted. One approach to enumerate folding pathways is to start with an unfolded protein and consider the various possibilities for the protein to fold. This approach is infeasible due to the explosively large number of possibilites to consider for the pathways. Our novel approach is to start with a folded protein in its final state and learn how to “unfold” the protein in a time-ordered sequence of steps, to its unfolded state. The reversal of such a sequence could be a plausible protein folding pathway. Our contributions stem from this basic approach. In this paper, we explore the role of minimum cuts on weighted graphs in determining a plausible sequence of unfolding steps. As an example, Figure 2 shows the unfolding steps for the protein Dihydrofolate Reductase (PDB code 4dfr). In the top left corner is the full protein. The various groups of SSEs are labeled according to the unfolding timeline. That is, SSEs labeled 5 are the first to unfold, and last to fold, while SSEs labeled 0 are last to unfold and first to fold. Let’s consider how would one begin to untangle the different secondary structural elements? The basic intuition is to break as few contacts as possible, and to avoid splitting a SSE held at both ends. Among several choices, the best option is to pick to one that has the least impact on the remaining part of the protein. Considering various choices, it is clear that all groups of SSEs except 3 and 5 are held at both ends by other SSEs. Group 3 is at the start of the sequence (N-terminus), while group 5 is at the end (C-terminus). It is clear that group 3 is held close to group 4, thus the best group for pulling away is group 5. Note also that 3

group 5 must unfold as a group, since the left strand is sandwiched between strand 4 and the right strand 5. Group 5 is held only at one end to the rest of the protein and can thus be stripped away easily. The remaining protein is shown in the middle of top row in Figure 2. At the next step we have two choices. We can remove either group 3 or group 4. The loop of group 3 is held by group 4 and its other end is held by a strand from group 2. Thus group 4 is in fact the best SSE to pull out (Figure 2, top right). After that group 3 can easily be pulled away (Figure 2, bottom left). Finally we can pull away group 1 ( single strand) and finally group 0 (Figure 2, bottom middle). Note that figures have been reoriented to show the best places to make a cut. From the above unfolding steps, we can conclude that one highly plausible pathway is the one that proceeds with SSE group 0 (    ) initiation site, followed by the other folding events for SSE groups 1 (  ), 2, 3 and 4 (   ), and 5 (anti-parallel   ). Although no one has determined the precise order of appearance of secondary structures for any protein, there is fragmentary evidence that supports this type of pathway for several well-studied proteins, including specifically for protein 4dfr [6, 8, 9].

2 Preliminaries The main goal of this paper is to integrate the protein topology information and contact map information and study the relevance of graph models in predicting plausible protein pathway folding. Specifically, we conjecture that an augmented graph constructed based on protein topology and contact maps can effectively utilize minimum capacity edge cuts to determine the probable sequence of folding events that occur in the protein folding process, by determining the most likely sequences of unfolding events. Protein Contact Maps The 3D conformation of a protein may be compactly represented in a symmetrical, square, boolean matrix of pairwise, inter-residue contacts called the contact map. The contact map of a protein is a particularly useful representation of protein structure. Two amino acids in a protein that come into contact with each other form a non-covalent interaction (hydrogen-bonds, hydrophobic effect, etc.). More formally, we say that two amino acids (or residues)  and  in a protein are in contact if the 3D  distance    is at most some threshold value  (a common value is  ˚ ), where    "! #%$'&(#*)+! , and #$ and #*) are the coordinates of the  -Carbon atoms of amino acids  and  (an alternative convention uses  -carbons for all but the glycines). We define sequence separation as the distance between two amino acids  and  in the amino acid sequence, given as ! , &.-/! . A contact map for a protein with 0 residues is an 02130 binary matrix 4 whose element 456,  -  87 if residues , and - are in contact, and 496,  -  ;: otherwise. Figure 3 shows the contact map for Protein 2igd from the Protein Data Bank (PDB). A contact map provides useful information about the protein’s secondary structure elements (SSEs; namely,  -helices and  -strands), and it also captures nonlocal interactions giving clues to its tertiary structure. For example, clusters Figure 3: 3D structure for protein G (PDB file 2igd, Sequence Length 61), and its Contact Map. Clusters of4contacts indicate secondary structure elements (SSE); the

cluster along the main diagonal is an < -helix, and the clusters parallel and anti-parallel to the diagonal are parallel and anti-parallel = -sheets, respectively.

of contacts represent certain secondary structures:  -Helices appear as bands along the main diagonal since they involve contacts between one amino acid and its four successors;  -Sheets are thick bands parallel or anti-parallel to the main diagonal. Moreover, a contact map is rotation and translation invariant, an important property for data mining. It is also possible to recover the 3D structure from contact maps [21]. Graphs and Minimum Cuts An undirected graph >?A@ CBD is a structure that consists of a set of vertices @EGFIH   H  KJKJKJ HLM and a set of edges B GFN+ O6P  H  ! P  HRQ.@9M , i.e., each edge N+ is an unordered pair of vertices. A weighted graph is a graph where in addition to the vertex set and edge set, there is a weight function S T BVUXW associated with the edge set. For each edge N Q B , SYZNI  is called the weight of the edge NI . A path between two vertices P  H[Q\@ is an ordered set of vertices FIH   H  K]^]^]^ H`_aM such that H  bP , H`_9VH and for every 7dce-gfGh , 6H+  HIji   Q B . Two vertices P  H(Qk@ are said to be connected in > if there exists a path between P and H . A connected component l is a maximal set of vertices lnmG@ , such o H ), P and H are connected in > . A graph is said to be a connected graph that for every P  H.Qgl with (Ppq if rsP  HtQu@ , P and H are connected. Let >YvA@ CBw be a simple undirected weighted graph with a weight function S T BxUyW . Further, let us assume that > is a connected graph. An edge cut 4 , is a set of edges 48m B , which when removed from the graph, partitions the graph into two connected components @  and @  (hence, @ /z @  |{ and @ s} @  ~@ , @  " o { , @  ; o { ). The capacity of the edge cut 4 is defined as €C‚`ƒwSYZN  . The minimum capacity edge cut is the edge cut, which out of all the possible edge cuts in the graph, has the least capacity. The minimum capacity edge cut need not be unique, i.e, for a graph there can be more than one set of edges that form an edge cut of the smallest capacity. The definition of edge cuts can be extended suitably to graphs that are not connected. For the rest of this paper, minimum cuts for a graph will refer to the minimum capacity edge cuts for the graph. Weighted SSE Graph A protein can be represented as a weighted secondary structure element graph (WSG), where the vertices are the SSEs comprising the protein and the edges denote proximity relationshipo between the secondary structures. Furthermore, the edges are weighted by the strength of the interaction between two SSEs. Following the convention used in protein topology or TOPS diagrams [20, 22], we use triangles to represent  -strands, and circles to represent  -helices. To correctly model the secondary structure elements and their interaction, Figure 4: TOPS Diagram the edge construction and their weights are determined from the protein’s con- for Protein 2IGD tact map. The edge weights are determined as follows: we determine the list of SSEs and their sequence positions from the known 3D structure taken from the Protein Data Bank (PDB) 1 . Every SSE is a vertex in the WSG. Let @„bFIH   H  KJKJKJ+ H L M denote a protein with … SSEs. Each SSE H has starting (H ]‡† ) and ending (H` ] N ) sequence positions, where 7ˆc‰Ha ]‡† fŠH ] N‹c[0 , and 0 is the length of the protein. For every pair of SSEs HŒ and HI we first compute the number of contacts between them in the contact C−terminus

1

http://www.rcsb.org/pdb/

5

N−terminus

map, given as

 6H  HI  

Ž AŽ •I € ‘ € ‘ 496,  - 

“’ ŽZ ” j’ AŽ •– ”

An edge exists between two SSEs if there are a positive number of contacts between them, i.e., 3— : , and the weight assigned to the edge is the number of contacts  between the SSEs. This way structures with greater bonding between them get higher weights in the WSG representation. An example WSG for protein 2igd is shown in Figure 4

3 Predicting Folding Pathways The basic intuition behind the unfolding process stems from the belief that unfolding occurs by breaking as few contacts as possible. Given an weighted SSE graph for a protein, a minimum capacity edge cut represents the set of edges that partition the WSG into two components which have the property that the number of contacts between them (and hence the number of bonds) are the smallest. Hence, minimum capacity edge cuts on WSGs can help us determine the points in the protein where unfolding is likely to occur. The problem of determining the minimum capacity edge cuts on weighted graphs is a well studied problem in graph theory. For a comprehensive introduction to the various approaches that exist for solving minimum capacity cuts, the reader is referred to [1]. Several polynomial time algorithms exist for computing the minimum capacity edge cuts and for our purposes, an algorithm based on edge contraction [15], whose source code was freely available, sufficed. An unfolding event according to our model, is a set of edges that form a minimum capacity edge cut in a WSG for a protein. Our method works as follows. First, the minimum capacity edge cut for the initial WSG is determined. Ties are broken arbitrarily. This is taken as the first event in the unfolding process. The edges that form this cut are deleted from the WSG and the process is repeated. Thus, the method produces as output, a sequence of unfolding events, i.e., a sequence of minimum capacity edge cuts for a given WSG. This sequence when reversed produces our prediction for the folding pathway for the given protein.

4 Experimental Results The present work is motivated by two guiding factors in the study of protein folding pathways. 1) The protein folding mechanism is dependant on the topology of the secondary structure of the protein. 2) The second well accepted idea is of the hierarchical nature of protein folding. Our model captures both of these well established facts about protein folding. To establish the utility of our methodology we predict a hierarchical folding pathway for a couple of proteins. For these proteins, the secondary structures of the intermediate stages in the folding pathway are experimentally determined. A very commonly used graph technique, min-cut approach was taken to determine the folding pathway from the graph constructed from the Tops diagram. At each step of application of the min cut algorithm we expect those secondary structure elements who have weak contacts with the rest of the SSEs to separate out , or in other words these would be the ones which would be involved in the initial stages of protein unfolding scenario. The last few steps involve the SSEs which we believe would be involved in the initial stages of protein folding. This is expected because of the high dense contacts present amongst them. BPTI(Bovine pancreatic trypsin inhibitor) a small protein containing 2 alpha helices and two beta strands. The topology cartoon of this protein is shown in the below figure. Also shown is the graphical 6

representation of the protein with the weights representing the no:of contacts between the respective SSEs. [10] It is known that the unfolding pathway of this protein involves the loss of the helix structure followed by the beta structure. Applying the min cut principle to the corresponding graph representation of this structure we observe exactly the same trend. The first cut seperates the first helix structure and the second cut the remaining helix structure. This is analogous to loss of the helix structure during the initial stages of unfolding of the protein. The Beta structures with highest number of contacts unfold only in the later stages. 1DV9 (Beta-lactoglobulin structure) is a fairly large protein containing 10 beta strands and 3 helixes as shown in the figure 5. Beta strands F, G and H are formed immediately once the refolding starts [12], which was thus identified as the folding core of beta lactoglobulin. The sequence of min cuts is enumerated in Table 1 cutno 1 2 3 4 5 6 7 8 9 10 11 12 13 14

edges along which cut is made F 1,4 M F 2,4 M , F 4,3 M F 2,3 M F 1,2 M F 12,13 M F 2,4 M F 11,10 M , F 11,3 M F 7,8 M F 6,7 M F 3,12 M F 9,10 M F 4,5 M F 3,10 M F 8,9 M , F 5,6 M

Table 1: The enumeration of the unfolding stages of the protein 1DV9

11

11

2

1 7

3

8

19

5

17

13 10

9

6 3

0.5

6 19

15 5

3

4

2

C

8

7

12

19

17

13 10

9

12

3

6 19

13

15 5

4

2

C

13

0.5 1

N

1

N

Figure 5: Protein 1DV9: a) Weighted SSE Graph b) WSG after 7th stage of min cut Using our min cut approach, the vertices (8,9,10) correspond to the F,G and H beta strands that remain together till the last stages of the min cut operation as shown in the figure 5, indicating their late involvement in unfolding. Thus the results confirm to the documented literature that these beta strands form the core 7

domain and therefore are involved in the early stages of folding. Its only in the eight stage of our min cut application that this core consisting of the three beta strands is broken and infact forms a separate component at the end of 7th stage as shown in figure 5. The sequence of unfolding events for the 4dfr protein can be expressed as shown in table 2 showing the sequence of minimum capacity cuts on the weighted TOPS graph for this protein. cutno 1 2 3 4 5 6 7 8 9 10 11 12

edges along which cut is made F 14,2 M F 11,12 M , F 13,12 M F 7,4 M , F 7,5 M , F 7,6 M F 1,3 M , F 14,3 M F 11,1 M , F 11,10 M , F 14,1 M , F 13,1 M F 5,4 M , F 6,4 M , F 6,9 M , F 6,8 M F 8,4 M , F 8,10 M F 4,1 M , F 4,9 M , F 4,10 M F 9,1 M , F 9,10 M F 1,10 M F 11,14 M , F 11,13 M F 14,13 M

Table 2: The enumeration of the unfolding stages of the protein 4dfr

5 Conclusions In this paper we developed automated techniques to predict protein folding pathways. We construct a weighted SSE graph for a protein, where each vertex is a SSE, and each edge represents the strength of contacts between two SSEs. We use a repeated mincut approach in the WSG graph to discover strongly inter-related groups of SSEs and we then predict a time order of appearance of SSEs along the folding pathway.

8

References [1] Ravindra K. Ahuja, Thomas L. Magnanti, and James B. Orlin. Network Flows: Theory, Algorithms, and Applications. Prentice Hall, Englewood Cliffs, NJ, 1993. [2] S. Altschul, T. Madden, A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17), 3389-402, 1997. [3] C. Anfinsen and H. Scheraga. Experimental and theoretical aspects of protein folding. Adv. Protein Chemistry, 29, 205-300, 1975. [4] S. Bryant. Evaluation of threading specificity and accuracy. Proteins, 26(2), 172-85, 1996. [5] G. Chikenji and M. Kikuchi. What is the role of non-native intermediates of beta-lactoglobulin in protein folding? Proc, Nat’l Acad. Sci. USA, 97:14273–7, 2000. [6] C. Clementi, P. A. Jennings, and et al. How native-state topology affects the folding of dihydrofolate reductase and interleukin-1beta. Proc. Natl. Acad. Sci. U S A, 97(11):5871–6, 2000. [7] W. Colon and H. Roder. Kinetic intermediates in the formation of the cytochrome c molten globule. Nature Structural Biology, 3(12), 1019-25, 1996. [8] D.K. Heidary, Jr. J. C. O’Neill, and et al. An essential intermediate in the folding of dihydrofolate reductase. Proc. Natl. Acad. Sci. U S A, 97(11):5866–70, 2000. [9] P.A. Jennings, B. E. Finn, and et al. A reexamination of the folding mechanism of dihydrofolate reductase from escherichia coli: verification and refinement of a four-channel model. Biochemistry, 32(14):3783–9, 1993. [10] Steven L. Kazmirski and Valerie Daggett. Simulations of the structural and dynamical properties of denatured proteins: The molten coil state of bovine pancreatic trypsin inhibitor. J. Mol. Biol., 277:487– 506, 1998. [11] D.K. Klimov and D. Thirumalai. Multiple protein folding nuclei and the transition state ensemble in two-state proteins. Proteins, 43:465–75, 2001. [12] Kazuo kuwata, Ramachandra Shastry, Hong Cheng, Masaru Hoshino, Carl A. Bhatt, Yuji Goto, and Heinrich Roder. Structural and kinetic characterization of early folding events of lactoglobnulin. Nature, 8(2):151–155, 2001. [13] C. Levinthal. Are there pathways for protein folding? J. Chem. Phys., 65, 44-45, 1968. [14] Y. Mok, C. Kay, L. Kay, and J. Forman-Kay. Noe data demonstrating a compact unfolded state for an sh3 domain under non-denaturing conditions. J. Mol. Biology, 289(3), 619-38, 1999. [15] H. Nagamochi, T. Ono, and T. Ibaraki. Implementing an efficient minimum capacity cut algorithm, 1994. [16] B. Nolting, R. Golbik, J. Neira, A. Soler-Gonzalez, G. Schreiber, and A. Fersht. The folding pathway of a protein at high resolution from microseconds to seconds. Proc. Natl. Acad. Sci. USA, 94(3), 826-30, 1997. 9

[17] K.T. Simons, C. Kooperberg, E. Huang, and D. Baker. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and bayesian scoring functions. Journal of Molecular Biology, 268(1), 209-25, 1997. [18] M. Sippl. Helmholtz free energy of peptide hydrogen bonds in proteins. J. Mol. Biology, 260(5), 644-8, 1996. [19] J. Skolnick, A. Kolinski, and A. Ortiz. Derivation of protein-specific pair potentials based on weak sequence fragment similarity. Proteins, 38(1), 3-16, 2000. [20] M.J.E. Sternberg and J.M. Thornton. On the conformation of proteins: The handedness of the connection between parallel beta strands. Journal of Molecular Biology, 110:269–283, 1977. [21] M. Vendruscolo, E. Kussell, and E. Domany. Recovery of protein structure from contact maps. Folding & Design, 2(5), 295-306, September 1997. [22] D. R. Westhead, T. W. F. Slidel, T. P. J. Flores, and J. M. Thornton. Protein structural topology: automated analysis, diagrammatic representation and database searching. Protein Science, 8:897–904, 1999.

10

Suggest Documents