From sequence to structure: Protein Structure Prediction. Outline. Protein and protein structure. Jun-tao Guo. Introduction to protein structure

Author: Edgar Shields
From sequence to structure: Protein Structure Prediction Jun-tao Guo Computational Systems Biology Laboratory Department of Biochemistry and Molecular Biology University of Georgia August 16, 2004

Outline • Introduction to protein structure • Protein structure prediction --Prediction methods (Homology Modeling, Threading, ab initio) --Protein structure prediction sub-problems --Energy functions --Algorithms --Assessment --CASP/CAFASP

Protein and protein structure >102M:_ MYOGLOBIN (154 AA) MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHL KTEAEMKASEDLKKAGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKI PIKYLEFISEAIIHVLHSRHPGNFGADAQGAMNKALELFRKDIAAKYKEL GYQG


Amino acids

Name            3-letter  1-letter
Alanine         Ala       A
Arginine        Arg       R
Asparagine      Asn       N
Aspartic acid   Asp       D
Cysteine        Cys       C
Glutamine       Gln       Q
Glutamic acid   Glu       E
Glycine         Gly       G
Histidine       His       H
Isoleucine      Ile       I
Leucine         Leu       L
Lysine          Lys       K
Methionine      Met       M
Phenylalanine   Phe       F
Proline         Pro       P
Serine          Ser       S
Threonine       Thr       T
Tryptophan      Trp       W
Tyrosine        Tyr       Y
Valine          Val       V

Levels of protein structure ---primary structure (amino acid sequence) ---secondary structure (helices, strands, coils/loops) ---tertiary structure (3D packing of secondary structures)

---quaternary structure (spatial arrangements of multiple chains)

Secondary structures
• It is a local description of structure
• Two common secondary structures: α-helix and β-strand
[Figure: α-helix; β-strands in parallel and anti-parallel arrangement; coil or loop]


Protein structure prediction Problem: Given the amino acid sequence of a protein, what’s its 3-dimensional shape (geometrical conformation)? MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHL KTEAEMKASEDLKKAGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKI PIKYLEFISEAIIHVLHSRHPGNFGADAQGAMNKALELFRKDIAAKYKEL GYQG

?

Sub-problems --Disordered region prediction --Secondary structure prediction --Transmembrane helices prediction --Solvent accessibility prediction --Residue-residue contact prediction

Why protein structure prediction 1. Importance of protein structure --Knowledge of the structure of a protein enables us to understand its function better --Design better mutagenesis experiments --Structure-based rational drug design 2. Current methods for protein structure determination --x-ray crystallography --nuclear magnetic resonance (NMR)

Both are time-consuming and very expensive!


Why protein structure prediction 3. Big gap between the number of protein sequences and the number of protein structures. (As of June 15, 2004) --Swiss-prot, 153,320 protein sequences --PDB (Protein Data Bank), 25,960 protein structures

4. Protein 3D structure is more highly conserved than the primary sequence. --Physical forces favor certain structures --The number of folds is limited (TIM barrel has 26 superfamilies)

5. Fundamental unsolved problem.

PDB growth in new folds

Old folds are shown in orange, new folds are in light blue. Data from http://www.rcsb.org

Protein structure classification (SCOP) Family: Clear evolutionary relationship; proteins in the same family are homologous, sequence identity >=30%. Superfamily: Low sequence identity, probable common evolutionary origin. Fold: May not have a common evolutionary origin. Major structural similarity. http://scop.mrc-lmb.cam.ac.uk/scop/ SCOP 1.65 release: 2327 families, 1294 superfamilies, 800 folds

Other protein classification databases: CATH, FSSP


Protein structure classification (CATH)
[Figure: CATH hierarchy levels: Class, Architecture, Topology]
http://www.biochem.ucl.ac.uk/bsm/cath/cath_info.html

Computational Methods for Protein Structure Prediction

• homology/comparative modeling --similar sequence → similar structure --practically very useful, need homologues

• fold recognition/threading --many proteins share the same structural fold --a folding problem becomes a fold recognition problem

• ab initio --use first principles to fold proteins --does not require templates --high computational complexity

Homology modeling Observation: proteins with similar sequences tend to fold into similar structures. --The target sequence is aligned with the sequence of a known structure (the template); the two usually share sequence identity of 30% or more. --Superimpose the target sequence onto the template, replacing equivalent sidechain atoms where necessary. --Refine the model by minimizing an energy function.

Programs: Modeller, SwissModel
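A minimal sketch of the template check implied by the 30% rule of thumb: percent identity computed over the aligned, ungapped columns of a target/template alignment. The function name and the choice to ignore gapped columns are assumptions for illustration (other identity definitions divide by the alignment length or the shorter sequence length), and this is not how Modeller or SwissModel select templates.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gapped columns are skipped (an assumption, see lead-in)
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy pre-aligned pair; a real target/template alignment would be much longer.
target = "FDSK-THRGHR"
template = "FESYWTH-GHR"
identity = percent_identity(target, template)
usable_template = identity >= 30.0  # the 30% rule of thumb from the slide
```

Here 7 of the 9 ungapped columns match, so the pair would pass the 30% threshold.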


Homology modeling • Homology modelling is more reliable than other methods. • But, you can’t always find similar sequences of known structure.

From Baker, D, Sali, A. (2001). Science 294, 93-6

ab initio structure prediction
An energy function to describe the protein
o bond energy
o bond angle energy
o dihedral angle energy
o van der Waals energy
o electrostatic energy

Efficient and reliable algorithms to search the conformational space to minimize the function and obtain the structure.
Not practical in general
o Computationally too expensive
o Accuracy is poor

ab initio structure prediction Three main methods: --simulation: start from an extended form and fold into native/native-like structure.

--screening: generate many structures, then find the one that is best. --fragment assembly: combine local structure prediction, simulation and screening. ROSETTA(David Baker’s group)


ROSETTA • Use a reduced protein representation. • Construct a library of small structure fragments, e.g. 6 or 9 AA. • Cut the target sequence into sequence fragments. For each sequence fragment, choose some candidate fragments from the fragment library. • Assemble the fragments by Monte Carlo simulation. • The generated structures are grouped into clusters. • Clusters are ranked by their energy.
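The Monte Carlo assembly step can be illustrated with a generic Metropolis search. This is only a sketch under stated assumptions: the "conformation" is a toy tuple of integers, the energy and move functions are hypothetical stand-ins, and nothing here reproduces ROSETTA's actual representation, fragment moves, or scoring.

```python
import math
import random


def metropolis_accept(delta_e, temperature, rng):
    """Accept downhill moves always; uphill moves with probability exp(-dE/T)."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def monte_carlo_search(energy, propose, start, steps, temperature, seed=0):
    """Generic Metropolis search that remembers the lowest-energy state seen."""
    rng = random.Random(seed)
    current, e_current = start, energy(start)
    best, e_best = current, e_current
    for _ in range(steps):
        candidate = propose(current, rng)
        e_candidate = energy(candidate)
        if metropolis_accept(e_candidate - e_current, temperature, rng):
            current, e_current = candidate, e_candidate
            if e_current < e_best:
                best, e_best = current, e_current
    return best, e_best


# Hypothetical toy problem: each slot stands for one "fragment choice",
# and the energy is lowest when every slot equals 3.
def toy_energy(state):
    return sum((x - 3) ** 2 for x in state)


def toy_move(state, rng):
    """Change one randomly chosen slot by +1 or -1 (a stand-in for a fragment swap)."""
    i = rng.randrange(len(state))
    new_state = list(state)
    new_state[i] += rng.choice((-1, 1))
    return tuple(new_state)


start = (0, 0, 0, 0, 0)
best, e_best = monte_carlo_search(toy_energy, toy_move, start,
                                  steps=2000, temperature=1.0)
```

Running many independent searches with different seeds and then clustering the resulting low-energy states mirrors the last two bullets of the slide.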

Protein threading
• Make a (backbone) structure prediction by finding an optimal placement (threading) of a protein sequence onto each known structure (structural template)
• “placement” quality is measured by statistics-based energy functions
• the best overall “placement” among all templates may give a structure prediction -- also depending on additional criteria

Query sequence MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE

templates

……

Protein threading Four key components of threading: --template library --energy functions --threading algorithms --confidence assessment


Template library
• Non-redundant representatives through structure-structure and/or sequence-sequence comparison
FSSP (http://www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html)
SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/)
PDB-Select (http://www.sander.embl-heidelberg.de/pdbsel/)
Pisces (http://www.fccc.edu/research/labs/dunbrack/pisces/)

Threading templates
--Residue type
--Structural core information
--Secondary structure type
--Solvent accessibility
--Coordinates for Cα / Cβ / centroid
--Sequence profile

RES  1  G  156  S   23   10.528  -13.223    9.932   11.977  -12.741   10.115
RES  5  P  157  H  110   12.622  -17.353   10.577   12.981  -16.146   11.485
RES  5  G  158  H   61   17.186  -15.086    9.205   16.601  -15.457   10.578
RES  5  Y  159  H   91   16.174  -10.939   12.208   16.612  -12.343   12.727
RES  5  C  160  H    8   12.670  -12.752   15.349   14.163  -13.137   15.545
RES  1  G  161  S   14   15.263  -17.741   14.529   15.022  -16.815   15.733

Energy functions
Query: MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
--pairwise: how preferable to put two particular residues nearby: E_p
--gap: alignment gap penalty: E_g
--singleton: how well a residue fits a structural environment: E_s
--mutation: sequence similarity between query and template proteins: E_c

total score: w1E_p + w2E_s + w3E_c + w4E_g + w5E_m

Find a sequence-structure alignment to optimize this function


Mutation energy
A measurement of similarity between amino acids

    FDSK-THRGHR
    :.:  :: :::
    FESYWTH-GHR

The gap in the upper sequence is labelled "insertion" and the gap in the lower sequence "deletion".
Match (‘:’) Similar (‘.’) Mismatch (‘ ’) Insertion/deletion (gap)

Substitution Matrices -- amino acid substitution matrices account for the probability of one amino acid being substituted for another: frequency of substitution - genetic code tolerance for changes - natural selection

--empirically derived from observed amino acid substitutions that occur between aligned residues in homologous sequences --the matrix penalizes residue pairs that have a low probability of mutation in evolution and rewards pairs with a high probability

Substitution Matrices
• Two popular sets of matrices for protein sequences
1. PAM (Percent Accepted Mutations) The first substitution matrix, introduced by Dayhoff et al., 1978.
2. BLOSUM (BLOcks SUbstitution Matrix) Henikoff & Henikoff, 1992


PAM • It measures how often different amino acids replace other amino acids in evolution. • It is based on a database of 1,572 changes in 71 groups of closely related proteins. • There is a family of PAM matrices: PAM-10, ..., PAM-250, ...; these matrices are extrapolated from the PAM-1 matrix by matrix multiplication. • PAM distance: two sequences are defined to have diverged by one PAM unit if they show on average one accepted point mutation (i.e. one amino acid change) per hundred amino acids. PAM250: 250 accepted mutations per 100 amino acids
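The extrapolation by matrix multiplication can be sketched as raising a PAM-1 style transition matrix to the n-th power. The two-letter alphabet and its probabilities below are hypothetical toy values, not Dayhoff's data; the published PAM scoring matrices are additionally converted to log-odds scores, which this sketch leaves out.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Hypothetical 2-letter "PAM-1": row i gives the probability that letter i
# stays the same or is replaced after one unit of evolutionary distance.
pam1_toy = [[0.99, 0.01],
            [0.02, 0.98]]

pam250_toy = mat_pow(pam1_toy, 250)
```

Each row of the result still sums to 1, but the diagonal is much smaller than in the one-step matrix, which is why high-numbered PAM matrices suit distantly related sequences.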

PAM250 Matrix

BLOSUM Matrices • Similar idea to PAM matrices • Blocks: highly conserved regions in a set of aligned protein sequences (local multiple alignment) • Number of BLOSUM matrix (e.g. BLOSUM 62) indicates the cutoff of percent identity that defines the clusters. Lower cutoffs allow more diverse sequences. -BLOSUM 62: 62% identical -BLOSUM 50: 50% identical


BLOSUM 62

Which matrix to use? • Close homolog: high cutoffs for BLOSUM (up to BLOSUM 90) or lower PAM values. BLAST default: BLOSUM 62

• Remote homolog: lower cutoffs for BLOSUM (down to BLOSUM 10) or high PAM values (PAM 200 or PAM 250) . A threading best performer: PAM 250

Position-dependent sequence Profiles
Using evolutionary history of a protein family

O15045  SAANLEYLKNVLLQFIFLKPG--SERERLLPVINTMLQLSPEEKGKLAAV
P34562  NEKNMEYLKNVFVQFLKPESVP-AERDQLVIVLQRVLHLSPKEVEILKAA
Q06704  KNEKIAYIKNVLLGFLEHKE----QRNQLLPVISMLLQLDSTDEKRLVMS
Q92805  REINFEYLKHVVLKFMSCRES---EAFHLIKAVSVLLNFSQEEENMLKET
O42657  MLIDKEYTRNILFQFLEQRD----RRPEIVNLLSILLDLSEEQKQKLLSV
O70365  EPTEFEYLRKVMFEYMMGR-----ETKTMAKVITTVLKFPDDQAQKILER
Q21071  DPAEAEYLRNVLYRYMTNRESLGKESVTLARVIGTVARFDESQMKNVISS
Q18013  STSEIDYLRNIFTQFLHSMGSPNAASKAILKAMGSVLKVPMAEMKIIDKK
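A rough sketch of how a position-dependent profile can be derived from such an alignment: per-column residue frequencies. The input is columns 6 to 11 of the eight aligned sequences shown on the slide; the function name is an assumption, and practical profiles (for example PSI-BLAST style PSSMs) also apply sequence weighting, pseudocounts and log-odds scaling, all omitted here.

```python
from collections import Counter


def column_frequencies(msa, gap="-"):
    """Return one {residue: frequency} dict per alignment column, ignoring gaps."""
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in msa if seq[col] != gap)
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


# Columns 6-11 of the eight sequences in the alignment above.
msa_slice = ["EYLKNV", "EYLKNV", "AYIKNV", "EYLKHV",
             "EYTRNI", "EYLRKV", "EYLRNV", "DYLRNI"]

profile = column_frequencies(msa_slice)
```

The second column is fully conserved (Y in every sequence), while the first column is mostly E with single A and D substitutions; a profile scores a query residue against these position-specific preferences rather than against one fixed substitution matrix.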


Gap penalty function

    FDSK---THRGHR
    :.:    :: :::
    FESYWSCTH-GHR

The gap in the upper sequence is labelled "insertion" and the gap in the lower sequence "deletion".
Gaps are depicted with ‘-’
Corresponding to insertion/deletion in evolution
Gap penalty function, w(k): cost of a gap of length k
Linear gap penalty function vs affine gap penalty function

Linear gap penalty function w(k) = gk, Where g is a constant. --Easy to implement in algorithms --Satisfactory performance in alignment

But...
• a gap of length k is more probable than k gaps of length 1
  -a big gap may be due to a single mutational event that inserted/deleted a stretch of characters
  -separated gaps are probably due to distinct mutational events
• a linear gap penalty function treats these cases the same.

Affine gap penalty function w(k) = h + gk, k ≥ 1, w(0) = 0; Where h and g are constants. h: opening gap penalty g: extension gap penalty

    FDSK---THRGHR
    :.:    :: :::
    FESYWTCTH-GHR

    FDSK-T--HRGHR
    :.:  :    :::
    FESYWTCTH-GHR
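The difference between the two penalty functions can be checked on the two alignments above, which contain the same number of gap characters but distribute them differently (gap lengths 3 and 1 versus 1, 2 and 1). The values h = 10 and g = 1 are arbitrary illustrative choices.

```python
def linear_gap(k, g):
    """Linear penalty: w(k) = g * k."""
    return g * k


def affine_gap(k, h, g):
    """Affine penalty: w(k) = h + g * k for k >= 1, and w(0) = 0."""
    return 0 if k == 0 else h + g * k


gaps_first = [3, 1]       # FDSK---THRGHR / FESYWTCTH-GHR
gaps_second = [1, 2, 1]   # FDSK-T--HRGHR / FESYWTCTH-GHR

h, g = 10, 1  # hypothetical opening and extension penalties
linear_first = sum(linear_gap(k, g) for k in gaps_first)
linear_second = sum(linear_gap(k, g) for k in gaps_second)
affine_first = sum(affine_gap(k, h, g) for k in gaps_first)
affine_second = sum(affine_gap(k, h, g) for k in gaps_second)
```

The linear function charges both alignments 4, while the affine function charges 24 for the first and 34 for the second, so it prefers fewer, longer gaps.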


singleton A singleton energy measures each residue’s preference for a specific structural environment. --secondary structure --solvent accessibility

Compare actual occurrence against its “expected value” by chance (Kim D. et al., 2003)
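The slide's own equation is not preserved in this text, so the following is only a sketch of the general observed-versus-expected idea, assuming a negative log-odds form in which the expected count treats residue type and environment as independent. The exact formula of Kim et al. may differ, and all counts below are hypothetical.

```python
import math


def singleton_energy(n_obs, n_residue, n_env, n_total):
    """-ln(observed / expected) for one residue type in one structural environment.

    Assumed form: expected = n_residue * n_env / n_total, i.e. the count that
    would arise by chance if residue type and environment were independent.
    """
    expected = n_residue * n_env / n_total
    return -math.log(n_obs / expected)


# Hypothetical counts: 120 observations of a residue in a buried helix,
# 1000 occurrences of that residue overall, 3000 buried-helix positions,
# 50000 positions in the whole data set.
energy = singleton_energy(n_obs=120, n_residue=1000, n_env=3000, n_total=50000)
```

With these numbers the residue is seen twice as often as expected, giving an energy of -ln 2, about -0.69; under this sign convention negative values mark favourable environments.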

Singleton score matrix

                Helix                      Sheet                      Loop
       Buried   Inter  Exposed    Buried   Inter  Exposed    Buried   Inter  Exposed
ALA    -0.578  -0.119  -0.160      0.010   0.583   0.921      0.023   0.218   0.368
ARG     0.997  -0.507  -0.488      1.267  -0.345  -0.580      0.930  -0.005  -0.032
ASN     0.819   0.090  -0.007      0.844   0.221   0.046      0.030  -0.322  -0.487
ASP     1.050   0.172  -0.426      1.145   0.322   0.061      0.308  -0.224  -0.541
CYS    -0.360   0.333   1.831     -0.671   0.003   1.216     -0.690  -0.225   1.216
GLN     1.047  -0.294  -0.939      1.452   0.139  -0.555      1.326   0.486  -0.244
GLU     0.670  -0.313  -0.721      0.999   0.031  -0.494      0.845   0.248  -0.144
GLY     0.414   0.932   0.969      0.177   0.565   0.989     -0.562  -0.299  -0.601
HIS     0.479  -0.223   0.136      0.306  -0.343  -0.014      0.019  -0.285   0.051
ILE    -0.551   0.087   1.248     -0.875  -0.182   0.500     -0.166   0.384   1.336
LEU    -0.744  -0.218   0.940     -0.411   0.179   0.900     -0.205   0.169   1.217
LYS     1.863  -0.045  -0.865      2.109  -0.017  -0.901      1.925   0.474  -0.498
MET    -0.641  -0.183   0.779     -0.269   0.197   0.658     -0.228   0.113   0.714
PHE    -0.491   0.057   1.364     -0.649  -0.200   0.776     -0.375  -0.001   1.251
PRO     1.090   0.705   0.236      1.249   0.695   0.145     -0.412  -0.491  -0.641
SER     0.350   0.260  -0.020      0.303   0.058  -0.075     -0.173  -0.210  -0.228
THR     0.291   0.215   0.304      0.156  -0.382  -0.584     -0.012  -0.103  -0.125
TRP    -0.379  -0.363   1.178     -0.270  -0.477   0.682     -0.220  -0.099   1.267
TYR    -0.111  -0.292   0.942     -0.267  -0.691   0.292     -0.015  -0.176   0.946
VAL    -0.374   0.236   1.144     -0.912  -0.334   0.089     -0.030   0.309   0.998

Secondary structure match score
• Secondary structure prediction is mature and can achieve ~80% accuracy
• The performance of using probabilities of the predicted three secondary structure states (α-helix, β-strand, and loop) is better
• May have a risk of over-dependence on secondary structure prediction


Local backbone potential

From Chandonia JM, Cohen FE J Mol Biol. 2003 Sep 26;332(4):835-50

Pair-wise energy
• It measures the preference of a pair of amino acids to be close in 3D space.
• How close is close?
  --Distance dependent
  --Single cutoff
  --Cα, Cβ, or centroid of the sidechain
• Observed occurrence of a pair compared with its “expected” occurrence
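For the "single cutoff" option, deciding which residue pairs count as close can be sketched as below. The coordinates, the 5.0 Å cutoff and the minimum sequence separation of 3 are hypothetical choices for illustration; which atom (Cα, Cβ or side-chain centroid) the points represent is left open, as on the slide.

```python
import math


def contact_pairs(coords, cutoff, min_separation=3):
    """Return index pairs (i, j), j - i >= min_separation, closer than cutoff."""
    pairs = []
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                pairs.append((i, j))
    return pairs


# Hypothetical hairpin-like trace: six points 3.8 A apart, folding back on itself.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
          (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]

contacts = contact_pairs(coords, cutoff=5.0)
```

Counting such contacts for every pair of residue types over a structure database gives the "observed occurrence" that is then compared with the expected one.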

Parameters for pairwise term

ALA   -140
ARG    268   -18
ASN    105   -85  -435
ASP    217  -616  -417    17
CYS    330    67   106   278 -1923
GLN     27   -60  -200    67   191  -115
GLU    122  -564  -136   140   122    10    68
GLY     11   -80  -103  -267    88   -72   -31  -288
HIS     58  -263    61  -454   190   272  -368    74  -448
ILE   -114   110   351   318   154   243   294   179   294  -326
LEU   -182   263   358   370   238    25   255   237   200  -160  -278
LYS    123   310  -201  -564   246  -184  -667    95    54   194   178   122
MET    -74   304   314   211    50    32   141    13    -7   -12  -106   301  -494
PHE    -65    62   201   284    34    72   235   114   158   -96  -195   -17  -272  -206
PRO    174   -33  -212   -28   105   -81  -102   -73   -65   369   218   -46    35   -21  -210
SER    169   -80  -223  -299     7  -163  -212  -186  -133   206   272   -58   193   114  -162  -177
THR     58    60  -231  -203   372  -151  -211   -73  -239   109   225   -16   158   283   -98  -215  -210
TRP     51  -150   -18   104    52   -12   157   -69  -212   -18    81    29    -5    31  -432   129    95   -20
TYR     53  -132    53   268    62   -90   269    58    34  -163   -93  -312  -173    -5   -81   104   163   -95    -6
VAL   -105   171   298   431   196   180   235   202   204  -232  -218   269   -50   -42    46   267    73   101   107  -324
       ALA   ARG   ASN   ASP   CYS   GLN   GLU   GLY   HIS   ILE   LEU   LYS   MET   PHE   PRO   SER   THR   TRP   TYR   VAL


Multi-body energy scores
--Three-body interaction
--Four-body interaction
• Delaunay tessellation
  --Divide the model into a network of groups of 4 residues
  --The 4 residues represent the vertices of a tetrahedron
  --The circumsphere of each tetrahedron is empty (no other residue within the circumsphere)

[Figure: Delaunay triangle (tetrahedron in 3 dimensions)]

Parameter optimization
• The contribution of each term (weight).
• Based on threading performance on a training set (fold recognition and alignment accuracy).
• Different weights for different classes? (superfamily, fold)
  --pair-wise may contribute more for fold level threading
  --mutation/profile terms dominate in superfamily level threading

E_total = ω_m E_mutation + ω_s E_singleton + ω_p E_pairwise + ω_g E_gap + ω_ss E_ss
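The weighted sum is simple to express in code. In this sketch the term names follow the equation, while the energy values and weights are made-up numbers; fitting the weights on a training set, as the slide describes, is outside the sketch.

```python
def total_energy(terms, weights):
    """Weighted sum of energy terms; both dicts must have the same keys."""
    if set(terms) != set(weights):
        raise ValueError("terms and weights must cover the same energy terms")
    return sum(weights[name] * value for name, value in terms.items())


# Hypothetical per-alignment energy terms and weights.
terms = {"mutation": -12.0, "singleton": -3.5, "pairwise": -1.25,
         "gap": 4.0, "ss": -2.0}
weights = {"mutation": 1.0, "singleton": 0.5, "pairwise": 2.0,
           "gap": 1.0, "ss": 0.25}

e_total = total_energy(terms, weights)
```

Keeping separate weight sets per class (for example one for fold-level and one for superfamily-level recognition) only requires passing a different `weights` dict.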

Mathematical formulation of threading
[Figure: target sequence (S) residues Si ... Si+8 (S T K Y Q C D D A) placed onto template structure (T) positions Tj, Tj+2, Tj+4, Tj+7; the placement is scored by Etotal]


Protein threading is NP-hard (Lathrop, 1994) when both of the following are allowed: --gaps --long range residue interactions

Types of alignments
• Global alignment: the alignment of complete sequences
  --Needleman & Wunsch 1970 J Mol Biol 48:443
• Local alignment: find best subsequence match
  --Smith & Waterman 1981 J Mol Biol 147:195
  --modified from the Needleman-Wunsch algorithm
  --can be done with heuristics (FASTA and BLAST)
• Semi-global: find best match without penalizing gaps on the ends of the alignment

Algorithms-dynamic programming Foundation: any partial subpath ending at a point along the true optimal path must itself be an optimal path leading up to that point. So the optimal path can be found by incremental extensions of optimal subpaths.

Steps:
1. Initialization: construct an (n+1) x (m+1) matrix F for two sequences of lengths n and m.
2. Matrix fill: for each cell in the matrix F, check all possible pathways back to the beginning of the sequences (allowing insertions and deletions) and give that cell the maximum score.
3. Traceback: construct an alignment back from the last cell in the matrix (or from the highest-scoring cell) to give the highest-scoring alignment.
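The three steps can be sketched as a small global (Needleman-Wunsch style) aligner. The scoring scheme is an assumption chosen for brevity: +1 for a match, -1 for a mismatch and a linear gap penalty of 2 per gap position; a real aligner would use a substitution matrix and usually affine gaps.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap_penalty=2):
    """Global alignment with a linear gap penalty; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # 1. Initialization: (n+1) x (m+1) matrix with gap-only first row and column.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -gap_penalty * i
    for j in range(1, m + 1):
        F[0][j] = -gap_penalty * j
    # 2. Matrix fill: best of diagonal (match/mismatch), up (gap in b), left (gap in a).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] - gap_penalty,
                          F[i][j - 1] - gap_penalty)
    # 3. Traceback from the last cell back to the origin.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                out_a.append(a[i - 1]); out_b.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] - gap_penalty:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


result = needleman_wunsch("ACGT", "AGT")
```

For this pair the best global alignment places one gap opposite the C, for a score of 3 matches minus one gap penalty.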


Dynamic programming To fill cell F(i,j) in the matrix:
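The slide's own fill rule is not preserved in this text. For orientation, the standard recurrence for global alignment with a linear gap penalty w(k) = gk is given here as an assumption, where s(a_i, b_j) is the substitution score for aligning residue a_i with residue b_j:

```latex
F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(a_i, b_j) & \text{align } a_i \text{ with } b_j \\
F(i-1,\,j) - g & \text{gap in } b \\
F(i,\,j-1) - g & \text{gap in } a
\end{cases}
\qquad
F(0,0) = 0,\quad F(i,0) = -g\,i,\quad F(0,j) = -g\,j
```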

Complexity of dynamic programming
• Computing time: O(nm), where n and m are the sequence lengths.
• Retrieval (traceback) time: O(max(n, m)).
• Required memory: O(nm).
• Keep the computational complexity in mind while programming.

Algorithms considering long range interactions
• Approximation Algorithm
  --Frozen approximation algorithm (A. Godzik et al.)
  --Double dynamic programming (D. Jones et al.)
  --Monte Carlo sampling (S.H. Bryant et al.)
• Exact Algorithm
  --Branch-and-bound (R.H. Lathrop and T.F. Smith)
  --Divide-and-conquer (Y. Xu et al.) --PROSPECT
  --Linear programming (J. Xu et al.) --RAPTOR


Approximation Algorithms
• The standard dynamic programming cannot be used when considering pairwise interactions. The score for placing one residue at a given position of the structure depends on the position of other residue(s).
• Frozen approximation algorithm (FAA) and double dynamic programming (DDP) are heuristic algorithms, which use dynamic programming, but make an approximation to deal with pairwise interactions.

Frozen approximation

From “Protein Bioinformatics” Eidhammer I. Jonassen I. And Taylor WR

• assuming the neighbouring fold positions are occupied by the residue

• can be iterated.
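A rough sketch of the frozen idea, assuming (as the method is usually described) that in the first pass each neighbouring position is treated as occupied by the template's own residue. The pair energies, the neighbour map and all names are hypothetical illustrations, not the published implementation.

```python
# Hypothetical pair energies (negative = favourable); unlisted pairs score 0.
PAIR_TABLE = {("K", "E"): -667, ("I", "L"): -160, ("L", "L"): -278}


def pair_energy(a, b):
    """Symmetric lookup in the toy pair table."""
    return PAIR_TABLE.get((a, b), PAIR_TABLE.get((b, a), 0))


def frozen_scores(query, template_residues, neighbours):
    """scores[i][p]: pairwise energy of query[i] at template position p,
    with every neighbour of p frozen to the template's own residue."""
    return [[sum(pair_energy(aa, template_residues[q]) for q in neighbours[p])
             for p in range(len(template_residues))]
            for aa in query]


template_residues = "LEKL"
neighbours = {0: [3], 1: [2], 2: [1], 3: [0]}  # positions in spatial contact
scores = frozen_scores("IK", template_residues, neighbours)
```

Because each entry depends only on one query residue and one template position, the matrix can be fed to ordinary dynamic programming; iterating means re-freezing the neighbours to the query residues placed by the previous alignment and recomputing.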

Double Dynamic Programming • It was originally developed for structural alignments. (Taylor and Orengo, 1989; Orengo and Taylor, 1990)

• Use two levels of dynamic programming: a high level scoring matrix, and a low level matrix for each high level matrix element. • Each Fij in the high level scoring matrix shows how likely it is that the pair (i, j) is on an optimal alignment. • For each Fij, the likelihood is found by a (low level) optimal alignment with the constraint that Fij is part of the alignment. • The scores along the low level alignments are accumulated in the high level scoring matrix.


Double Dynamic Programming

From “Protein Bioinformatics” Eidhammer I. Jonassen I. And Taylor WR

Formulation with pairwise contacts --no gap for core alignment --pairwise interactions only between cores
[Figure: detailed and simplified pair contacts between cores i, i+1, i+2, i+3]



Divide and Conquer (1) Divide-and-conquer algorithm:
• repeatedly bi-partition a template into sub-structures till cores
• merge partial alignments into longer alignments optimally
[Figure: bi-partition of the template; pair contacts; template structure; query sequence]

Divide and Conquer (2) Template partition Xu et al., 1998, 2000

Divide and Conquer (3) Aligned cores preserve their order Aligned cores do not overlap

sequence-template alignment
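The two constraints on aligned cores can be expressed as a small validity check. The representation is an assumption for illustration: each core, listed in template order, is given as a (start, length) placement on the query sequence.

```python
def valid_core_placement(placements):
    """True if cores, listed in template order, keep that order on the
    query sequence and never overlap one another."""
    previous_end = 0
    for start, length in placements:
        if start < 0 or length <= 0:
            return False
        if start < previous_end:  # out of order or overlapping the previous core
            return False
        previous_end = start + length
    return True


ok = valid_core_placement([(0, 5), (7, 4), (11, 6)])
overlapping = valid_core_placement([(0, 5), (4, 4)])
out_of_order = valid_core_placement([(10, 3), (2, 4)])
```

A single comparison against the end of the previous core is enough to enforce both rules, because a core that starts before the previous one ends is either overlapping or out of order.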


Divide and Conquer (4) Computational complexity: mn + MnCNC m: length of template n: length of sequence M: number of cores in template N: maximum allowed gap for loop alignment C: topological complexity (