Hidden Markov Models and an Introduction to their Use in Bioinformatics

L. Carl Leinbach
Gettysburg College, Gettysburg, PA, USA
[email protected]

Topics for Discussion

I. Markov Chains
II. Hidden Markov Models
III. DNA, Genes, and CpG Islands
IV. Relating HMMs and CpG Islands
V. Viterbi's Algorithm and Modifications
VI. The Next Step – Finding the Generating Path

Markov Chains

The defining property of a Markov chain is that it is "memory-less": the only state of the system used to determine the next state is the present state. What happened in earlier states is irrelevant to that determination.
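Formally (a standard statement of the property, not spelled out on the slide): for a chain with states X₀, X₁, X₂, …, the memory-less condition reads

```latex
P(X_{n+1} = s \mid X_n = x_n,\, X_{n-1} = x_{n-1},\, \dots,\, X_0 = x_0)
  \;=\; P(X_{n+1} = s \mid X_n = x_n)
```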

EXAMPLE: Consider the following rather naïve model for weather forecasting.

States: S (Sunny), R (Rainy), and C (Cloudy)

The following matrix is based on long-term weather averages for the "Great Mid-Appalachian Mountain" Weather Reporting Region; rows give today's state and columns give tomorrow's:

          S     R     C
    S   0.58  0.23  0.19
W = R   0.15  0.52  0.33
    C   0.25  0.35  0.40

We can use this model to predict the next day's weather: if it is raining today, what is the chance that tomorrow will be sunny? Based on past averages, there is a 15% chance of tomorrow being sunny.

Making Longer Term Forecasts

In the spirit of the "Evening News", we will attempt to use this model to help those planning a weekend party by giving a five-day forecast. To do this, we need only take powers of the matrix W: W^n gives the probabilities for the weather n days from today, here with n = 5.
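The slides carry out this computation in Derive; as a cross-check, here is the same calculation sketched in Python with NumPy, using the matrix from the previous slide:

```python
import numpy as np

# Transition matrix W; rows are today's state (S, R, C),
# columns are tomorrow's state.
W = np.array([[0.58, 0.23, 0.19],
              [0.15, 0.52, 0.33],
              [0.25, 0.35, 0.40]])

# One-step forecast: if it rains today, P(sunny tomorrow) = W[R][S].
print(W[1, 0])                     # 0.15

# Five-day forecast: row k of W^5 gives the weather distribution
# five days from now, given state k today.
W5 = np.linalg.matrix_power(W, 5)
print(W5[0])                       # five days after a sunny day; first entry ~0.32
```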

The last matrix in the Derive VECTOR array gives the probabilities, based on this model, for the predicted weather states five days from today, given today's state. Thus we see that if it is Sunny today, there is approximately a 32% chance that in five days it will be Sunny.

An Interesting Characteristic of Markov Chains Looking at the last matrix in the previous display, you may have observed that the probabilities were “settling down” as the powers increased. Let’s look at some higher powers of W:

In fact, the first seven decimal places of each entry do not change at all between the 20th and 100th powers, and within each column the entries approach a single common value. This is indeed the case: the rows of W^n converge to a common limiting row, the stationary distribution of the chain, which is a left eigenvector of W with associated eigenvalue 1.
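To verify this numerically, one can ask for the eigenvector of Wᵀ associated with eigenvalue 1. A minimal sketch (not part of the original Derive session):

```python
import numpy as np

W = np.array([[0.58, 0.23, 0.19],
              [0.15, 0.52, 0.33],
              [0.25, 0.35, 0.40]])

# The stationary distribution pi satisfies pi @ W = pi, i.e. pi is a
# left eigenvector of W (a right eigenvector of W.T) with eigenvalue 1.
vals, vecs = np.linalg.eig(W.T)
pi = np.real(vecs[:, np.isclose(vals, 1)].flatten())
pi /= pi.sum()                     # normalize to a probability vector
print(pi)                          # approx. [0.317, 0.376, 0.307]

# Every row of a high power of W converges to pi.
print(np.linalg.matrix_power(W, 100)[0])
```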

Displaying A Markov Chain As A Graph

[Figure: the weather chain drawn as a directed graph. Nodes S, R, and C carry self-loops labeled 0.58, 0.52, and 0.40, and the arrows between the nodes are labeled with the remaining entries of W.]

From Markov Chains To Hidden Markov Models

Before discussing the application of this topic to bioinformatics, we present a variation of Durbin's "Occasionally Dishonest Casino". In this example, instead of loaded dice being substituted for fair dice on occasion, we consider coin flips where the person doing the flipping is able to discreetly substitute a weighted coin for a fair coin. We will call this example "The Dishonest Street Gambler".

THE SCENARIO

An amateur magician who is very good at sleight of hand decides to put his talent to work and takes two coins to a busy street corner. One is a fair coin, with probability ½ of coming up Heads or Tails. The other looks like the fair coin but is weighted to show Heads about 75% of the time. To avoid detection, the magician turned street gambler only substitutes the weighted coin on occasion, about ¼ of the time on average; the probability of switching back is 0.1. A policeman watching the gambler in action keeps a record of the gambling transactions, secretly recording the result of each flip. How can he detect what is going on?

Constructing The Model

Coin flipping as a two-state diagram: an outcome of Heads with probability p and Tails with probability 1 − p.

For the fair coin: p = ½. For the weighted coin: p = ¾. Throughout, p is the probability of an outcome of Heads.

Combining The Two Coins:

For this model, the two coin-flipping Markov models are contained within a switching model that is "hidden" from public view.

Next, Run A Simulation

The Derive program was written under the assumption that the magician/gambler always starts with the fair coin for the first five flips; after that, switching between coins happens at random according to the parameters of the model.
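The original program is in Derive and is not reproduced here; the sketch below redoes the simulation in Python under the parameters stated in the scenario (fair-to-weighted switch probability ¼, weighted-to-fair 0.1, weighted coin showing Heads with probability ¾):

```python
import random

P_HEADS = {1: 0.5, 2: 0.75}     # 1 = fair coin, 2 = weighted coin
SWITCH  = {1: 0.25, 2: 0.10}    # probability of switching to the other coin

def simulate(n_flips=100, seed=None):
    rng = random.Random(seed)
    coin, flips, states = 1, [], []
    for i in range(n_flips):
        # The gambler keeps the fair coin for the first five flips,
        # then may switch coins before each subsequent flip.
        if i >= 5 and rng.random() < SWITCH[coin]:
            coin = 3 - coin
        states.append(coin)
        flips.append('H' if rng.random() < P_HEADS[coin] else 'T')
    return ''.join(flips), states

obs, path = simulate(seed=1)
print(obs)      # the observed flips
print(path)     # the hidden coin states
```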

The Observed Results (100 coin flips):

HHHTTHTTHHHHHHHHTTTHHHHHTHHHHHHHHHHHTHHHTHHTTHHHTHHHHHHHHHHHHHTHTHHHHHHHTHHHHHHHHHTHHHHHHHHHHHHHHHHHH

The Underlying States (1 = fair coin, 2 = weighted coin):

[1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2]

What Can Be Concluded? What Questions Are Unanswered?

From observation:
• With only 17 Tails in a sequence of 100 coin flips, it is highly likely that the flips did not come from a fair coin alone.
• Still, don't try to take that argument to court: observation is not proof. Even though the ratio of Heads to Tails is extremely high, such a run is still possible with a fair coin.
• Longer observation can shrink that possibility and strengthen our case.

What we don't know:
• At what points did our culprit start using the weighted coin, and when did he switch back to the fair coin?
• What is the long-term ratio of Heads to Tails for the weighted coin?
• As noted above, long-term, repeated observation can shed light on these questions, but cannot be absolutely conclusive.
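To put a number on "possible but unlikely" (an illustration, not from the slides): the chance that a fair coin alone produces at least the observed 83 Heads in 100 flips is a binomial tail probability:

```python
from math import comb

# P(at least 83 heads in 100 flips of a fair coin)
p = sum(comb(100, k) for k in range(83, 101)) / 2**100
print(p)    # ~7e-12: possible, but astronomically unlikely
```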

A Very Brief Overview Of DNA, Genes, and CpG Islands

DNA: Deoxyribonucleic acid, made up from four nucleic bases, or nucleotides: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). The DNA molecule has the double helix configuration made famous by Watson and Crick's 1953 publication. Although they stood on the shoulders of many giants, Watson and Crick are the two names most often associated with DNA. The double helix has many properties; the one discovered by Watson and Crick is that the two strands are complementary: A always bonds (via hydrogen bonds) to T, and C always bonds to G. From these four nucleotides the genetic codes of all living things are built.

Genes: Groupings of DNA that are the "instruction manual" (coded instructions) for everything the body needs, especially proteins. The human body has over 25,000 genes; some have been decoded relative to their function, and others have been associated with diseases such as cystic fibrosis, ALS, sickle cell, and Huntington's disease. More than 98% of the human genome sequence does not encode proteins. Those regions that do encode proteins are called "open reading frames".

The Basic Paradigm of Biology

1. DNA produces RNA
2. RNA produces protein

Taken from: E. Birney, "Hidden Markov Models in biological sequence analysis", IBM J. Res. & Dev., Vol. 45, No. 3/4, May/July 2001.

Genes (continued): The story is further complicated by the fact that in eukaryotes (all life except the simplest forms), a gene is broken up by noncoding sequences called introns. The coding regions are called exons. During the protein-building process, the introns are removed and the exons are joined together according to their function.

Chromosomes: Basically, packages of genes designed for different purposes, say eye color, shape, transporting oxygen in the blood, etc. Humans have 23 pairs of chromosomes (they come in pairs); the most famous are the X and Y chromosomes, which determine sex and other body characteristics.

CpG islands: The genomes of all vertebrates appear to have fewer instances of a C followed by a G than would normally be expected. The reason is a process called methylation. Simply explained, whenever a CG pair occurs in a methylated region (denoted CpG to signify that the two are adjacent on the same strand, rather than base-paired across strands), the methylation process converts the C to a T. In the region of an exon, methylation is turned off, and upstream from the exon CpG pairs are more abundant. These areas are called CpG islands and can be used to locate exons.

DNA As A Markov Chain

While a strand of DNA represents the result of a deterministic process, there still seem to be similarities to the models we discussed previously: the generation of the strand appears to follow certain rules similar to those of a random process. The following diagram is taken from Durbin, Eddy, Krogh, and Mitchison (1998), p. 49.

In this diagram, each of the nucleotides A, C, G, T is treated as a state of the Markov process. There is also a beginning state, B, and an end state, E. There are no arrows returning to B or leaving from E; they just signify the start and end of the sequence. Generally the probability of a transition from a state to E is ignored. We will see later how to handle the probabilities from B to the first nucleotide in the sequence.

Determining The Probabilities

Page 50 of the book by Durbin, Eddy, Krogh, and Mitchison gives the results of an examination of some 60,000 nucleotides, using maximum-likelihood estimates of how often a particular nucleotide is followed by another. The "+" denotes the probabilities within a CpG island and the "-" denotes a non-CpG region.

Clearly these describe different processes. In particular, note the transition probability from C to G in the "+" model versus the "-" model.
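The table itself appears on the slide as an image. For reference, the maximum-likelihood values as printed in Durbin et al. (1998), p. 50, are reproduced below; they are re-typed here, so check them against the book before reuse:

```python
# Transition probabilities from Durbin et al. (1998), p. 50.
# Rows: current nucleotide; columns: next nucleotide (A, C, G, T).
PLUS = {  # inside a CpG island
    'A': [0.180, 0.274, 0.426, 0.120],
    'C': [0.171, 0.368, 0.274, 0.188],
    'G': [0.161, 0.339, 0.375, 0.125],
    'T': [0.079, 0.355, 0.384, 0.182],
}
MINUS = {  # outside a CpG island
    'A': [0.300, 0.205, 0.285, 0.210],
    'C': [0.322, 0.298, 0.078, 0.302],
    'G': [0.248, 0.246, 0.298, 0.208],
    'T': [0.177, 0.239, 0.292, 0.292],
}
# The C -> G entry drops from 0.274 inside an island to 0.078 outside.
print(PLUS['C'][2], MINUS['C'][2])
```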

An HMM For CpG Islands

This diagram is also taken from Durbin, Eddy, Krogh, and Mitchison; it can be found on p. 52. Note that it assumes that the states at the top of the model and those at the bottom are separately connected by a model similar to the one shown previously (without states B and E).

The B and E states will be added to the front and back ends of this model. The probability from B to any one of the states within the model will be 1/8, and the probability from any state to E will be assigned a value of 0 for convenience, i.e., we will ignore state E.
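One way to assemble the full 8-state transition matrix from the two 4×4 tables is sketched below. The switching probability p_switch is an illustrative placeholder, since choosing it is precisely the task taken up later:

```python
import numpy as np

STATES = ['A+', 'C+', 'G+', 'T+', 'A-', 'C-', 'G-', 'T-']

def build_transition(plus, minus, p_switch=0.05):
    """Assemble the 8x8 transition matrix of the CpG HMM.

    plus, minus : 4x4 row-stochastic within-model matrices (A, C, G, T).
    p_switch    : probability of leaving the current model at each step
                  (an illustrative placeholder value, not from the slides).
    """
    T = np.zeros((8, 8))
    T[:4, :4] = (1 - p_switch) * plus    # stay inside the "+" model
    T[:4, 4:] = p_switch * minus         # jump from "+" to "-"
    T[4:, 4:] = (1 - p_switch) * minus   # stay inside the "-" model
    T[4:, :4] = p_switch * plus          # jump from "-" to "+"
    return T                             # each row still sums to 1

# Begin state B: enter each of the 8 states with probability 1/8;
# transitions to the end state E are assigned 0 and ignored, as above.
start = np.full(8, 1 / 8)
```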

Investigating A Chromosome Sequence For a CpG Region

To find an example sequence, I went to the NCBI website and searched the nucleotide database with the phrase "Human CpG Islands on Chromosome 3". I found a sequence of 18,410 bp under reference number NG_023270.1. I then used the CpG Island Finder at http://dbcat.cgm.ntu.edu.tw/. Here is part of the information that I received:

The sequence you input has 18410 nucleotides in length. There are 9 possible CpG island(s) found. Possible CpG islands shown in graph:

Defining My Task

Given the example sequence, where the CpG islands have been fairly well identified, the Hidden Markov Model above, and the probabilities within each of the two models at the base of the HMM already established, is it possible to conduct experiments to find the model-switching probabilities that make the most sense? In other words, my task is to find probabilities for the switching states that will yield the result that a sequence chosen from a CpG island came from the "+" states. I decided to center my investigation on a sequence from CpG region 5, the one most dense in CG dinucleotide pairs.

5. start site:7428  end site:9269  length:1842  GC content:63%

Since a lecture slide cannot hold the entire sequence, here is a shorter one to help us decode what the header is saying:

2. start site:3361  end site:3642  length:282  GC content:59%
CTCACTTGAAGAGAATTCAACGTCTGCACCCAGTGTGCTGGTTCCTC+TCCCCAGGCAAGCTGCAGCAGTTCGAGGGCAGCCGTTGGCATACTGTGGCCCCCTCAGAGCAGCAAAAGC+TGAGCAAGTTGGACGGGCAAGTGTGGATCGCCCTGTACAACCTGCTGCTAAGCCCTGAGGCTCAGGCGCG+CTACTGCCTCACAAGTTTTGCCAAGGGACGGCTACTCAAGGTCAGACTCCCTCCGCACCAGCCCCCACAG+CCCCAGTACCGCCCTCCCCAT

Developing a Formal Language For HMMs

Simply looking at the symbols is no longer enough to tell us what state the system is in: the nucleotides are not labeled with a "+" or a "-". We had a similar situation with the street gambler. The two coins looked the same, so unless we could feel a coin and examine it closely, we could not tell a fair-coin H or T from a weighted-coin H or T. In the present case we also do not know the probabilities, as we did in the street gambler example.

First, let's take a more formal view of the HMM. To do this, we need to decouple the state sequence, which we call a path, π, from the sequence of symbols that we observe: A, C, G, and T. Since the state sequence is itself a Markov chain, it depends only on the previous state, so

a_kl = Prob(π_i = l | π_{i−1} = k)

A second set of probabilities is introduced for the symbols b; with DNA, each state is associated with a particular symbol:

e_k(b) = Prob(x_i = b | π_i = k)

These are known as emission probabilities.
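With these symbols, the probability of an observed sequence x together with a state path π factors into transitions and emissions (as in Durbin et al., with the convention that π_{L+1} = 0 is the end state):

```latex
P(x, \pi) \;=\; a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}
```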

Finding The Most Probable State Path

If we are testing various choices of path probabilities, it makes sense to look for the most probable path through a sequence. We already have the emission probabilities for the symbols once we are in a particular state. After developing the algorithm, we will compare results for two different sets of state-change probabilities, defined by the following matrices.

First, one based on the fraction of the sample occupied by CpG islands:

A = | 0.616  0.384 |
    | 0.341  0.658 |

The second places a stricter criterion for a path to exit a state:

A = | 0.97  0.03 |
    | 0.01  0.99 |

It is obvious which matrix will yield better results, but the comparison may prove interesting.

The Viterbi Algorithm

The path through a sequence consists of repeating the same two steps: transition and emission. Since the time and space needed to check each and every path grow exponentially, it is natural to use dynamic programming. Here is an outline of the algorithm as given in Durbin et al.:

Initialization (i = 0):   v_0(0) = 1;  v_k(0) = 0 for k > 0
Recursion (i = 1, …, L):  v_l(i) = e_l(x_i) · max_k( v_k(i−1) · a_kl );
                          ptr_i(l) = argmax_k( v_k(i−1) · a_kl )
Termination:              P(x, π*) = max_k( v_k(L) · a_k0 );
                          π*_L = argmax_k( v_k(L) · a_k0 )
Traceback (i = L, …, 1):  π*_{i−1} = ptr_i(π*_i)

This can easily be implemented in Derive.
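The Derive implementation appears on the following slides as screenshots. For readers without Derive, here is a sketch of the same recursion in Python; the dictionary-based interface and the small gambler example at the end are illustrative choices, not the slide's code:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for an observed symbol sequence.

    obs     : list of observed symbols x_1 .. x_L
    states  : list of hidden states
    start_p : state -> probability of entering from the begin state B
    trans_p : state -> state -> transition probability a_kl
    emit_p  : state -> symbol -> emission probability e_k(b)
    """
    v = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    ptr = [{}]
    for i in range(1, len(obs)):
        v.append({})
        ptr.append({})
        for l in states:
            k_best = max(states, key=lambda k: v[i - 1][k] * trans_p[k][l])
            ptr[i][l] = k_best
            v[i][l] = v[i - 1][k_best] * trans_p[k_best][l] * emit_p[l][obs[i]]
    # Termination (transitions to the end state E ignored, as in the slides)
    # followed by traceback through the stored pointers.
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for i in range(len(obs) - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    return v[-1][last], path[::-1]

# Illustrative use with the street-gambler model (1 = fair, 2 = weighted):
states = [1, 2]
start  = {1: 0.5, 2: 0.5}
trans  = {1: {1: 0.75, 2: 0.25}, 2: {1: 0.10, 2: 0.90}}
emit   = {1: {'H': 0.5, 'T': 0.5}, 2: {'H': 0.75, 'T': 0.25}}
prob, path = viterbi(list("HHHTTHHHHHHHHT"), states, start, trans, emit)
print(prob, path)
```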

The Data For The Viterbi Program

[Table: the emission matrix for the symbols A, C, G, T in the "+" and "-" models.]
[Table: the trial state-change matrix between model "+" and model "-", together with a small test sequence.]

Viterbi Implemented in Derive

[Derive code screenshots, annotated as follows:]
• Setup
• Initialize the move out of B to 1/8
• Return probability and backtrack pointers
• Find the Viterbi vector for step i; choose max pointers for each column

The Results

Some Notes:
• Pointer digits 1–4 indicate model "+"; digits 5–8 indicate model "-".
• The top result line is the actual result using the "=" operation.
• It took less than 1 second to obtain the answer (OK, it was 0.999 seconds).
• The fractions for the probabilities get very small; that is the reason I am making this presentation at a CAS conference. Durbin et al. talk about using base-two logarithms and floating-point round-off. On the next pages I will show that for a sequence 200 bp long, the probability is on the order of 10^-150. What would it be if we tried the entire 1,842 bp gene expression?
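The log trick mentioned above is easy to demonstrate (a Python illustration, with 0.3 standing in for a typical per-step probability):

```python
import math

# Rational arithmetic (as in Derive) keeps the tiny probabilities exact,
# but in double precision a long product underflows to 0.0:
p = 1.0
for _ in range(1842):
    p *= 0.3          # a typical per-step probability
print(p)              # 0.0 (true value ~1e-963, below the ~1e-308 floor)

# The standard fix, suggested by Durbin et al.: work with logarithms,
# turning the product into a sum.
logp = 1842 * math.log2(0.3)
print(logp)           # ~ -3199 bits; finite and numerically stable
```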

Examining A Sequence From The Gene Expression

The time shown at the bottom of the screen is the time for Derive to do the original calculation using rational arithmetic. The fraction extended over 8 lines (1 line equals numerator/denominator) on my Toshiba mini notebook (NB505).

Using a More Realistic State Change Matrix

• Note that after the original (brief) excursion into the "-" state, the most likely path stayed in the "+" state.

A Sequence In The Intron Region

The sequence R for this example covers bp 1400–1600 of our parent sequence. While the sequence does not lie entirely in the intron region, it is a rather CpG-poor region. Note that all pointers are greater than 4.

Where Do We Go From Here?

The main problem is determining the state parameter values for the model. Generally, the community uses training sequences where the actual outcomes can be observed. But how is this information used to find the optimal path through a related, unknown sequence? There are a variety of strategies, but they go beyond the scope of our investigation and the audience we will be addressing. For those interested in pursuing this problem, the text by Durbin, Eddy, Krogh, and Mitchison is a good starting point; there, in addition to the Viterbi algorithm, you can learn about the Forward and Backward algorithms and Baum-Welch training.

A lot of work is being done by David Haussler at the University of California at Santa Cruz. He is looking at evolutionarily conserved exons, combining HMMs with phylogenetics to yield better gene prediction. The annoying fact is that the problem, as it is currently being approached, is NP-hard and will probably not be defeated head-on. Does computer algebra have a role to play, beyond supplying rational arithmetic, in finding a new approach? That is our challenge.

References

1. R. Durbin, S. Eddy, A. Krogh, G. Mitchison, Biological Sequence Analysis, Cambridge University Press, Cambridge, UK, 1998.
2. K. Hyman, "Gene Finding with Hidden Markov Models", The Scientist – Magazine of the Life Sciences, http://www.thescientist.com/article/display/1534/#ixzz1O212EJ7C
3. A. Isaev, Introduction to Mathematical Methods in Bioinformatics, Springer, Berlin, 2004.
4. A. Siepel, D. Haussler, "Computational Identification of Evolutionarily Conserved Exons", Proceedings of the 8th Annual International Conference on Research in Computational Molecular Biology (RECOMB), 2004, http://www.cse.ucsc.edu/~acs/recomb2004.pdf
5. "CpG Islands (2): Building a Hidden Markov Model", http://CodeCereal.blogspot.com/2011/06/cpg-islands2.html
6. "CpG Islands (3): Model Evaluation", http://CodeCereal.blogspot.com/2011/06/cpg-islands3.html