CONSTRUCTING PAM MATRICES

WINFRIED JUST

Date: 05/03/05

In this note we will use several different matrices to describe the same phenomenon, so we must be very careful to distinguish them by name. The standing assumption is that we have a fixed alphabet of characters {c1, c2, . . . , cn} and that we are looking at a process (evolution, in our case) by which some of these characters mutate into each other. We will assume that for a fixed unit of time T the probability that character ci mutates into character cj is a fixed number mij. The matrix M = [mij], 1 ≤ i, j ≤ n, that lists these probabilities will be referred to as the mutation probability matrix. Note that the mutation probability matrix has two important properties: (1) all entries are nonnegative numbers, and (2) the entries in each row sum to 1. Thus M is a stochastic matrix.

There is another underlying assumption here, namely that the probability that character ci mutates into character cj over a time interval T does not depend on the prior history of the process. Thus we model evolution as a Markov process. In general, a (discrete-time) Markov process or Markov chain traces a system through a sequence of steps. At each step the system is in one of a fixed number of states; in our case, the states are the characters c1, . . . , cn. If the system is in state i, then it switches to state j with probability tij, called the transition probability. Note that the matrix of transition probabilities is exactly the same thing that we called the matrix of mutation probabilities.

Here is an interesting observation: If M is a stochastic n × n matrix, then there is usually a unique vector [q1, . . . , qn] such that

(1)

[q1, . . . , qn] ◦ M = [q1, . . . , qn]

and

(2)

lim  M^k  =  [ q1  q2  ...  qn ]
k→∞          [ q1  q2  ...  qn ]
             [ ..  ..  ...  .. ]
             [ q1  q2  ...  qn ]

The operation ◦ is matrix multiplication; M^k is obtained by multiplying M by itself k times. For our purposes it is not necessary to know the details of how matrix multiplication is performed; it suffices to know that it is a standard operation that can be carried out on any reasonably powerful graphing calculator or computer algebra system, such as MATLAB.

The vector [q1, . . . , qn] in the above observation is called the steady state or equilibrium vector of the process. It has a very intuitive interpretation: The matrix


multiplication equation [p1, . . . , pn] ◦ M = [r1, . . . , rn] says that if we have a long sequence of characters in which c1 is present in proportion p1, c2 in proportion p2, . . . , cn in proportion pn, and we let the sequence evolve for T units of time, then in the resulting sequence c1 will be present in proportion r1, c2 in proportion r2, . . . , cn in proportion rn. Thus equation (1) tells us that evolution will not change the proportions of characters as long as these proportions are in the steady state. Equation (2) tells us that if we start in any state and let the process run long enough, it will spend a proportion of about qi of the time in state i. Alternatively, if we start with a long enough sequence of characters and let it evolve long enough, then the proportions of characters in the sequence will get very close to the numbers in the steady state vector. For this reason, we will refer to the numbers qi in the equilibrium vector as the target frequencies of the process.

This leads us to an assumption that often underlies the construction of scoring matrices for sequence comparison and alignment and that is seldom clearly spelled out: it is assumed that the observed character frequencies in the data set used to construct the scoring matrix are the target frequencies of the mutation probability matrix for the sequences to be scored.

A Markov process is reversible if it looks the same whether run forwards or backwards. For example, the process described by matrix (3) below is obviously reversible; the process described by matrix (4) is not.

(3)

  .6 .2 .2 .2 .6 .2 .2 .2 .6

(4)

  .5 .5 0  0 .5 .5 .5 0 .5

Theorem 1. Let M be a mutation probability matrix with target frequencies [q1, . . . , qn]. Then M describes a reversible Markov process if and only if for all 1 ≤ i, j ≤ n we have

(5)

qi mij = qj mji.
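Theorem 1 turns reversibility into a finite check. As a sketch (not in the original note), the detailed-balance condition can be tested directly for matrices (3) and (4); both matrices are doubly stochastic, so their steady-state vector is the uniform [1/3, 1/3, 1/3].

```python
# Sketch: detailed-balance test from Theorem 1, applied to matrices (3) and (4).

def reversible(M, q, tol=1e-12):
    """Check q_i * m_ij == q_j * m_ji for all i, j (equation (5))."""
    n = len(M)
    return all(abs(q[i] * M[i][j] - q[j] * M[j][i]) <= tol
               for i in range(n) for j in range(n))

q = [1/3, 1/3, 1/3]   # uniform steady state: both matrices are doubly stochastic
M3 = [[.6, .2, .2], [.2, .6, .2], [.2, .2, .6]]
M4 = [[.5, .5, 0], [0, .5, .5], [.5, 0, .5]]

print(reversible(M3, q))   # matrix (3) is reversible
print(reversible(M4, q))   # matrix (4) is not
```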

A priori there seems to be no biological reason to assume that molecular evolution is a reversible Markov process. However, the assumption of time-reversibility simplifies the study of molecular evolution (see e.g. [5], page 69), and it is made for this very reason. The assumption is almost certainly wrong. However, as long as we always compare two sequences of different extant organisms, we are really looking at evolutionary time having run backwards from one organism to the last common ancestor and then forwards to the other organism (see below), and since our treatment of the two organisms is symmetric, the assumption of time-reversibility should lead to realistic results.

As far as I can see, in the original construction of the PAM matrices in [2], time-reversibility was not assumed. The assumptions about the evolutionary process that underlie the construction in [2] are not clearly spelled out in the paper, and the implicit assumptions that I see there do not seem more plausible to me than the assumption of time-reversibility. In [3], Dan Gusfield gives a description of the PAM


matrices that in his words (page 383) “roughly, but not exactly” reflects Dayhoff’s construction. Gusfield’s exposition does assume time-reversibility (implicitly), and I will base the following exposition on Gusfield’s description.

Now let us define what “PAM” means. The acronym stands for percent accepted mutation. Ideally, two sequences s and t are defined as being one PAM unit diverged if a series of accepted point mutations (and no indels) has converted s to t with an average of one accepted point mutation per one hundred sequence positions. The term “accepted” here means a mutation that was incorporated into the molecular sequence and passed on to its progeny. Note that in very long sequences that are one PAM unit diverged we will see slightly less than 1% character substitutions, since some loci may have undergone multiple mutations. Fortunately, if the distance is only a few PAMs, the discrepancy is very small and can be ignored in the construction of our matrices. In the original construction of PAM matrices, Dayhoff started with pairs of protein sequences that were each at most 15% diverged.

Here we will explain the process in a simplified way. While the real families of PAM (and BLOSUM) matrices are used for scoring amino acid sequences, we will construct here a family of “baby-PAM” matrices for nucleotide sequences as a means of illustrating the construction. Let us assume we have a collection of perfectly aligned pairs of nucleotide sequences such that in each pair we see character substitutions at about 2% of the loci. We may think of each pair of sequences in this family as being 2 PAMs diverged. Now assume that by counting the observed percentages of character substitutions in this family, we get the following character substitution matrix A2 = [aij], 1 ≤ i, j ≤ 4:

(6)

A A .195  C .001  G .003 T .001

C .001 .295 .001 .003

G .003 .001 .295 .001

 T .001  .003  .001 .195

The meaning of the entries in the above matrix is the following: picking randomly a pair <s, t> of sequences in our collection and picking randomly a locus k, the probability that s[k] = A = t[k] is 0.195; the probability that s[k] = A and t[k] = C is 0.001; etc. Note that the probabilities for pairs of different characters sum to 0.02. This is why we say that the sequences are roughly 2 PAM units diverged. Let us also assume that the average C-G content in our collection of sequences is 60%, and that A’s are as frequent as T’s and C’s are as frequent as G’s.

The first step in our construction of “baby-PAM” matrices will be to reconstruct the mutation probability matrix M2 that gave rise to the matrix A2. Note that the letters s[k], t[k] at a given locus in two of our sequences are derived from a letter r[k] of an ancestral sequence r, which will have an evolutionary distance of about 1 PAM from each of the observed sequences s and t. Now here is a beautiful consequence of the assumption that molecular evolution is a time-reversible process: this detail does not matter! We can think of s evolving into t by going back through the ancestral sequence r and then moving forward in time to become t, or we can think of t evolving backwards into r and then into s, and the derived mutation probability matrix will always be the same.
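As a quick consistency check (a sketch, not part of the note), the bookkeeping behind matrix (6) can be verified directly: the off-diagonal entries sum to 0.02, and the row sums give the base frequencies [.2, .3, .3, .2], i.e. a C-G content of 60% with A as frequent as T and C as frequent as G.

```python
# Sketch: checking the stated properties of the substitution matrix (6).

A2 = [[.195, .001, .003, .001],
      [.001, .295, .001, .003],
      [.003, .001, .295, .001],
      [.001, .003, .001, .195]]

# Probability mass on mismatched pairs: should be about 2%.
mismatch = sum(A2[i][j] for i in range(4) for j in range(4) if i != j)

# Marginal frequency of A, C, G, T, and the C-G content.
freqs = [sum(row) for row in A2]
cg_content = freqs[1] + freqs[2]

print(round(mismatch, 3))
print([round(f, 2) for f in freqs])
print(round(cg_content, 2))
```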


Note that by our other assumption, the target frequencies for the matrix M2 will be [q1, q2, q3, q4] = [0.2, 0.3, 0.3, 0.2]. Let us now treat s as the “ancestral” sequence. Then the probability a12 of s[k] = A and t[k] = C is equal to 0.001. This probability must be equal to q1 m12, where m12 is the probability that a given A mutates to C. Thus we can calculate m12 = a12/q1 = 0.001/0.2 = 0.005. In general we will have mij = aij/qi. These calculations lead to the following matrix M2 of character mutation probabilities:

(7)

     A       C       G       T
A  .975    .005    .015    .005
C  .00333  .98333  .00333  .01
G  .01     .00333  .98333  .00333
T  .005    .015    .005    .975
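The division mij = aij/qi is easy to automate. Here is a sketch (not in the original note) that recovers M2 from A2 and the target frequencies; it also confirms that each row of M2 sums to 1 and that M2 satisfies the detailed-balance condition of Theorem 1.

```python
# Sketch: deriving the mutation probability matrix M2 from A2 via m_ij = a_ij / q_i.

A2 = [[.195, .001, .003, .001],
      [.001, .295, .001, .003],
      [.003, .001, .295, .001],
      [.001, .003, .001, .195]]
q = [.2, .3, .3, .2]

M2 = [[A2[i][j] / q[i] for j in range(4)] for i in range(4)]

print([round(m, 5) for m in M2[0]])              # row A of matrix (7)
print([round(sum(row), 6) for row in M2])        # each row sums to 1
```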

Now here comes the big question: how do we derive scoring matrices for distantly related sequences from data about closely related sequences? In the PAM model, this problem is solved as follows. Evolutionary changes over a long period happen one generation at a time. Thus if we know the matrix M2 of character mutation probabilities for sequences that are 2 PAMs apart, then we can construct the matrix M4 of character mutation probabilities for sequences that are 4 PAMs apart by taking the product M2 ◦ M2. In general, the matrix M2k of character mutation probabilities for sequences that are 2k PAMs apart can be obtained by taking the k-th power of M2 (that is, multiplying M2 by itself k times). Thus M2k = (M2)^k. For example, the character mutation probability matrix for constructing our “baby-PAM120” will be the matrix M120 = (M2)^60, which looks as follows:

(8)

A A .3323  C .1264  G .2212 T .1461

C .1896 .4903 .1619 .3319

G .3318 .1619 .4903 .1896

 T .1462  .2214  .1266 .3324

The character mutation probability matrix for constructing our “baby-PAM250” will be the matrix M250 = (M2)^125, which looks as follows:

(9)

A A .2261  C .1743  G .2199 T .1825

C .2615 .3589 .2469 .3298

G .3298 .2468 .3588 .2614

 T .1827  .2201  .1745 .2263
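The matrix powers above are easy to reproduce. The sketch below (not part of the note) rebuilds M2 from A2 at full precision and raises it to the 60th and 125th powers; the results agree with matrices (8) and (9) to within the rounding of the printed entries.

```python
# Sketch: computing M120 = (M2)^60 and M250 = (M2)^125 by repeated multiplication.

A2 = [[.195, .001, .003, .001],
      [.001, .295, .001, .003],
      [.003, .001, .295, .001],
      [.001, .003, .001, .195]]
q = [.2, .3, .3, .2]
M2 = [[A2[i][j] / q[i] for j in range(4)] for i in range(4)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matpow(M, k):
    R = [[float(i == j) for j in range(len(M))] for i in range(len(M))]
    for _ in range(k):
        R = matmul(R, M)
    return R

M120 = matpow(M2, 60)    # 60 steps of 2 PAM each = 120 PAM
M250 = matpow(M2, 125)   # 125 steps of 2 PAM each = 250 PAM

print([round(m, 4) for m in M120[0]])   # compare with row A of matrix (8)
print([round(m, 4) for m in M250[0]])   # compare with row A of matrix (9)
```

The small differences from the printed matrices come from carrying full precision instead of the rounded entries of matrix (7).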

As a next step in the construction of “baby-PAM120” we must reconstruct the character substitution matrix A120 from the character mutation probability matrix M120. Entry aij of A120 will be the probability that in two sequences s and t that have an evolutionary distance of 120 PAM, at a randomly chosen locus k, we will have s[k] = ci and t[k] = cj. If we treat s as the “ancestral” sequence (as the assumption of time-reversibility allows us to do), then we can get aij by multiplying qi (the probability of finding ci in position s[k]) by mij (the probability that ci mutates to cj). In other words, A120 can be obtained by multiplying the ‘C’ and ‘G’


rows of M120 by 0.3 and the ‘A’ and ‘T’ rows of M120 by 0.2. We get the following matrix A120:

(10)

A A .06646  C .03792  G .06636 T .02924

C .03792 .14709 .04857 .06642

G .06636 .04857 .14709 .03798

 T .02924  .06642  .03798 .06646

Note that the above matrix is slightly inaccurate; in particular, the probabilities for A-G pairs and C-T pairs should be exactly the same. The discrepancy is due to rounding errors in the process of matrix multiplication. (The matrices (8) and (9) suffer from the same defect.)

Finally, we are ready to construct the “baby-PAM” scoring matrix S120 itself. Recall that an entry sij of the latter matrix should be the log-odds score comparing the probability of finding (ci, cj) in a correctly aligned column with the probability of finding (ci, cj) in randomly aligned sequences. In other words, we should have sij = log2(aij/(pi pj)). For example, in our case, s12 = log2(.03792/((.2)(.3))). Thus we get the matrix:

(11)

      A      C      G      T
A   .73   −.66    .15   −.46
C  −.66    .71   −.88    .15
G   .15   −.88    .71   −.65
T  −.46    .15   −.65    .73
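The whole pipeline from M120 to the scoring matrix can be sketched in a few lines (again, not part of the original note): form A120 by scaling the rows of M120 by the target frequencies, take log-odds in bits, and, as a preview of the quantities discussed next, compute the expected random-pair score and the relative entropy. With full-precision arithmetic the scores agree with matrix (11) to about two decimal places.

```python
import math

# Sketch: A120 from M120, then the bit-score matrix s_ij = log2(a_ij / (q_i q_j)).

A2 = [[.195, .001, .003, .001],
      [.001, .295, .001, .003],
      [.003, .001, .295, .001],
      [.001, .003, .001, .195]]
q = [.2, .3, .3, .2]
M2 = [[A2[i][j] / q[i] for j in range(4)] for i in range(4)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matpow(M, k):
    R = [[float(i == j) for j in range(len(M))] for i in range(len(M))]
    for _ in range(k):
        R = matmul(R, M)
    return R

M120 = matpow(M2, 60)
A120 = [[q[i] * M120[i][j] for j in range(4)] for i in range(4)]
S120 = [[math.log2(A120[i][j] / (q[i] * q[j])) for j in range(4)] for i in range(4)]

print([round(s, 2) for s in S120[0]])   # compare with row A of matrix (11)

# Expected score of a random character pair, and relative entropy (in bits):
E = sum(S120[i][j] * q[i] * q[j] for i in range(4) for j in range(4))
H = sum(S120[i][j] * A120[i][j] for i in range(4) for j in range(4))
print(round(E, 3), round(H, 3))
```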

Let us remark that in the real PAM matrices, the scores are multiplied by two and rounded to the nearest integer.

Let us look at two very important characteristics of the matrix (11). First let us ask ourselves: if we score an ungapped alignment of two random strings of length m each with matrix (11), what is the expected value of the alignment score? This expected value can be calculated as m times the expected score for a pair of random loci. The formula for the latter is

E = Σ_{1 ≤ i,j ≤ n} sij qi qj.

In our case, this means we have to calculate the sum (.73)(.2)^2 + (−.66)(.2)(.3) + (.15)(.2)(.3) + · · · + (.73)(.2)^2, which is equal to −.1302. It is not hard to see why the average score for a column in a random alignment should be negative: if the average score for a random character pair were positive, then the scores for random extensions of a local alignment by random letters would tend to rise rather than fall, which would result in a lot of long, completely spurious local alignments.

Now let us consider the related quantity

H = Σ_{1 ≤ i,j ≤ n} sij aij.

This is the average score of a correctly aligned character pair. Since our matrix (11) gives the scores in bits, we can think of H as the average amount of information


supplied by a correctly aligned character pair. Accordingly, H is known as the (relative) entropy of the scoring matrix. For scoring matrix (11) the entropy is equal to H = (.73)(.06646) + (−.66)(.03792) + · · · + (.73)(.06646) = .134.

The entropy of a scoring matrix allows us to answer the following question: given a query sequence of length m and a database of length ℓ, how many letters in a local alignment will on average be needed to reach statistical significance? One can show that the score of a significant local alignment should be at least log2(m) + log2(ℓ) bits. If one correctly aligned character pair contributes on average H bits of information, then a local alignment with statistically significant score should be at least (log2(m) + log2(ℓ))/H character pairs long. For example, if we search a nucleotide sequence of length 1,000 bp against the whole human genome (3 × 10^9 bp) using the scoring matrix (11), then a statistically significant local alignment would usually have to be at least (log2(1000) + log2(3 × 10^9))/.134 ≈ 309 bp long.

Homework 1: Find the mutation probability matrix M40 that can be derived from the matrix M2 given above and that corresponds to an evolutionary distance of 40 PAMs.

Homework 2: (1) Construct the character substitution matrix A250 that corresponds to the character mutation probability matrix M250 given above. (2) Use A250 to construct the corresponding “baby-PAM250” scoring matrix S250 with scores given in bits. (3) Find the expected score for random character mismatches and the entropy of S250.

References

[1] S. Altschul. Amino acid substitution matrices from an information-theoretic perspective. J. of Mol. Biol., 219:555-565, 1991.
[2] M. O. Dayhoff, R. M. Schwartz and B. C. Orcutt. A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure, 5:345-352, 1978.
[3] D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1997.
[4] S. Henikoff and J. G. Henikoff. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA, 89:10915-10919, 1992.
[5] W.-H. Li. Molecular Evolution. Sinauer Associates, 1997.

Department of Mathematics, Ohio University, Athens, Ohio 45701, U.S.A.
E-mail address: [email protected]