Sequence alignment

Author: Barrie Sutton
3/11/2009

Sequence alignment
Miryam Venegas-Anaya, 2009

Sequence alignment

In bioinformatics, the process of comparing sequences of DNA, RNA, or protein that share similarities as a consequence of their functional, structural, or evolutionary relationships is referred to as sequence alignment (Graur D and Li WH, 2000; Morrison D, 2007; Murphy R, 2008; Bradley R, 2009).


Goals of sequence alignment

• Testing the hypothesis of homology
  • Homology
  • Homoplasy
• Mutation distance
• Structure prediction
  • Catalytic sites
  • Structurally significant regions
• Database searching
• Primer design

(David A. Morrison, 2007; Bradley R, 2009)

To infer phylogenies, homologous sequences must be compared in a way that lets us identify homologous and derived regions among the sequences. Sequence alignments are typically represented in text format or graphically.


The dot matrix technique for sequence alignment (Brown TA, 2002)

http://www.geneious.com/assets/demonstrations/part4.html

Important issues

• How are sequences compared?
• What types of parameters are used for sequence comparison?
• What methods are used for sequence comparison?
• Is the result significant mathematically? Biologically?


Methods for sequence comparison

• Pairwise alignment
  • Visual inspection
  • Dot matrix
  • Alignment algorithms (dynamic programming)
    • Global alignment
    • Local alignment
  • Distance score and similarity score
  • Word methods
• Multiple sequence alignment
  • Dynamic programming
  • Progressive methods
  • Iterative methods
• Structural alignment

Pairwise alignment
Visual inspection

Matched bases (similarities) = x
Mismatched bases (substitutions) = y
Gaps (null bases) = z

Sequences:
TTAGACGAGTG   (length = n)
TTGGAGCTG     (length = m)

Alignment:
TTAGACGAGTG
TTGGA-GC-TG

n + m = 2(x + y) + z
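The bookkeeping above can be checked mechanically. The following is a minimal Python sketch, not part of the original slides: the function name `count_columns` is hypothetical, and the gap placement `TTGGA-GC-TG` is an assumption (the slide prints the gaps as blanks). It counts x, y, and z column by column and verifies the identity n + m = 2(x + y) + z.

```python
def count_columns(top: str, bottom: str, gap: str = "-"):
    """Return (x, y, z): matched pairs, mismatched pairs, gap positions.

    Assumes both rows have equal length and that no column holds a gap
    in both rows (such a column would carry no base at all).
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    x = y = z = 0
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            z += 1      # a base paired with a null base
        elif a == b:
            x += 1      # matched pair
        else:
            y += 1      # mismatched pair (substitution)
    return x, y, z


top = "TTAGACGAGTG"
bottom = "TTGGA-GC-TG"   # assumed gap placement
x, y, z = count_columns(top, bottom)
n = len(top.replace("-", ""))
m = len(bottom.replace("-", ""))
print(x, y, z, n + m == 2 * (x + y) + z)
```

Each matched or mismatched column uses two bases and each gap column uses one, which is why the identity holds for any alignment of the two sequences.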


Dot matrix (Gibbs and McIntyre, 1970)

Alignment 1:
AT--GCGTCGTT
ATCCGCGTC---

Alignment 2:
ATGCGTCGTT
ATCCG-CGTC

(Table: matched bases, mismatched bases, and gaps for each alignment.)
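A dot matrix is simple to build by hand or in code. The sketch below is illustrative only: the function names are hypothetical, and the two sequences are assumed to be the ones in the alignments above with the gap characters removed. It marks every cell where the residue of one sequence equals the residue of the other, so runs of identity appear as diagonals.

```python
def dot_matrix(seq_a: str, seq_b: str):
    """Rows follow seq_a, columns follow seq_b; True marks identical residues."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(seq_a: str, seq_b: str) -> str:
    """Draw the matrix as text: '*' for a dot, '.' for an empty cell."""
    lines = ["  " + seq_b]
    for residue, row in zip(seq_a, dot_matrix(seq_a, seq_b)):
        lines.append(residue + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


# Sequences assumed from the slide's alignments, gaps removed.
print(render("ATGCGTCGTT", "ATCCGCGTC"))
```

Shifts between diagonal runs in the printed matrix correspond to the gaps that an alignment has to introduce.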

Distance score and similarity score

“The optimal alignment is the one in which the number of mismatches and gaps are minimized according to certain criteria” (Graur and Li, 2000)

• Matched bases (homology)
• Mismatched bases (substitutions: transitions and transversions)
• Gap penalties (indels: insertions and deletions)

Distance score (dissimilarity index):

$$D = \sum_i m_i y_i + \sum_k w_k z_k$$

Similarity score (similarity index):

$$S = x - \sum_k w_k z_k$$

where $x$ is the number of matches, $y_i$ is the number of mismatches of type $i$, $m_i$ is the penalty for a mismatch of type $i$, $z_k$ is the number of gaps of length $k$, and $w_k$ is a positive number giving the penalty for a gap of length $k$. With $w_k = a + bk$, $a$ is the penalty for the gap itself and $b$ is the penalty for the length of the gap.


Alignment I:
TCAGACGAGTG
TCGGA--GCTG

Alignment II:
TCAGACGAGTG
TCGGA-GC-TG

$S = x - \sum_k w_k z_k$

x = number of matched pairs
w_k = a + bk = penalty for a gap of k nucleotides
z_k = number of gaps of length k

If a (penalty for a gap) = 1 and b (penalty for the length of the gap) = 2:

Alignment I:  S = 6 - [(1 + 2 × 2) × 1] = 1
Alignment II: S = 7 - [(1 + 2 × 1) × 2] = 1

With these penalties the two alignments score equally: the extra match in Alignment II is cancelled by the cost of opening a second gap.

This method identifies the best alignment according to a set of criteria
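The similarity score can be computed directly from an aligned pair. The sketch below is a hedged illustration, not the original author's code: `similarity_score` is a hypothetical name, and the gap placements in the two alignments are assumptions reconstructed from the match and gap counts stated on the slide. It applies S = x - Σ w_k z_k with w_k = a + bk.

```python
import re


def similarity_score(top: str, bottom: str, a: int = 1, b: int = 2,
                     gap: str = "-") -> int:
    """S = x - sum_k w_k * z_k, with gap penalty w_k = a + b * k."""
    # x: columns where both rows carry the same base
    x = sum(1 for p, q in zip(top, bottom) if p == q and p != gap)
    penalty = 0
    for row in (top, bottom):
        # every maximal run of gap characters is one gap of length k
        for run in re.findall(re.escape(gap) + "+", row):
            penalty += a + b * len(run)
    return x - penalty


# Gap placements are assumptions consistent with the slide's counts.
alignment_1 = ("TCAGACGAGTG", "TCGGA--GCTG")   # one gap of length 2
alignment_2 = ("TCAGACGAGTG", "TCGGA-GC-TG")   # two gaps of length 1
print(similarity_score(*alignment_1), similarity_score(*alignment_2))
```

Changing `a` and `b` shifts the balance between opening new gaps and extending existing ones, which is the "set of criteria" the scoring depends on.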

Alignment algorithms

Global alignment
Finds the optimal solution over the entire length of both sequences, even long and very dissimilar ones, using the Needleman-Wunsch algorithm (1970).

LGPSSKQTGKGS-SRIWDN
|    |  |||   |  |
LN-ITKSAGKGAIMRLGDA

Local alignment
A general local alignment method, useful for aligning the most similar regions of two sequences. This method uses the Smith-Waterman algorithm (1981).

--------GKG--------
        |||
--------GKG--------
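To make the local case concrete, here is a minimal Smith-Waterman sketch in Python. It is an illustration under stated assumptions rather than a reference implementation: the scoring values (match = 2, mismatch = -1, linear gap = -2) are arbitrary choices, not values from the slides, and the function name is hypothetical. Scores are floored at zero, and the traceback starts at the highest-scoring cell and stops at the first zero.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1,
                   gap: int = -2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # the 0 lets a local alignment restart anywhere
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # traceback from the best cell until a zero cell is reached
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# The two protein sequences from the slide, gaps removed.
print(smith_waterman("LGPSSKQTGKGSSRIWDN", "LNITKSAGKGAIMRLGDA"))
```

Dropping the zero floor and starting the traceback from the bottom-right cell would turn the same table into a global (Needleman-Wunsch style) alignment.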


Dynamic programming
Needleman-Wunsch algorithm (1970) and Smith-Waterman algorithm (1981)

• Pairwise comparison.
• Calculation of the similarity index for each pair of sequences compared.
• New table that includes the alignment score (pointer) and the vector connecting the pointers.
• Obtaining the optimal alignment by connecting sequences by their score.
• Traceback stage or path graph.

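$$
S_{i,j} = \max
\begin{cases}
S_{i-1,\,j-1} + s(a_i, b_j) \\
\max_{x \ge 1} \left( S_{i-x,\,j} - w_x \right) \\
\max_{y \ge 1} \left( S_{i,\,j-y} - w_y \right)
\end{cases}
$$

(Diagram: cell (i, j) is reached from the diagonal cell with $S_{i-1,j-1} + s(a_i, b_j)$, from a cell x rows back in the same column with $S_{i-x,j} - w_x$, or from a cell y columns back in the same row with $S_{i,j-y} - w_y$.)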
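The recurrence can be turned into code almost line by line. The following Python sketch is a hedged illustration for the global case: the function name, the boundary conditions (a leading gap of length i costs w_i), and the scoring values (match = 1, mismatch = 0, w_k = a + bk with a = 1, b = 2) are assumptions chosen so that the table maximizes the similarity score S = x - Σ w_k z_k. Only the score is returned; a full implementation would also store a pointer per cell for the traceback stage.

```python
def global_score(seq_a: str, seq_b: str, match: int = 1, mismatch: int = 0,
                 a: int = 1, b: int = 2) -> int:
    """Fill S[i][j] with the recurrence above and return S[n][m]."""

    def s(p: str, q: str) -> int:      # substitution score s(a_i, b_j)
        return match if p == q else mismatch

    def w(k: int) -> int:              # penalty for a gap of length k
        return a + b * k

    n, m = len(seq_a), len(seq_b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # assumed boundary: one leading gap
        S[i][0] = -w(i)
    for j in range(1, m + 1):
        S[0][j] = -w(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i - 1][j - 1] + s(seq_a[i - 1], seq_b[j - 1])
            up = max(S[i - x][j] - w(x) for x in range(1, i + 1))
            left = max(S[i][j - y] - w(y) for y in range(1, j + 1))
            S[i][j] = max(diag, up, left)
    return S[n][m]


print(global_score("TCAGACGAGTG", "TCGGAGCTG"))
```

Because every gap length is tried at every cell, this form runs in roughly cubic time; with an affine penalty the same result can be obtained faster, but the version here mirrors the recurrence as written.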


Word methods

• K-tuple method
• Heuristic method
• Does not guarantee the optimal alignment
• More efficient than mathematical (dynamic programming) methods
• Compares a series of short subsequences (words) in the query sequence with a database
• BLAST and FASTA families
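The word-matching idea can be sketched in a few lines of Python. This is a simplified illustration only, not how BLAST or FASTA are actually implemented: the function names are hypothetical and the sequences are example data. It indexes every k-letter word of a subject sequence, looks up each word of the query, and reports the diagonal (subject position minus query position) with the most shared words, which is where a heuristic method would then concentrate its alignment effort.

```python
from collections import defaultdict


def word_index(seq: str, k: int):
    """Map every k-letter word in seq to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def word_hits(query: str, subject: str, k: int = 3):
    """Return (query position, subject position) for every shared word."""
    index = word_index(subject, k)
    hits = []
    for qpos in range(len(query) - k + 1):
        for spos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, spos))
    return hits


def best_diagonal(hits):
    """Return (diagonal offset, number of hits) for the densest diagonal."""
    counts = defaultdict(int)
    for qpos, spos in hits:
        counts[spos - qpos] += 1
    return max(counts.items(), key=lambda item: item[1]) if counts else None


hits = word_hits("ATGCGTCGTT", "ATCCGCGTC", k=3)
print(hits, best_diagonal(hits))
```

Only exact word matches are found, so a region with scattered substitutions can be missed entirely; this is the sense in which the method does not guarantee the optimal alignment.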

Multiple sequence alignment


Dynamic programming (mathematically optimal alignment)

• Needleman-Wunsch algorithm (1970) and Smith-Waterman algorithm (1981)
• Computationally expensive
• Needs a matrix of n dimensions, where n is the number of sequences
• Gives the optimal global or local alignment
• The "sum of pairs" objective function reduces computational demands (MSA)
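As an illustration of the sum-of-pairs objective, the Python sketch below scores a multiple alignment by adding up the scores of every pair of rows in every column. The scoring values (match = 1, mismatch = -1, gap = -2), the function name, and the three-row example alignment are all assumptions made for the example, not values from the slides.

```python
from itertools import combinations


def sum_of_pairs(alignment, match: int = 1, mismatch: int = -1,
                 gap: int = -2) -> int:
    """Sum pairwise column scores over all pairs of rows."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows must have the same length")
    total = 0
    for column in zip(*alignment):
        for p, q in combinations(column, 2):
            if p == "-" and q == "-":
                continue            # two gaps: nothing to compare
            if p == "-" or q == "-":
                total += gap
            elif p == q:
                total += match
            else:
                total += mismatch
    return total


# Hypothetical three-sequence alignment used only as example data.
print(sum_of_pairs(["ATGC", "A-GC", "ATCC"]))
```

The score depends only on pairwise comparisons within each column, so it can be evaluated without ever building the full n-dimensional matrix.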

Progressive programming
Da-Fei Feng and Russell F. Doolittle, 1987

• History. Before 1987, most schemes for constructing trees from sequences used the Fitch and Margoliash (1967) scheme:
  • Pairwise comparison of all sequences
  • Assembling sequences by their differences
  • A topology was found according to the similarities of the sequences

“Finding the correct tree should depend on assembling a matrix that best describe the differences among the sequences” (Feng and Doolittle, 1987)


Da-Fei Feng and Russell F. Doolittle scheme (1987)

(Flow chart. Box labels: Progressive mode; DFALIGN; Binary mode; generates the multiple alignment and score matrix; Score (difference matrix); SHUFFLE; final score, Sreal, Sident, and Sran; BORD (preliminary order of the sequences); BORD (final order of the sequences); BLEN (branch lengths); distance matrix; pre-alignment (fill gaps with neutral element); TREE (plot final dendrogram).)

Clustal family

• Clustal W (Thompson et al., 1994)
• Two major problems with the progressive approach:
  • Local minimum problem: the optimal global solution may not be found because of the initial alignments or because of an incorrect branching order.
  • Choice of alignment parameters: one weight matrix and two gap penalties.


Heuristic method (Des Higgins, 1988)

Multiple progressive sequence alignment following the branching order of a neighbour-joining tree.

The basic multiple alignment algorithm:
1. All pairs of sequences are aligned separately using the k-tuple method in order to calculate a distance matrix.
2. A guide tree is calculated from the distance matrix.
3. The sequences are progressively aligned according to the branching order in the guide tree.

Gap penalties:
• Dependence on the weight matrix.
• Dependence on the similarity of the sequences.
• Dependence on the lengths of the sequences.
• Dependence on the difference in the lengths of the sequences.
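Steps 1 and 2 can be sketched as follows. This is a deliberately simplified Python illustration, not the Clustal implementation: the distance is a crude shared-word measure, the tree is built by average-linkage clustering (a UPGMA-like stand-in, swapped in for brevity instead of neighbour-joining), and all function names and sequences are hypothetical. Step 3, the progressive alignment itself, is omitted.

```python
from itertools import combinations


def ktuples(seq: str, k: int):
    """Set of all k-letter words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def ktuple_distance(a: str, b: str, k: int = 2) -> float:
    """1 minus the fraction of shared words (assumed, simplified measure)."""
    words_a, words_b = ktuples(a, k), ktuples(b, k)
    if not words_a or not words_b:
        raise ValueError("sequences must be at least k letters long")
    return 1.0 - len(words_a & words_b) / min(len(words_a), len(words_b))


def guide_tree(seqs: dict, k: int = 2):
    """Average-linkage clustering; returns the tree as nested tuples."""
    names = sorted(seqs)
    dist = {frozenset((p, q)): ktuple_distance(seqs[p], seqs[q], k)
            for p, q in combinations(names, 2)}
    # each cluster is (subtree, list of member names)
    clusters = [(name, [name]) for name in names]

    def average(pair):
        (_, left), (_, right) = pair
        total = sum(dist[frozenset((p, q))] for p in left for q in right)
        return total / (len(left) * len(right))

    while len(clusters) > 1:
        first, second = min(combinations(clusters, 2), key=average)
        clusters.remove(first)
        clusters.remove(second)
        clusters.append(((first[0], second[0]), first[1] + second[1]))
    return clusters[0][0]


# Hypothetical example sequences.
example = {"A": "ATGCGTCGTT", "B": "ATGCGTCGTA", "C": "TTTTAAAACC"}
print(guide_tree(example))
```

The nested tuples give the branching order: the innermost pair would be aligned first, and each enclosing level adds a sequence or a group to the growing alignment.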

Functional and structural alignment
PRANK+F program (Löytynoja and Goldman, 2008)


CLUSTAL W alignment of gp120

The PRANK+F alignment

Löytynoja and Goldman, 2008
