Sequence alignment

Author: Barrie Sutton

3/11/2009

Sequence alignment, Miryam Venegas-Anaya, 2009

Sequence alignment

In bioinformatics, the process of comparing sequences of DNA, RNA, or protein that share similarities as a consequence of their functional, structural, or evolutionary relationships is referred to as sequence alignment (Graur D and Li WH, 2000; Morrison D, 2007; Murphy R, 2008; Bradley R, 2009).

Goals of sequence alignment

• Testing the hypothesis of homology
  • Homology
  • Homoplasy
  • Mutation distance
• Structure prediction
  • Catalytic sites
  • Structurally significant regions
• Database searching
• Primer design

(David A. Morrison, 2007; Bradley R, 2009)

To infer phylogenies, homologous sequences must be compared in a way that lets us identify homologous and derived regions among the sequences. Typically, sequence alignments are represented in text format or graphically.

The dot matrix technique for sequence alignment (Brown TA, 2002)

http://www.geneious.com/assets/demonstrations/part4.html

Important issues

• How are sequences compared?
• What types of parameters are used for sequence comparison?
• What methods are used for sequence comparison?
• Is the result significant mathematically? Biologically?

Methods for sequence comparison

Pairwise alignment
• Visual inspection
• Dot matrix
• Alignment algorithms (dynamic programming)
  • Global alignment
  • Local alignment
• Distance score and similarity score
• Word methods

Multiple sequence alignment
• Dynamic programming
• Progressive methods
• Iterative methods

Structural alignment

Pairwise alignment: visual inspection

Matched bases (similarities) = x
Mismatched bases (substitutions) = y
Gaps (null bases) = z

Unaligned sequences:
TTAGACGAGTG (length = n)
TTGGAGCTG (length = m)

Aligned sequences:
TTAGACGAGTG
TTGGA-GC-TG

$n + m = 2(x + y) + z$
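The counting rule above can be checked mechanically. The following is a minimal Python sketch, assuming gaps are written as "-" and that no column pairs a gap with a gap; the function name `count_columns` is a hypothetical helper, not something from the slides.

```python
def count_columns(top, bottom, gap="-"):
    """Count matches (x), mismatches (y) and gap columns (z) of a pairwise alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    x = y = z = 0
    for a, b in zip(top, bottom):
        if a == gap or b == gap:
            z += 1      # a base paired with a null base
        elif a == b:
            x += 1      # matched bases
        else:
            y += 1      # mismatched bases (substitution)
    return x, y, z


top = "TTAGACGAGTG"
bottom = "TTGGA-GC-TG"
x, y, z = count_columns(top, bottom)
n = len(top.replace("-", ""))       # length of the first unaligned sequence
m = len(bottom.replace("-", ""))    # length of the second unaligned sequence
print(x, y, z)                      # 7 2 2
print(n + m == 2 * (x + y) + z)     # True
```

Every matched or mismatched column consumes one base from each sequence and every gap column consumes one base in total, which is why the identity holds.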

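Dot matrix (Gibbs and McIntyre, 1970)

[Figure: dot matrix comparing the sequences ATGCGTCGTT and ATCCGCGTC, with two candidate alignments read from it. Legend: matched bases, mismatched bases, gaps.]

Alignment 1:
AT__.GCGTCGTT
ATCCGCGTC____

Alignment 2:
ATGCGTCGTT
ATCCG_CGTC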
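A dot matrix can be built by hand or with a few lines of code. The sketch below is illustrative only: it marks every cell where the base in one sequence equals the base in the other, using the two sequences shown in the figure. The function names and the "*" marker are assumptions, and no window or stringency filtering is applied.

```python
def dot_matrix(seq_a, seq_b):
    """Grid with '*' wherever a base of seq_a (rows) equals a base of seq_b (columns)."""
    return [["*" if a == b else " " for b in seq_b] for a in seq_a]


def render(seq_a, seq_b):
    """Text rendering of the dot matrix with both sequences as axis labels."""
    lines = ["  " + " ".join(seq_b)]
    for base, row in zip(seq_a, dot_matrix(seq_a, seq_b)):
        lines.append(base + " " + " ".join(row))
    return "\n".join(lines)


print(render("ATGCGTCGTT", "ATCCGCGTC"))
```

Runs of dots along a diagonal indicate stretches that align without gaps; a shift from one diagonal to another corresponds to a gap in one of the sequences.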

Distance score and similarity score

"The optimal alignment is the one in which the number of mismatches and gaps are minimized according to certain criteria" (Graur and Li, 2000)

• Matched bases (homology)
• Mismatched bases (substitutions: transitions and transversions)
• Gap penalties (indels: insertions and deletions)

Distance score (dissimilarity index): $D = \sum_i m_i y_i + \sum_k w_k z_k$

Similarity score (similarity index): $S = x - \sum_k w_k z_k$

where x = number of matches; $y_i$ = number of mismatches of type i; $m_i$ = the corresponding mismatch penalty; $z_k$ = number of gaps of length k; $w_k$ = a positive penalty for gaps of length k, with $w_k = a + bk$, where a is the penalty for a gap and b is the penalty for the length of the gap.

Alignment I
TCAGACGAGTG
TCGGA--GCTG

Alignment II
TCAGACGAGTG
TCGGA-GC-TG

$S = x - \sum_k w_k z_k$

x = number of matched pairs
$w_k = a + bk$ = penalty for a gap of k nucleotides
$z_k$ = number of gaps of length k

If a (penalty for a gap) = 1 and b (penalty for the length of the gap) = 2:

Alignment I: S = 6 - [1 + (2 × 2)] × 1 = 1
Alignment II: S = 7 - [1 + (2 × 1)] × 2 = 1

This method identifies the best alignment according to a set of criteria. With these penalty values the two alignments score equally, so a different choice of a and b is needed to prefer one over the other.
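The similarity score can be recomputed for both alignments with a short script. This is a sketch under stated assumptions: gaps are written as "-", each run of consecutive gap characters counts as one gap of length k, and `similarity_score` is a hypothetical helper name. The default penalties a = 1 and b = 2 are the values used in the example.

```python
from itertools import groupby


def similarity_score(top, bottom, a=1, b=2, gap="-"):
    """S = x - sum_k w_k z_k, with gap penalty w_k = a + b*k."""
    # x: columns where both sequences carry the same base
    x = sum(1 for p, q in zip(top, bottom) if p == q and p != gap)
    penalty = 0
    for seq in (top, bottom):
        for is_gap, run in groupby(seq, key=lambda c: c == gap):
            if is_gap:
                k = len(list(run))      # length of this gap
                penalty += a + b * k    # w_k for one gap of length k
    return x - penalty


score_1 = similarity_score("TCAGACGAGTG", "TCGGA--GCTG")  # Alignment I
score_2 = similarity_score("TCAGACGAGTG", "TCGGA-GC-TG")  # Alignment II
print(score_1, score_2)
```

Changing a and b shows how the ranking of candidate alignments depends on the penalty scheme: a larger a favors one long gap, a larger b favors several short ones.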

Alignment algorithms

Global alignment
Finds the optimal alignment over the full length of both sequences using the Needleman-Wunsch algorithm (1970). It is most useful when the sequences are similar and of roughly equal length.

LGPSSKQTGKGS-SRIWDN
|    |  |||   |  |
LN-ITKSAGKGAIMRLGDA

Local alignment
Finds the best-matching subregions of two sequences using the Smith-Waterman algorithm (1981). It is most useful for dissimilar sequences that share regions of similarity.

--------GKG-------
        |||
--------GKG-------

Dynamic programming

Needleman-Wunsch algorithm (1970) and Smith-Waterman algorithm (1981)
• Pairwise comparison.
• Calculation of the similarity index for each pair of sequences compared.
• A new table that stores the alignment score (pointer) and the vector connecting the pointers.
• Obtaining the optimal alignment by connecting sequences by their score.
• Traceback stage or path graph.

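$$
S_{i,j} = \max
\begin{cases}
S_{i-1,\,j-1} + s(a_i, b_j) \\
\max_{x \ge 1} \left( S_{i-x,\,j} - w_x \right) \\
\max_{y \ge 1} \left( S_{i,\,j-y} - w_y \right)
\end{cases}
$$

[Figure: cell (i, j) of the scoring matrix receives its value from the diagonal cell (i-1, j-1), from a cell (i-x, j) in the same column, or from a cell (i, j-y) in the same row.]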
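The recurrence and the traceback stage can be sketched in Python as follows. This is a direct, unoptimized illustration of the global (Needleman-Wunsch style) case, not a production implementation. The substitution score (match = 1, mismatch = 0) and the gap penalty $w_k = a + bk$ with a = 1, b = 2 are assumptions chosen to agree with the similarity score used earlier; the function name is hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap_open=1, gap_extend=2):
    """Global alignment with a general gap penalty; returns (score, top, bottom)."""
    def w(k):                       # gap penalty w_k = a + b*k (assumed values)
        return gap_open + gap_extend * k

    def s(p, q):                    # assumed substitution score s(a_i, b_j)
        return match if p == q else mismatch

    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):       # first column: one gap of length i
        S[i][0], ptr[i][0] = -w(i), ("up", i)
    for j in range(1, m + 1):       # first row: one gap of length j
        S[0][j], ptr[0][j] = -w(j), ("left", j)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best, move = S[i - 1][j - 1] + s(a[i - 1], b[j - 1]), ("diag", 1)
            for x in range(1, i + 1):           # gap of length x in b
                if S[i - x][j] - w(x) > best:
                    best, move = S[i - x][j] - w(x), ("up", x)
            for y in range(1, j + 1):           # gap of length y in a
                if S[i][j - y] - w(y) > best:
                    best, move = S[i][j - y] - w(y), ("left", y)
            S[i][j], ptr[i][j] = best, move

    # Traceback: follow the pointers from (n, m) back to (0, 0).
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        kind, k = ptr[i][j]
        if kind == "diag":
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif kind == "up":
            top.extend(reversed(a[i - k:i]))
            bottom.extend("-" * k)
            i -= k
        else:
            top.extend("-" * k)
            bottom.extend(reversed(b[j - k:j]))
            j -= k
    return S[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("TCAGACGAGTG", "TCGGAGCTG")
print(score)
print(top)
print(bottom)
```

Evaluating the inner maxima over all x and y makes the running time cubic in sequence length; practical implementations restrict the gap penalty to a linear or affine form so that only neighbouring cells need to be inspected. A local (Smith-Waterman style) variant would additionally floor every cell at zero and start the traceback from the highest-scoring cell.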

Word methods

• K-tuple method
• Heuristic method
• Does not guarantee the optimal alignment
• More efficient than dynamic programming (mathematically optimal) methods
• Compares a series of short subsequences (words) of the query sequence with a database
• BLAST, FASTA family
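The word-matching idea can be illustrated with a small sketch: index every word of length k in a target sequence, look up each word of the query, and count hits per diagonal. This is a simplified illustration of the k-tuple principle only; it is not the actual BLAST or FASTA procedure, and the function names and the choice k = 3 are assumptions.

```python
from collections import Counter, defaultdict


def index_words(seq, k):
    """Map every word of length k in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def word_hits(query, target, k=3):
    """All (query_pos, target_pos) pairs where the same k-letter word starts."""
    index = index_words(target, k)
    hits = []
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], []):
            hits.append((q, t))
    return hits


def best_diagonal(hits):
    """Most populated diagonal as (target_pos - query_pos, number_of_hits)."""
    counts = Counter(t - q for q, t in hits)
    return counts.most_common(1)[0] if counts else None


hits = word_hits("ATGCGTCGTT", "ATCCGCGTC", k=3)
print(hits)                 # [(2, 4), (3, 5), (4, 6), (6, 5)]
print(best_diagonal(hits))  # (2, 3)
```

Only the regions around well-populated diagonals would then be passed to a more expensive alignment step, which is where the speed advantage over full dynamic programming comes from.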

Multiple sequence alignment

Dynamic programming (mathematically optimal alignment)

• Needleman-Wunsch algorithm (1970) and Smith-Waterman algorithm (1981)
• Computationally expensive
• Needs a matrix of n dimensions, where n is the number of sequences
• Gives the optimal global or local alignment
• The "sum of pairs" objective function reduces computational demands (MSA)
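A sum-of-pairs objective scores a multiple alignment by adding up the scores of every pair of rows, column by column. The sketch below shows the idea; the scoring values (match = 1, mismatch = 0, gap = -1, gap against gap = 0) and the function names are assumptions for illustration, not the parameters of any particular program.

```python
from itertools import combinations


def pair_score(p, q, match=1, mismatch=0, gap=-1):
    """Score one pair of aligned characters (assumed scoring scheme)."""
    if p == "-" and q == "-":
        return 0                # two null bases carry no information
    if p == "-" or q == "-":
        return gap
    return match if p == q else mismatch


def sum_of_pairs(alignment, **scoring):
    """Sum the pair scores over every column and every pair of rows."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    return sum(
        pair_score(p, q, **scoring)
        for column in zip(*alignment)
        for p, q in combinations(column, 2)
    )


print(sum_of_pairs(["ACGT", "AC-T", "AGGT"]))  # 6
```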

Progressive programming (Da-Fei Feng and Russell F. Doolittle, 1987)

History. Before 1987, most schemes for constructing trees from sequences used the Fitch and Margoliash (1967) scheme:

• Pairwise comparison of all sequences
• Assembling sequences by their differences
• A topology was found according to the similarities of the sequences

"Finding the correct tree should depend on assembling a matrix that best describe the differences among the sequences" (Feng and Doolittle, 1987)

Da-Fei Feng and Russell F. Doolittle scheme (1987)

[Flow chart; box labels in the order they appear in the figure]
• Progressive mode
• DFALIGN
• Binary mode (generates the multiple alignment and score matrix)
• SCORE (difference matrix)
• SHUFFLE (final score, Sreal, Sident, and Sran)
• BORD (preliminary order of the sequences)
• BORD (final order of the sequences)
• BLEN (branch lengths)
• BLEN distance matrix
• Pre-alignment (fill gaps with neutral element)
• TREE-plot: final dendrogram

Clustal family

Clustal W (Thompson et al., 1994)

Two major problems with the progressive approach:

• Local minimum problem: the optimal global solution may not be found because of errors in the initial alignments or because of an incorrect branching order.
• Choice of alignment parameters: one weight matrix and two gap penalties.

Heuristic method (Des Higgins, 1988)

Progressive multiple sequence alignment following the branching order of a neighbour-joining tree.

The basic multiple alignment algorithm:
1. All pairs of sequences are aligned separately using the k-tuple method in order to calculate a distance matrix.
2. A guide tree is calculated from the distance matrix.
3. The sequences are progressively aligned according to the branching order in the guide tree.

Gap penalties:
• Dependence on the weight matrix.
• Dependence on the similarity of the sequences.
• Dependence on the lengths of the sequences.
• Dependence on the difference in the lengths of the sequences.
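The first two steps (distance matrix, then guide tree) can be sketched as follows. Note the substitutions made for brevity: the distance is a simple word-set (Jaccard) distance rather than a score from actual pairwise alignments, and the tree is built by average-linkage clustering rather than neighbour-joining. All function names and the toy sequences are hypothetical, and step 3 (the progressive alignment itself) is not implemented here.

```python
from itertools import combinations


def ktuple_distance(a, b, k=2):
    """Assumed distance: 1 - (shared k-letter words / all k-letter words)."""
    words_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    words_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = words_a | words_b
    return 1 - len(words_a & words_b) / len(union) if union else 0.0


def guide_tree(seqs, k=2):
    """Average-linkage guide tree as nested tuples of sequence names."""
    names = list(seqs)
    dist = {
        frozenset(pair): ktuple_distance(seqs[pair[0]], seqs[pair[1]], k)
        for pair in combinations(names, 2)
    }
    clusters = [(name, [name]) for name in names]   # (subtree, leaf names)

    def average(c1, c2):
        total = sum(dist[frozenset((u, v))] for u in c1[1] for v in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        # Join the two closest clusters; this fixes the branching order.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters[0][0]


toy = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTTTGGGG"}
print(guide_tree(toy))  # ('s3', ('s1', 's2'))
```

The nested tuple gives the order in which a progressive method would align: the most similar pair first, then the remaining sequence against that pair.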

Functional and structural alignment

PRANK+F program (Löytynoja and Goldman, 2008)

CLUSTAL W alignment of gp120

The PRANK+F alignment

(Löytynoja and Goldman, 2008)
