Computational Problems in Evolution

Multiple Alignment, Genome Rearrangements, and Tree Reconstruction

ISAAC ELIAS

Doctoral Thesis Stockholm, Sweden 2006

TRITA CSC-A 2006-22 ISSN 1653-5723 ISRN KTH/CSC/A--06/22--SE ISBN 91-7178-511-6 ISBN 978-91-7178-511-4

KTH School of Computer Science and Communication SE-100 44 Stockholm SWEDEN

Academic dissertation which, with the permission of the Royal Institute of Technology (KTH), is submitted for public examination for the degree of Doctor of Technology in Computer Science on Monday, 15 January 2007, at 10:00 in FA32, Albanova, Roslagstullsbacken 21, Stockholm. © Isaac Elias, Nov 2006. Printed by Universitetsservice US AB.

Abstract

Reconstructing the evolutionary history of a set of species is a fundamental problem in biology. This thesis concerns computational problems that arise in different settings and stages of phylogenetic tree reconstruction, but also in other contexts. The contributions include:

• A new distance-based tree reconstruction method with optimal reconstruction radius and optimal runtime complexity. Included in the result is a greatly simplified proof that the NJ algorithm also has optimal reconstruction radius. (co-author Jens Lagergren)

• NP-hardness results for the most common variations of Multiple Alignment. In particular, it is shown that SP-score, Star Alignment, and Tree Alignment are NP-hard for all metric symbol distances over all binary or larger alphabets.

• A 1.375-approximation algorithm for Sorting By Transpositions (SBT). SBT is the problem of sorting a permutation using as few block-transpositions as possible. The complexity of this problem is still open, and it was a ten-year-old open problem to improve the best known 1.5-approximation ratio. The 1.375-approximation algorithm is based on a new upper bound on the diameter of 3-permutations. Moreover, a new lower bound on the transposition diameter of the symmetric group is presented, and the exact transposition diameter of simple permutations is determined. (co-author Tzvika Hartman)

• Approximation, fixed-parameter tractable, and fast heuristic algorithms for two variants of the Ancestral Maximum Likelihood (AML) problem: when the phylogenetic tree is known and when it is unknown. AML is the problem of reconstructing the most likely genetic sequences of extinct ancestors, along with the most likely mutation probabilities on the edges, given the phylogenetic tree and the sequences at the leaves. (co-author Tamir Tuller)

• An algorithm for computing the number of mutational events between aligned DNA sequences which is several hundred times faster than the famous Phylip package. Since pairwise distance estimation is a bottleneck in distance-based phylogeny reconstruction, the new algorithm improves the overall running time of many distance-based methods by a factor of several hundred. (co-author Jens Lagergren)

Contents

1 Introduction
  1.1 Basic Introduction to Computation
  1.2 Basic Biological Background

2 Introduction to Computational Problems
  2.1 Phylogenetic Tree Reconstruction using Distance Methods
  2.2 Tree Reconstruction based on Sequences that have Evolved through Substitutions
  2.3 Multiple Sequence Alignment - Local Point Mutations
  2.4 Genome Rearrangements - Global Mutations

3 Present Investigation
  3.1 Multiple Sequence Alignment is NP-hard
  3.2 Approximating Sorting By Transpositions
  3.3 Fast Neighbor Joining
  3.4 Fast Distance Estimation from Aligned Sequences
  3.5 Ancestral Sequence Reconstruction using Likelihood
  3.6 Conclusions and Open Problems

Bibliography


Acknowledgments

After four and a half years of studying and research I have finally come to the point where I can write down my thanks to all the people that have helped me. I have been part of the theoretical computer science group at KTH and the Stockholm Bioinformatics Center. Further, I have spent more than a year and a half in Israel, visiting Tel Aviv University and also the Technion. All over I have met people who have shown great kindness and who have helped me with my research, studies, and life.

The person I owe the greatest gratitude to is my advisor Jens Lagergren. Jens, you have given me freedom and always been there to give advice and to guide. You have understood my needs, both professional and personal, summarized them for me, and helped me get to where I am today. Without you this would never have happened. Tack så jättemycket!

To the people that I have been in contact with in the theory group. Johan Håstad, your classes have been a great inspiration; it has been a joy to be taught such complex subjects. Mikael Goldmann, your positive attitude in the classroom and to problem solving has been a shining example. To the other people: attending seminars and classes with smart and enthusiastic people like you has been a great experience. Thanks!

To the people at the Stockholm Bioinformatics Center. Bengt Sennblad, you always made yourself available to help me understand issues in models of evolution. Without you I would have been lost more than once. Erik Lindahl, your rapid and detailed explanations directly contributed to one of the papers. Ali Tofigh, you have on numerous occasions helped me with C++ and provided me with precise explanations of its workings. Thanks!

To the people in Israel. Benny Chor, without your kindness I would never have come to Israel, where I have found so much happiness. I am very grateful for all your help and shared interest in coffee and sailing.
Tzvika Hartman, I remember the first time we sat down doing research together. It was a great experience to make progress with our elusive problem. Tamir Tuller, your exceptional ability to always be "squeezed" while still being happy to do even more research is both impressive and inspiring. Also special thanks to Ron Pinter for the help during my last visit to Israel.

There are many people who have not been part of my academic life while still being of utmost importance to me. My parents, I love you! You have been there and given me a home away from the front. To the rest of my family and friends, I love you! Finally, wonderful, beautiful, loving Tali, you have given me more happiness than I have ever felt before.


Publications and Organization of the Thesis

This thesis is a summary of the papers listed below; the papers appear after this summary. The results are presented in three chapters. The first chapter contains a basic introduction to computation and biology. This is followed by an introduction to the problems that this thesis is concerned with. The last chapter contains a, to a large extent, self-contained presentation of most of the results included in the papers below.

• Settling the Intractability of Multiple Alignment [46, 45]
  I. Elias
  Journal of Computational Biology 2006. Conference version in Int. Symp. on Algorithms and Computation 2003.

• A 1.375-Approximation Algorithm for Sorting by Transpositions [48, 47]
  I. Elias and T. Hartman
  To appear in IEEE/ACM Trans. on Comp. Biology and Bioinformatics. Conference version in Workshop on Algorithms in Bioinformatics 2005.

• Fast Neighbor Joining [49]
  I. Elias and J. Lagergren
  Int. Coll. on Automata, Languages and Programming 2005.

• Fast Computation of Distance Estimators [50]
  I. Elias and J. Lagergren
  Submitted.

• Reconstruction of Ancestral Genomic Sequences Using Likelihood [51]
  I. Elias and T. Tuller
  Submitted.

Isaac has contributed at least half of the work in each of the papers.

Chapter 1

Introduction

This is a thesis in computer science, and it concerns the interdisciplinary field of computational biology. In this field, techniques from computer science are applied to solve problems inspired by biology. A central notion in biology is that of models of evolution, which describe the origin and descent of species. The major concern of this thesis is the computational problems that biologists are faced with when tracing evolution under different models. This chapter contains basic background to relevant issues of computation and biology.

1.1 Basic Introduction to Computation

A common problem in evolutionary biology is the reconstruction of the evolutionary history of a set of species, usually represented by a phylogenetic tree. One such example is the reconstruction of the phylogenetic tree of the great apes, see Figure 1.1. There is frequently little ancient information, and instead the evolutionary biologist has to rely on the biological features of the extant species, i.e., living species. Classic evolutionary biology dealt mainly with physical and morphologic traits, such as the shape and size of the skull, while modern evolutionary biology mainly uses information extracted from genetic material, such as DNA sequences.

The underlying assumption when reconstructing phylogenetic trees is that two species with a close common ancestor are more similar than two species with a more remote common ancestor. With regard to DNA sequences, similarity is measured with respect to a model of evolution which describes the type of mutations that cause the sequences to change over time. A simple similarity measure between two sequences is the number of mutations needed to transform one sequence into the other. In this variant, two species are considered similar if a small number of mutations can explain the differences in their DNA sequences.

Once a model of evolution has been chosen, the problem of reconstructing the correct tree becomes purely computational. With respect to the simple model above, the evolutionary biologist has to find the tree that explains the extant species


Figure 1.1: The phylogeny of the great apes (a tree on the leaves Orangutan, Gorilla, Human, and Chimpanzee).

using the least number of mutations. Such computational problems are called optimization problems; the objective for this particular problem is to find the tree that minimizes the number of mutations.

With regard to genetic sequences, the problem of computing the optimal tree is easier said than done. An algorithm has to be designed which takes genetic sequences as input and gives the optimal phylogenetic tree as output. Ideally, the algorithm should compute the optimal tree quickly. However, the amount of time it takes depends on the length of the input, i.e., the length of the genetic sequences times the number of species. The speed of an algorithm is, therefore, defined in relation to the length of the input, a.k.a. its runtime complexity. To a computer scientist, an algorithm is efficient if there is a polynomial f(x) which provides an upper bound on the time it takes to execute the algorithm on input of length x.

Unfortunately, under most models of evolution there are no (and probably never will be) efficient algorithms for computing the optimal tree. To a computer scientist this means that the problem is inherently intractable and no algorithm could possibly solve it quickly. Since a phylogenetic tree still needs to be reconstructed, the computer scientist does his best at designing an algorithm which does the job as well as possible. Sometimes there are efficient algorithms which return approximate solutions: solutions that have a guaranteed bound on how far from optimal they are. Many other times, especially for problems in evolutionary biology, heuristic algorithms are used that often work well, but in some cases can perform badly.
As Professor Richard Lipton stated it [86]: "In biology approximate results may be okay at times, algorithms that work well only on average may be okay, and even algorithms that do not work every time may be okay."

In general, while some problems are intractable, many other problems do have efficient algorithms that compute optimal solutions. However, if the input is large, an efficient algorithm is not always enough. In particular, many computational problems in evolutionary biology have such large inputs that the algorithms need to be as efficient as possible to be useful. That is, the algorithms should have a runtime complexity which is upper bounded by as small a polynomial as possible.


To conclude, when a computer scientist is faced with designing an algorithm for a computational problem there are typically four questions to be asked:

• What is the most efficient algorithm that can be designed?
• If no efficient algorithm exists, can the problem be proved intractable?
• Can an approximation algorithm be designed?
• Can a good heuristic algorithm be designed?

This thesis deals with computational problems arising in different settings of phylogenetic tree reconstruction. For these problems all the aforementioned issues have to be taken into consideration. For a more detailed introduction to the subjects of algorithms, complexity, and approximation the reader is referred to the books Introduction to Algorithms by Cormen et al. [35] and Complexity and Approximation by Ausiello et al. [8].

1.2 Basic Biological Background

This section is intended for people with little knowledge of biology, while the rest of the thesis is purely computational. It provides basic background to the underlying evolutionary problems that this thesis deals with. Some generalizations are made, and the reader is referred to the book Molecular Biology of the Cell [4] for a full and excellent exposition.

Our planet is filled with a multitude of single- and multi-cellular organisms reproducing themselves by transmitting their genetic information, the genome, to their descendants. Cells are the smallest self-reproducing units, and each cell contains one copy of the organism's genome stored in the form of deoxyribonucleic acid (DNA) molecules. The DNA molecule, Figure 1.2, is composed of the four nucleotides {A, C, G, T} and has the shape of a twisted ladder, the double-stranded helix. Each step in the ladder is formed by a pair of nucleotides bonding either as A-T (adenine-thymine) or as C-G (cytosine-guanine). The two strands in the helix are, therefore, complementary, i.e. one strand can be deduced from the other. For example, the strand "AGGGCT" is complementary to "TCCCGA". The double helical structure of DNA makes it particularly easy to replicate. DNA replication starts with the two DNA strands separating. This allows the enzyme DNA polymerase to sweep along each of the two strands and bring in new nucleotides to assemble the complementary strands. The result is two "exact" copies of the genome. Subsequently, the cell divides into two, each cell with one copy of the genome.

A common analogy is that the genome is a blueprint describing the biological features of the cell. However, rather than one huge blueprint, the genome is divided into several smaller blueprints called genes. For example, the human genome consists of about 6 billion base pairs, divided into 30,000-40,000 genes each of a


Figure 1.2: The DNA-helix.

few hundred to a few thousand base pairs. Genes convey genetic information for biological features but do not themselves perform functions. Instead, the nucleotide sequences of genes are translated into amino acid chains called proteins, which perform most of the functions in cells. For example, proteins act as enzymes catalyzing reactions, determine the shape and structure of the cell, generate movements, sense signals, etc. There are 20 amino acids, and the translation from the four-letter nucleotide sequences into amino acid sequences is determined by the rules of the genetic code. Each triplet of nucleotides, called a codon, specifies one amino acid as described in Table 1.1. Although they are sequentially assembled, the amino acids along the chain interact with each other. This causes the protein to fold into a complicated 3-dimensional structure with reactive sites on the surfaces. It is this structure and the reactive sites which determine the protein's function.

The conversion from genes to proteins is not immediate. It begins with the transcription of DNA into an intermediary messenger RNA molecule. Ribonucleic acid (RNA) is, like DNA, composed of four nucleotides. It is, however, single-stranded, and one of the bases, thymine, is replaced by uracil (U). Transcription of DNA into RNA is performed by the enzyme RNA polymerase, which moves along the DNA helix, temporarily separating the strands and at the same time assembling the complementary RNA strand. Subsequently, the mRNA is sent to a ribosome, a protein-synthesizing machine, which translates the RNA sequence into the associated amino acid chain. Continuing with our analogy above, the mRNA is a lightweight copy of the gene which is sent to a factory for production. The processes of DNA replication and creation of proteins from DNA are referred to as the central dogma. It holds in all types of self-reproducing organisms: eukaryota, bacteria, and archaea.¹
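Both replication and transcription assemble a complementary strand base by base. The pairing rules above can be sketched in a few lines of Python; the helper names are mine, and for simplicity strand orientation (the complement is normally read in reverse) is ignored:

```python
# Watson-Crick pairing for DNA, and the DNA -> RNA rule used in transcription.
DNA_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}
RNA_PAIR = {"A": "U", "T": "A", "C": "G", "G": "C"}

def complement(strand: str, pairing=DNA_PAIR) -> str:
    """Assemble the complementary strand under the given pairing rule."""
    return "".join(pairing[base] for base in strand)

print(complement("AGGGCT"))            # -> TCCCGA, the example from the text
print(complement("AGGGCT", RNA_PAIR))  # -> UCCCGA, as produced by transcription
```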
¹ Viruses are not self-reproducing organisms; they hijack the protein-synthesizing machinery of the host organism to reproduce.

In Figure 1.3 it is shown how gene expression works in eukaryotic cells, i.e., cells with a nucleus. Furthermore, it should be mentioned that

there are exceptions to the central dogma; a large number of RNA molecules perform functions themselves and are not translated into proteins. Eukaryotic cells store the genome in a nucleus, while bacteria and archaea do not have a nucleus. Another difference is that in bacteria and archaea the genome is organized into one large circular chromosome. In eukaryotic cells, however, which in general have larger genomes, the genome is divided into a few separate linear chromosomes. Furthermore, multi-cellular eukaryotic organisms are diploid, meaning that each chromosome occurs in two copies: one from each parent.

As was mentioned above, the human genome contains 6 billion base pairs and about 30,000-40,000 genes of no more than a few thousand base pairs each. Straightforward arithmetic shows that the majority of the genome is non-coding and does not result in proteins. Some of the non-coding regions have no known function and were previously called junk DNA. Other parts of the non-coding regions play an important role in the regulation of gene expression. These regions are usually short sequences, 5-30 nucleotides long, called regulatory elements.

Amino acid     DNA codons
Isoleucine     ATT, ATC, ATA
Valine         GTT, GTC, GTA, GTG
Methionine     ATG
Alanine        GCT, GCC, GCA, GCG
Proline        CCT, CCC, CCA, CCG
Serine         TCT, TCC, TCA, TCG, AGT, AGC
Tryptophan     TGG
Asparagine     AAT, AAC
Glutamic acid  GAA, GAG
Lysine         AAA, AAG
Leucine        CTT, CTC, CTA, CTG, TTA, TTG
Phenylalanine  TTT, TTC
Cysteine       TGT, TGC
Glycine        GGT, GGC, GGA, GGG
Threonine      ACT, ACC, ACA, ACG
Tyrosine       TAT, TAC
Glutamine      CAA, CAG
Histidine      CAT, CAC
Aspartic acid  GAT, GAC
Arginine       CGT, CGC, CGA, CGG, AGA, AGG

Table 1.1: The 20 amino acids and the DNA triplets coding for them.
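Translation by the genetic code amounts to a codon-by-codon table lookup. A minimal sketch, using only a few rows of Table 1.1 (the helper names are illustrative, not from the thesis):

```python
# A small subset of the genetic code from Table 1.1 (DNA codons -> amino acids).
CODON_TABLE = {
    "ATG": "Methionine", "TGG": "Tryptophan",
    "AAA": "Lysine", "AAG": "Lysine",
    "GAA": "Glutamic acid", "GAG": "Glutamic acid",
    "TTT": "Phenylalanine", "TTC": "Phenylalanine",
}

def translate(dna: str) -> list[str]:
    """Translate a coding DNA sequence one codon (triplet) at a time."""
    return [CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3)]

print(translate("ATGTTTAAA"))  # -> ['Methionine', 'Phenylalanine', 'Lysine']
```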

[Figure omitted in this extraction: within the cell, DNA in the nucleus is transcribed into RNA, which is exported to a ribosome and translated into protein.]

Figure 1.3: Gene expression in a eukaryote cell.


Mutations, Selection, and Evolution

Mutations in a cell's DNA occur randomly due to copying errors by the DNA polymerase and also due to exposure to radiation, chemicals, or viruses. Occasionally a mutation represents a change for the better; more probably it does not affect the cell; and in many cases it causes damage to the cell, which dies. This is reflected in the law of natural selection: mutations occur randomly, and the survival or extinction of the individual is determined by its ability to adapt to the environment. Through the course of time, mutations and selection change the genetic information, and the original species evolves, giving rise to new species. In multi-cellular organisms only the genetic material of the germ cells, e.g. eggs and sperm, is passed on to the offspring. As a result, most mutations that are accumulated throughout the different cells of multi-cellular organisms are not passed on. However, there is a large variety within populations of multi-cellular organisms. Therefore, when multi-cellular organisms reproduce, the genomes of the parents are combined to produce a new unique offspring.

Local Point Mutations

During replication the DNA polymerase sometimes makes mistakes and assembles an incorrect nucleotide in the complementary strand. Most of the time this is noticed by other enzymes which correct the error, but on occasion the error remains in the genome and a mutation is passed down to descendant cells. Such mutations are called substitutions: one nucleotide is replaced by another. Other times the polymerase inserts or deletes nucleotides, a.k.a. indels. These three mutations are commonly referred to as local point mutations. Mutations occur randomly and at "equal" rate all over the genome. Some parts are, however, more essential than others, and subsequently, due to selection, these parts are more conserved as the organism evolves.
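A substitution changes at most one codon, while an indel shifts every codon downstream of it (a frame shift). A small sketch of this effect; sequences and helper names are illustrative only:

```python
def codons(dna: str) -> list[str]:
    """Split a sequence into consecutive codons (an incomplete tail is dropped)."""
    return [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]

original = "ATGGCTCCTTGG"
mutated = original[:4] + "A" + original[4:]  # insert one nucleotide after position 4

print(codons(original))  # -> ['ATG', 'GCT', 'CCT', 'TGG']
print(codons(mutated))   # every codon after the insertion point is shifted
```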
One example of a region particularly essential to all self-reproducing organisms is the gene coding for DNA polymerase. Clearly, mutations in this gene can seriously damage the reproduction ability of the cell. Therefore, due to selective pressure, it has remained highly conserved throughout all organisms on the planet. Some other mutations are not affected by selective pressure. For example, some nucleotide substitutions in codons do not result in a change of the amino acid they code for and are termed neutral. Since these mutations carry little risk to the individual, they are steadily accumulated as genomes are passed down to the descendant cells. The effect of an insertion of a single nucleotide into a gene is the opposite. Such an insertion changes all subsequent codons, a frame shift, resulting in a drastically changed protein and serious damage.

Global Mutations

As opposed to local point mutations, there are genome rearrangements, which affect large segments of the genome either by transposing (moving) or reversing them. Such segments are called mobile genetic elements. Genome rearrangements do not change gene products and are much less common than local mutations. Still, they appear to have played a significant role in the evolution of


genomes, especially in the evolution of the fruit fly, Drosophila. Such mobile elements are often remnants of a viral or bacterial infection, i.e. a part of a genome from a virus or a bacterium that has been incorporated into the host genome. Another type of global mutation is the complete duplication of a segment. If the segment contains genes, then the duplication gives rise to two copies of the genes. Since both copies carry the same function, the selective pressure decreases, making it possible for one of the copies to evolve into a gene with a different function. In fact, most genes belong to larger families of genes which have all evolved from one common ancestral gene. Other global mutations affect the organization of chromosomes, such as fusion and fission of chromosomes. A chromosomal translocation is another event, in which the tails of two chromosomes are interchanged. While the mobile elements are caused by remnants of viral and bacterial infections, the latter global mutations occur during cell division, due mainly to errors in the biological process. Large segmental insertions and deletions also occur. See Figure 1.4 for a view of how global mutations have been part of the evolution of the human and mouse species.

Figure 1.4: The picture shows the chromosomes of the human and mouse genomes. Each segment of the mouse genome is colored by the color of the human chromosome where the segment appears. The picture clearly shows that global mutations have played a central role in the evolution of human and mouse. Source: [99].
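Combinatorially, such block moves are modeled as transpositions acting on a permutation of gene order; this is the elementary operation in the Sorting By Transpositions problem from the abstract. A sketch of one such move (the index convention and helper name are mine):

```python
def transpose(perm: list[int], i: int, j: int, k: int) -> list[int]:
    """Move the block perm[i:j] so that it is placed just before position k (i < j <= k)."""
    return perm[:i] + perm[j:k] + perm[i:j] + perm[k:]

genome = [1, 4, 5, 2, 3, 6]
# Moving the block [4, 5] past [2, 3] sorts this permutation in a single step.
print(transpose(genome, 1, 3, 5))  # -> [1, 2, 3, 4, 5, 6]
```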

Chapter 2

Introduction to Computational Problems

This introduction is meant to give background and to place the results of this thesis into context; the next chapter gives a detailed exposition of the results. Here, we deal with distance-based and character-based phylogenetic tree reconstruction, computation of pairwise distances both with respect to DNA sequences and gene order data, ancestral sequence reconstruction, and also multiple alignments. The common thread throughout the chapter is phylogenetic tree reconstruction. For a full and excellent exposition of this subject the reader is referred to Felsenstein's book on phylogeny [61].

The evolutionary history of a set of species is a central concept in biology that is commonly described by a phylogenetic tree. Frequently, the phylogenetic tree is unknown and the only information available is the genetic information of the taxa, e.g., a set of extant species. It is, therefore, a fundamental problem to reconstruct the phylogenetic tree given information about the taxa. The first to reconstruct evolutionary trees were Darwin and Haeckel in the middle of the 19th century. They used physical and morphological information about the species and from that reconstructed trees in which similar species were grouped together. Today evolutionary trees are reconstructed from genetic information about the taxa. Most often the information is in the form of genetic sequences, but it can also be information about the gene order, or based on anything else where similarity between species can be measured. However, the underlying assumption is the same as that of Darwin and Haeckel: two species with a close common ancestor are more similar than two species with a more remote common ancestor.

All methodologies for phylogenetic tree reconstruction are defined by the following two notions:

1. the objective function, i.e. a criterion by which trees can be compared,

2. and the algorithm that is used to search the space of trees to find the best tree.

The aim is that the objective function models our assumption of evolution and that the algorithm is good enough at finding the best tree. In general, there are two approaches to phylogenetic tree reconstruction: 1. distance matrix methods and 2. character-based methods.

Distance matrix methods are the more universally applicable approach. The only input is a matrix of estimated pairwise distances between the taxa, and the objective is to find an edge-weighted tree whose leaf-to-leaf distances are close to the input matrix. These distances are, in turn, estimated from the available genetic information about the taxa.

In character-based methods, the approach is to use the actual genetic information and not to reduce it to pairwise distances. Further, each character-based method has an underlying assumption about the type of mutations that change the genetic information. The objective functions in character-based methods are thus defined by the genetic information and the mutations that act on it. There are two basic types of optimization criteria for character-based reconstruction: parsimony and likelihood. Parsimony is the most straightforward approach; the objective is to find the phylogenetic tree which explains the genetic information at the leaves using as few mutations as possible. Mutations are, however, known to occur at random, and occasionally new mutations reverse the effect of earlier mutations, which a pure mutation count cannot detect. Therefore, there are likelihood methods for tracing mutational events. In likelihood methods, there is an assumption of a known probabilistic model describing the probability of the mutational events, and the objective is to find the phylogenetic tree which is the most likely explanation of the genetic information at the leaves.

This thesis concerns computational problems that arise in different settings and stages of phylogenetic tree reconstruction.
This includes computing distances, computing objective functions, and designing algorithms for searching the space of trees.

Organization of Introduction

In the rest of this introduction, we describe computational problems that arise in different settings of tree reconstruction. First, in Section 2.1, distance matrix methods are described together with the most common objective functions and algorithms. Moreover, the subject of accuracy from a distance-based perspective is discussed. This introduces the paper Fast Neighbor Joining by Elias and Lagergren [49]. As mentioned, distance-based methods take as input a matrix of pairwise evolutionary distances between the taxa. This requires that pairwise distances can be

computed from genetic information corresponding to the taxa. In the following sections, we describe the three major settings of genetic information that is available for phylogenetic tree reconstruction: 1. genetic sequences that have evolved through substitutions,

2. genetic sequences that have evolved through local point mutations (substitutions, insertions, and deletions),

3. and genomes that have evolved through genome rearrangements.

Section 2.2 introduces the simplest setting of phylogenetic tree reconstruction: when the genetic information of the taxa consists of DNA sequences that have evolved only through substitutions. We discuss the character-based methods in this setting and present some common probabilistic models of DNA-sequence evolution. With respect to these models it is possible to compute the maximum likelihood (ML) estimate of the actual number of mutations by correcting for the observed mutational events. This is the main concern of the paper Fast Computation of Distance Estimators by Elias and Lagergren [50]. In view of probabilistic models, an algorithm for tree reconstruction can be seen as a statistical estimator of the underlying tree it aims at reconstructing. Thus, due consideration has to be given to the consistency and convergence of these estimators. We discuss these issues with regard to various character-based methods and distance matrix methods. Thereafter, we discuss issues of reconstructing ancestral sequences in a phylogenetic tree with respect to such probabilistic models. This introduces the paper Reconstruction of Ancestral Genomic Sequences using Likelihood by Elias and Tuller [51].

Thereafter, in Section 2.3, it is described how tree reconstruction is performed when the taxa correspond to sequences that have evolved through local point mutations: substitutions, insertions, and deletions. In most such cases, there is a stage in the data processing in which the sequences have to be aligned into an alignment matrix such that homologous nucleotides are in the same column. Such multiple alignments allow for a more thorough investigation of the data using only substitutions, as described above. We introduce the most common optimization criteria for multiple alignments, and discuss their complexity, approximation algorithms, and heuristics.
The complexity of multiple alignment is the main concern of Settling the Intractability of Multiple Alignment by Elias [46]. Finally, in Section 2.4, we describe how pairwise distances can be computed from gene order data that have evolved through genome rearrangements. This introduces the paper A 1.375-Approximation Algorithm for Sorting By Transpositions by Elias and Hartman [48].


CHAPTER 2. INTRODUCTION TO COMPUTATIONAL PROBLEMS

2.1 Phylogenetic Tree Reconstruction using Distance Methods

This section introduces the problem of constructing phylogenetic trees from pairwise distances. First the most important optimization criteria are introduced, followed by an exposition of the different heuristic algorithms. At the end, we discuss the issue of accuracy from a distance-based perspective.

Phylogenetic trees are used to represent the evolutionary history of a set of taxa. In the setting of distance matrix methods, such trees are accompanied by an edge length function, where each edge length represents the amount of evolutionary change between the two associated nodes. Such an edge weighted tree naturally induces an additive leaf-to-leaf distance function which uniquely corresponds to the underlying edge weighted tree, see Figure 2.1. If the tree is T and the edge length function is w, then we denote the leaf-to-leaf distance between leafs i and j by D^T_w[i, j]. The general idea of distance matrix methods is to reconstruct the edge weighted tree from a distance matrix of estimated pairwise distances between the taxa. The aim is that the additive distance matrix of the edge weighted tree should be as close as possible to the estimated distances. This simple formulation has led to a large variety of optimization criteria for finding the closest tree. Today distance matrix methods are the largest family of tree reconstruction methods. It should be mentioned that although the edge lengths are of interest, it is the tree which is the most essential parameter in tree reconstruction.

[Figure: an unrooted tree T on leafs A, B, C, D with pendant edge lengths 2, 1, 1, 1 (for A, B, C, D respectively) and an internal edge of length 3, together with its additive distance matrix:]

    D^T_w    A  B  C  D
      A      0
      B      3  0
      C      6  5  0
      D      6  5  2  0

Figure 2.1: A tree with edge lengths and the associated additive distance matrix.

Optimization Criteria

In 1967, Cavalli-Sforza and Edwards [27] proposed the least squares criterion for reconstructing an edge weighted tree from a matrix of pairwise distances between the leafs:

    argmin_{(T,w)} Σ_{i,j} (D[i, j] − D^T_w[i, j])^2,

where D^T_w[i, j] is the leaf-to-leaf distance between leaf i and j, and D is the matrix of estimated distances. In addition to the ℓ2 matrix norm, which corresponds to the least squares estimate, there are optimization criteria under other matrix
norms, such as the ℓ1 and ℓ∞ norms. The ℓ2 norm is, however, statistically justified and is, therefore, the most common criterion. The justification is based on the assumption that the estimated distances in the input matrix are observations of independent normally distributed distances in the true edge weighted tree. Under this assumption, the least squares estimate of Cavalli-Sforza and Edwards is the standard statistical approach for obtaining the most likely estimate, when the variances of the distributions are assumed to be 1. Also in 1967, Fitch and Margoliash [64] introduced a least squares estimate in which the variance is allowed to differ. Furthermore, there are least squares estimates in which the distances are not assumed to be independent [29].

Cavalli-Sforza and Edwards also showed how the least squares estimate of the edge lengths can be computed for a given tree in time O(n^3); this has subsequently been improved to O(n^2) [66]. However, as was shown by Day in 1986 [39], finding the least squares edge weighted tree is NP-hard. Furthermore, it is NP-hard to select the closest edge weighted tree under the ℓ1 [39] and ℓ∞ [2] matrix norms. Therefore, it is unlikely that exact polynomial algorithms exist, but several heuristics have been proposed.

Minimum evolution (ME) [82, 105] is another popular criterion for distance methods. In ME, the objective is to find the edge weighted tree with the shortest total edge length, when the edge lengths are defined by the least squares estimate. Furthermore, there are criteria that are not based on matrix norms, in particular those used in quartet methods, which are mentioned below.
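As a small worked instance of fitting edge lengths to distances: for three leafs the tree topology is unique and the least squares fit is exact, since the three leaf-to-leaf equations determine the three pendant edge lengths. A sketch (hypothetical helper name, Python assumed):

```python
def three_leaf_edge_lengths(d_ab, d_ac, d_bc):
    """Solve a + b = d_ab, a + c = d_ac, b + c = d_bc for the three
    pendant edge lengths a, b, c of the unique 3-leaf star tree."""
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c
```

For more than three leafs the system is overdetermined and only a least squares solution exists in general, which is where the O(n^2) algorithms cited above come in.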

Algorithms

The work of Cavalli-Sforza and Edwards builds on the work on clustering algorithms by Sokal and Sneath from 1963 [113]. The first clustering algorithm, called UPGMA, was introduced by Sokal and Michener in 1958 [114], and since then a large number of clustering algorithms have been invented [114, 107, 49, 65, 21, 110]. These build a binary tree from a star tree by iteratively picking a pair of sibling taxa, creating a new node representing their parent, and computing a new reduced distance matrix in which the siblings have been replaced by their parent. Thus, there are two properties that define a clustering method: (1) the selection criterion that is used for selecting a pair of siblings and (2) the reduction function used for computing the new matrix.

Clustering methods do in general not optimize any of the aforementioned criteria. They should rather be seen as recipes for constructing a tree according to a set of rules. These rules are, however, often based on the criteria mentioned above and the algorithms can, therefore, be seen as heuristics for these criteria.

The Neighbor Joining (NJ) algorithm, seen below, is a clustering method introduced by Saitou and Nei [107]; in Studier and Keppler [116] its running time was improved to O(n^3). My impression is that NJ is the most widely used tree reconstruction method and that it has been embraced due to its simplicity, speed, and accuracy. At each step it uses a form of minimum evolution to select which
pair of taxa to cluster. NJ can thus be seen to approximate the minimum evolution criterion and may also be seen as a rough estimation of the least squares estimate.

Algorithm NJ(D)
1. while the number of nodes in D > 3 do
   a) Select a sibling pair by
      (a, b) ← argmin_{x ≠ y ∈ D} (n − 2) · D[x, y] − Σ_{z ∈ D} (D[z, x] + D[z, y])
   b) Reduce a and b to a new node c representing their parent. Remove a and b from D and add c by setting the distance to any other node z as
      D[z, c] ← (D[z, a] + D[z, b]) / 2
   c) Connect a and b to c by adding edges (a, c) and (b, c).

2. Connect the three nodes of D in a star and return the resulting tree.

There are other methods which attempt to optimize one of the criteria mentioned above. However, since the problems are NP-hard, there are probably no polynomial algorithms. Therefore, most algorithms that attempt at actually optimizing one of the criteria adopt heuristic hill climbing. The basic idea in these methods, so called branch swapping algorithms, is to perform local changes in trees to improve the objective, such as disconnecting a subtree and then reconnecting it at some other place. In the case of the least squares criterion, there are two such methods: that of Felsenstein [60] and that of Makarenkov and Leclerc [91]. For minimum evolution, the algorithm by Desper and Gascuel [41] also attempts at finding a local optimum through branch swapping.

Quartet methods^1, popularized by Buneman [23], use a different set of optimization criteria that are not based on matrix norms. Here, the idea is to build small trees of only four species and then to puzzle these together using different rules. The first such method was the Buneman tree algorithm [23], which has O(n^3) running time [18] and constructs a tree containing all quartets (a, b)(c, d) s.t.

    min(D[a, c] + D[b, d], D[a, d] + D[b, c]) − D[a, b] − D[c, d] > 0.

The idea of using quartets to build a tree has been extended further in the DCM algorithm [77, 85], which builds a set of subtrees, using some other algorithm as a subroutine, and then merges these together to create a tree for all taxa. The only implementation of DCM that is of practical interest uses NJ to construct the subtrees and it requires O(n^5) time. The algorithm FastNJ presented in Elias and Lagergren yields a variant of DCM with running time O(n^4).
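The NJ pseudocode above can be sketched directly in Python. This is an illustrative toy implementation, not the O(n^3) bookkeeping of Studier and Keppler: taxa are keyed by label, the matrix is a dict of dicts, and the tree is returned as a Newick-style string without edge lengths.

```python
def neighbor_joining(D):
    """Toy NJ: D maps each taxon label to a dict of distances to every
    other taxon. Follows steps 1a, 1b, and 2 of the pseudocode."""
    D = {a: dict(row) for a, row in D.items()}   # work on a copy
    while len(D) > 3:
        n = len(D)
        r = {x: sum(D[x].values()) for x in D}   # row sums
        # Step 1a: minimise (n-2)*D[x,y] - sum_z D[z,x] - sum_z D[z,y].
        a, b = min(
            ((x, y) for x in D for y in D if x < y),
            key=lambda p: (n - 2) * D[p[0]][p[1]] - r[p[0]] - r[p[1]],
        )
        # Step 1b: replace a and b by their parent c in the matrix.
        c = "(%s,%s)" % (a, b)
        new_row = {z: (D[z][a] + D[z][b]) / 2 for z in D if z not in (a, b)}
        del D[a], D[b]
        for z in new_row:
            del D[z][a], D[z][b]
            D[z][c] = new_row[z]
        D[c] = new_row
    # Step 2: connect the last three nodes in a star.
    return "(" + ",".join(D) + ");"
```

On the additive matrix of Figure 2.1, the first selection step clusters A and B, after which only three nodes remain and the algorithm terminates with the star joining C, D, and the parent of A and B.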

^1 It should be mentioned that quartet methods are also applicable when the input consists of quartets, rather than a distance matrix from which the quartets are constructed.


Accuracy of Algorithms

Most objective functions in tree reconstruction from distance functions have been shown to be NP-hard. However, it is possible to give conditions on the input matrix under which algorithms do find the correct tree. One particularly desirable property for a distance method is to reconstruct the unique edge weighted tree (T, w) given an additive distance function D^T_w. For example, the UPGMA algorithm [114] does not have this property while the NJ algorithm [66] does.

With regard to accuracy of tree reconstruction, the big quest is to give a comprehensive set of conditions under which different methods reconstruct the correct tree. Atteson [7] was the first to approach this subject. He termed a distance matrix D nearly additive if there is an additive distance function D^T_w such that

    |D − D^T_w|_∞ < μ(T, w)/2,

where μ(T, w) is the length of the shortest edge of (T, w). Atteson showed that NJ reconstructs the correct tree from any nearly additive matrix, i.e., NJ has reconstruction radius 1/2. This is optimal, since a matrix at distance exactly μ/2 from two different trees cannot be resolved, see Figure 2.2. Moreover, NJ correctly reconstructs every edge e satisfying l(e)/4 > |D − D^T_w|_∞. This last property is referred to as the edge radius of the algorithm.

In practice, most distances are far from being nearly additive. Thus, although important, an optimal reconstruction radius is not sufficient for an algorithm to be useful in practice. Moreover, the notion of reconstruction radius only includes some of the matrices for which a method correctly reconstructs the tree. Therefore, methods cannot be compared only with respect to the reconstruction radius. For example, the recent DLCA algorithm by Gronau and Moran [68] has optimal reconstruction radius, optimal edge radius, and optimal runtime complexity. However, in practice the output from this algorithm is not necessarily more accurate than that of NJ and FastNJ. On page 22 we approach the subject of accuracy under probabilistic models of sequence evolution. For a listing of the conditions under which different algorithms reconstruct the correct tree, see Table 2.1.

Figure 2.2: The optimal reconstruction radius. The distance matrix D is at the same distance from both trees T and S, i.e., |D − D^T_w|_∞ = μ/2 and |D − D^S_w|_∞ = μ/2.

Algorithm             Complexity   Radius   Edge Radius
UPGMA [114]           O(n^2)       0        0
NJ [107, 7, 49, 94]   O(n^3)       1/2      1/4
BioNJ [65]            O(n^3)       1/2      1/4
FNJ [49, 94]          O(n^2)       1/2      1/4
ADDTREE [110, 7]      O(n^4)       1/2      1/2
Buneman [23, 7]       O(n^3)       1/2      1/2
DLCA [68]             O(n^2)       1/2      1/2

Table 2.1: An overview of the conditions under which some important algorithms reconstruct the correct tree. The runtime complexity is expressed in the number of taxa n, while the size of the input is a matrix of size O(n^2). Both the reconstruction radius and the edge radius are expressed in terms of the factor which provides the upper bound on the error; e.g., NJ correctly infers every edge e satisfying l(e)/4 > |D − D^T_w|_∞ and therefore has edge radius 1/4.

2.2 Tree Reconstruction based on Sequences that have Evolved through Substitutions

This section contains an introduction to phylogenetic tree reconstruction from genetic sequences that have evolved only through substitutions. We describe the character based methods for tree reconstruction, present some important models of DNA-sequence evolution, and describe how estimated pairwise distances are computed in these models. Further, we approach the subjects of statistical accuracy and reconstruction of ancestral sequences.

As was mentioned in the beginning of this chapter, the character based approach to tree reconstruction uses the actual genetic information of the taxa and the type of mutations that have caused their divergence. The simplest type of genetic information available is that of DNA sequences that have evolved only
through substitutions. There are few real world biological cases of sequences that have evolved without insertions and deletions of nucleotides. However, the regular approach to tree reconstruction from genetic sequences that have evolved through all three types of local point mutations is to first create a multiple alignment, as described in the following section. Thereafter, the common approach is to ignore all positions in which insertions and deletions have occurred and to analyze the remaining positions using only substitutions, as done in this section.

Character Based Optimization Criteria

Parsimony is one of the first character based criteria for tree reconstruction; it was introduced in 1963 by Edwards and Cavalli-Sforza [43]. In parsimony, the input is the sequences of the taxa and the objective is to find the most parsimonious (MP) tree, i.e., the tree which explains the taxa with as few substitutions as possible. This is a simple and appealing approach, since it does not assume any statistical model of sequence evolution. On the other hand, this is the very reason why MP is criticized for not being statistically sound. Since mutations occur at random, over time there is an increasing number of substitutions which reverse the change of some earlier substitutions. One problem with parsimony is that it fails to take such reversals into consideration.

The more common approach is, therefore, to use probabilistic models of sequence evolution which describe how nucleotides evolve independently and identically according to some Markov model from the root down toward the leafs. In this statistical framework each edge has an associated probability of change and, thus, each phylogenetic tree has a probability, a.k.a. the likelihood, of generating the observed sequences at the leafs. Likelihood methods were first introduced by Edwards and Cavalli-Sforza in 1964 [44], but were to a large extent popularized by the work of Felsenstein [58]. The most common likelihood based criterion is to seek the phylogenetic tree, together with edge lengths, which has the largest probability, averaged over all possible assignments to the internal nodes, of generating the observed sequences at the leafs. This criterion is commonly referred to as Maximum Likelihood (ML). However, there are other criteria, for a listing see Steel and Penny [115], one of which is Ancestral Maximum Likelihood (AML), described at the end of this section. Parsimony, too, can be seen as a likelihood based approach, if the rates of evolution on each branch of the tree can vary freely from site to site.

Both MP and ML are NP-hard [39, 32] optimization problems. This probably rules out polynomial algorithms for finding optimal solutions, but there are a large number of heuristics. The most common heuristic for finding good trees [106, 59] under both criteria is to start with an initial tree and then perform branch swapping as described on page 16. In addition, distance methods can be used as heuristics for character based optimization criteria. For MP the approach is to compute all pairwise Hamming distances and then apply a suitable distance method. Similarly, for ML the approach is to compute the most likely estimate of each pairwise distance and then apply the distance method. Since the genetic information is reduced to
pairwise distances, the distance-based methods are generally slightly less accurate than the character based heuristics. However, distance-based methods are usually much faster and also easier to implement.

Models of DNA Evolution and Distance Estimation

As mentioned above, mutations occur at random and therefore over time there is an increasing probability that substitutions reverse the change of some earlier substitutions. Here we present some probabilistic models of sequence evolution and describe how estimated pairwise distances are computed in these models.

There are various Markov models which describe how sites evolve independently and identically over time from the root down toward the leafs. The simplest model is the two-state Cavender-Farris-Neyman model (CFN) [28, 55, 98]. In this model, there are two symbols (0 and 1) and each edge is associated with a probability of mutation. Although CFN is applicable in some biological settings, it is mainly of theoretical interest. Markov models for DNA evolution consist of four states, one for each base, and for each state transition there is an associated rate of change per time unit. For each such model, specific nucleotide, and elapsed time t, it is possible to compute the probability of observing a substitution from the nucleotide to any of the three other bases. Moreover, it is possible, either numerically or through a closed formula, to compute the most likely estimate of the actual number of substitutions between two sequences by counting the observed number of substitutions and reversing the computation of the Markov model.

The simplest of these models is the Jukes-Cantor model [78], Figure 2.3A. In this model all state transitions have the same rate, denoted α. Since each nucleotide can mutate into three other bases, after time t the expected number of substitutions in one position is 3αt. Using the properties of the Poisson distribution, it can be shown that the probability of observing a change n1 → n2 after time t is

    Pr[n2 | n1, t] = (1/4) · (1 − e^{−4αt}),

and the probability of observing any of the three possible substitutions is, therefore,

    p = (3/4) · (1 − e^{−4αt}).

Thus, given α and t, we have the probability of change and can compute the likelihood of observing two sequences of length l at Hamming distance h as p^h · (1 − p)^{l−h}. Moreover, if α and t are unknown, we can invert the function to compute the ML estimate of the actual number of mutations, 3αt, as

    −(3/4) · ln(1 − 4h/(3l)).
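As a concrete sketch of the closed formula above, the Jukes-Cantor ML distance can be computed directly from the Hamming distance of two aligned, equal-length, gap-free sequences (illustrative function name, Python assumed):

```python
import math

def jukes_cantor_distance(s1, s2):
    """ML estimate of the actual number of substitutions per site,
    -(3/4) * ln(1 - 4h/(3l)), under the Jukes-Cantor model."""
    l = len(s1)
    h = sum(a != b for a, b in zip(s1, s2))  # observed Hamming distance
    return -0.75 * math.log(1 - 4 * h / (3 * l))
```

Note that the estimate is undefined when h/l ≥ 3/4, i.e., when the sequences are no more similar than two random sequences; the correction always exceeds the raw proportion h/l, reflecting the reversals discussed above.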


Figure 2.3: (A) The Jukes-Cantor model. (B) Kimura's two-parameter model, with α as the transition rate and β as the transversion rate.

In the Jukes-Cantor model the probability of mutation is the same for all state transitions. Since this is not the case in reality, there are more general models. In Kimura's two-parameter model [83], Figure 2.3B, which is a generalization of the Jukes-Cantor model, there are two different rates of change. The model distinguishes between two types of state transitions: transitions, which are changes within the purine (A and G) and pyrimidine (C and T) groups, and transversions, which are changes between the groups. As with the Jukes-Cantor model, given the model parameters α and β, it is possible to compute the likelihood of observing two divergent sequences. Moreover, there is a closed formula for the ML estimate of the actual number of mutations, given by

    (1/2) · ln(1/(1 − 2P − Q)) + (1/4) · ln(1/(1 − 2Q)),

where P and Q are the observed rates of transitions and transversions, respectively.

Both the Jukes-Cantor model and Kimura's two-parameter model assume that the equilibrium distribution of nucleotides is uniform. This is an unrealistic assumption and there are more sophisticated models, such as Felsenstein-84 [84] and the Tamura-Nei model [119], which allow for arbitrary base frequencies. In addition, the Tamura-Nei model distinguishes between purine-transitions, A ↔ G, and pyrimidine-transitions, C ↔ T. The Jukes-Cantor model and Kimura's two-parameter model are the only two models in use with known closed formulas for the ML estimate. However, both the Felsenstein-84 model and the Tamura-Nei model have well-behaved consistent estimators with closed formulas. Moreover, the ML estimate can be computed numerically. The approach to computing the ML estimate of the actual number of mutations is, therefore, to first count the number of transitions (possibly purine- and pyrimidine-transitions separately) and transversions and thereafter compute the estimate, either numerically or exactly.
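A sketch of the closed-formula estimate above (illustrative function name; sequences assumed aligned, equal length, and gap-free):

```python
import math

def kimura_two_parameter(s1, s2):
    """ML distance under Kimura's two-parameter model, using the closed
    formula (1/2)ln(1/(1-2P-Q)) + (1/4)ln(1/(1-2Q))."""
    purines = {"A", "G"}          # pyrimidines are C and T
    l = len(s1)
    transitions = transversions = 0
    for a, b in zip(s1, s2):
        if a == b:
            continue
        if (a in purines) == (b in purines):
            transitions += 1      # change within a group: A<->G or C<->T
        else:
            transversions += 1    # change between the groups
    P, Q = transitions / l, transversions / l
    return 0.5 * math.log(1 / (1 - 2 * P - Q)) + 0.25 * math.log(1 / (1 - 2 * Q))
```

When α = β the model collapses to Jukes-Cantor, and the formula agrees with the Jukes-Cantor estimate.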
Consequently, for two sequences of length l it takes time Ω(l) to compute the estimate of the actual number of mutations. In distance-based tree reconstruction, all pairwise distances have to be estimated between n sequences of length l. Therefore, the straightforward algorithm of
computing all pairwise distances between n sequences has running time Θ(l · n^2). In theory, this computation can be accomplished in time O(l · n^{1.376}), see Elias and Lagergren [49]; however, the only practical algorithms require Θ(l · n^2) time. Neighbor Joining [107], the most common distance method, has running time O(n^3). This means that, since the sequence length typically is much larger than the number of taxa, the distance estimation is the bottleneck in phylogeny reconstruction. In Elias and Lagergren [50], a fast practical algorithm is presented which takes two sequences and counts the number of transitions and transversions.

There are a large number of other issues regarding distance estimation, e.g., rate variation between sites, ambiguity symbols, models for amino acid evolution, etc. For a thorough exposition of all issues in distance estimation the reader is referred to the book Inferring Phylogenies by Felsenstein [61]. Furthermore, methods for handling ambiguity symbols are investigated in Elias and Lagergren [50].

Statistical Accuracy - Consistency and Convergence

In view of Markov models for sequence evolution, the sequences representing the taxa can be seen as having been generated by a random process over the underlying tree. This random process is determined by the rates in the Markov model and the elapsed time on the edges of the tree. Tree reconstruction methods can, therefore, be seen as statistical estimators of these parameters. Consequently, due consideration has to be given to whether the estimators/methods are consistent, meaning that they converge to the correct tree as the sequence length, i.e., the number of sites sampled from the tree, approaches infinity (see Figure 2.4). Since the tree is the most important parameter, in practice, tree reconstruction algorithms are termed consistent if they are consistent estimators of the tree alone, i.e., disregarding the edge lengths and the rates in the Markov model.

As mentioned above, for the Markov models of DNA evolution it is possible to invert the model and compute the ML estimate of the actual number of mutations. These estimates are consistent in the sense that the estimated distance approaches the actual distance as the sequence length approaches infinity. As a result, due to the accuracy of distance methods (see page 17), the NJ algorithm and many other distance methods are consistent methods for recovering the correct tree, i.e., NJ reconstructs the correct tree given distances estimated from infinitely long sequences.

With respect to character based methods, the issue of consistency has been widely debated, see Steel and Penny [115] for a review. In 1978, Felsenstein [57] showed that MP is an inconsistent estimator and, in 1996, Chang [30] showed that ML is consistent. The latter result assumes that the correct model is known beforehand; otherwise ML is actually inconsistent [31]. Moreover, it should not be forgotten that both MP [39] and ML [32] have been shown NP-hard. Therefore, since heuristics are not guaranteed to find an optimal solution, heuristics can be consistent or inconsistent independently of the criteria they aim at optimizing. In

Figure 2.4: The performance of NJ and FastNJ on simulated data with 400 leafs, for sequence lengths from 200 to 2000. Notice how both algorithms converge to the correct tree (0 on the y-axis) as the sequence length increases. The y-axis corresponds to the normalized Robinson-Foulds distance between the correct tree and the constructed tree. The Robinson-Foulds distance is the number of edges (splits of the leafs) that are not part of both trees. For example, for a sequence length of 1000, more than 85% of the edges are correctly inferred by both NJ and FastNJ.

Elias and Tuller [51], another likelihood based criterion, AML, is shown to be an inconsistent estimator of the tree.

In practice, sequences are of limited length and therefore it is not enough for a method to be consistent. Instead, algorithms have to, with high probability, construct the correct tree from sequences of reasonable length. It has been shown that NJ may require sequences of length exponential in the number of taxa to reconstruct the correct tree with high probability. From this has sprung the development of fast-converging algorithms, i.e., algorithms that reconstruct the correct phylogeny from sequences whose length is polynomial in the number of sequences [52, 77, 85, 37]. However, except for the Disc-Covering Method (DCM) [77, 85], these algorithms have had little or no practical impact. The only variation of DCM that is fast-converging and of practical interest uses NJ to construct small subphylogenies that are later patched together into one larger phylogeny, i.e., NJ is used as a subroutine. Further, in Daskalakis et al. [38] an algorithm is presented which with high probability is expected to reconstruct the correct tree from sequences of length O(log n), if no edge has change probability > (√2 − 1)/(2√2) ≈ 0.146.

Ancestral Sequence Reconstruction

As mentioned above, tree reconstruction methods are termed consistent if they are consistent estimators of the tree. However, there are also other parameters that are essential to the process that generated the sequences: the rates of change, the edge
lengths, and also the ancestral sequences. A challenging task in computational biology is the reconstruction of the ancestral sequences, given the phylogenetic tree. Such sequences can give clues to the selective pressure at different sites in the sequences. Ancestral sequences can be reconstructed with respect to the same criteria as for tree reconstruction, only that the tree is fixed. Since one of the underlying parameters (the tree) is part of the input, the new formulations are easier to solve computationally. For example, the most parsimonious ancestral sequences can be computed with dynamic programming in time O(nl) using Fitch's algorithm [63]. With respect to the ML criterion, the problem has to be solved by first estimating the edge lengths and the rates in the Markov model; thereafter the most likely ancestral sequences can be inferred. For this version of ML, called small ML, no polynomial algorithm is known, but on the other hand there is no intractability result either. However, if the tree, in addition to the edge lengths and the rates in the Markov model, is known, then the most likely estimate of the ancestral sequences can be computed using dynamic programming similar to Fitch's algorithm [104].

In the ML criterion the objective is to construct the phylogenetic tree with edge lengths which has the largest probability, averaged over all possible assignments to the internal nodes, of generating the observed sequences at the leafs. Another important likelihood criterion is the Ancestral Maximum Likelihood (AML) criterion. Here the objective is to find the most likely estimate of all parameters together: ancestral sequences, edge lengths, and rates in the Markov model. Also for AML there is neither a polynomial algorithm nor an intractability result.
However, in Elias and Tuller [51], approximation algorithms, fixed-parameter tractable (FPT) algorithms, and fast heuristic algorithms are presented for two variants of the problem: when the phylogenetic tree is known and when it is unknown. The variant of AML in which the tree, in addition to the ancestral sequences, is sought has been shown NP-hard by Berry et al. [1].
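The dynamic programming behind Fitch's algorithm, mentioned above, can be sketched for a single alignment column (illustrative representation: a binary tree as nested pairs, leaves as one-letter strings):

```python
def fitch(tree):
    """Return (candidate state set, parsimony cost) for one site.
    Bottom-up pass: when the children's candidate sets intersect, the
    intersection is kept at no extra cost; otherwise their union is kept
    and one substitution is counted somewhere below this node."""
    if isinstance(tree, str):            # leaf: a single nucleotide
        return {tree}, 0
    left_set, left_cost = fitch(tree[0])
    right_set, right_cost = fitch(tree[1])
    if left_set & right_set:
        return left_set & right_set, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1
```

Running this independently over all l sites of n aligned sequences gives the O(nl) bound stated above; a top-down pass (not shown) then picks one state per internal node.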

2.3 Multiple Sequence Alignment - Local Point Mutations

In this section, we consider sequences that have evolved through substitutions, insertions, and deletions. We introduce the different optimization criteria for Multiple Alignment. Further, we discuss the issues of complexity, approximation algorithms, and heuristics.

Most mutational events in a genome are local point mutations that edit the DNA sequence by substitutions of single nucleotides and indels (insertions and deletions) of short segments. These events directly change biological features by affecting gene products and gene expression. Similarity between DNA sequences is, therefore, a good indicator of evolutionary origin and conservation of essential biological features. As mentioned in the beginning of the previous section, tree reconstruction from sequences that have evolved through local point mutations is usually performed by first computing a multiple alignment, as described in this
section, and thereafter analyzing the data using substitutions. The algorithm by Ulitsky et al. [122] is an exception, which computes pairwise distances from unaligned data.

An alignment of sequences, see Figure 2.5, is a matrix with one sequence in each row. In alignments, homologous nucleotides are identified by positioning them in the same column, with gaps representing insertions and deletions. A central question is how to score different mutational events. In the most elementary setting, the goal is to find an alignment of two sequences containing the least number of mismatching nucleotides. This type of alignment explains the least number of mutations needed to transform one sequence into the other, a.k.a. the edit distance. This problem was first studied in 1965 by Levenshtein [87], with no relation to biology, who gave an O(l^2) dynamic programming algorithm for solving the problem, where l is the length of the sequences.

Figure 2.5: Example of a multiple alignment of four sequences. The gap symbol, '-', represents the insertion or deletion of a nucleotide.

Although elementary, the edit distance does not capture the biological reality that insertions and deletions are less frequent than substitutions. Therefore, there are other functions for scoring columns that more accurately model the mutational distance between nucleotides. As with all distances, it is desirable to restrict the symbol distance to a metric (see page 36 for a formal definition). That is, the properties of identity, symmetry, and triangle inequality should hold for the distances between symbols in the alphabet. The metric under which the edit distance is found is known as the unit metric.

In 1970, Needleman and Wunsch [97] were the first to apply dynamic programming to align biological sequences. Their approach was to, rather than minimizing evolutionary distance, maximize the similarity between homologous nucleotides. Ever since, numerous similarity scores have been proposed both for nucleotide and amino acid sequences [40, 74]. The Needleman-Wunsch algorithm constructs a global alignment. In 1981, Smith and Waterman [112] proposed a dynamic programming algorithm for finding the most homologous subsequences in what is naturally referred to as a local alignment. Both the Needleman-Wunsch algorithm and the Smith-Waterman algorithm have been the foundation for a large set of heuristic algorithms which allow for fast homology searches of gigantic sequence databases, e.g., FASTA [90] and BLAST [6]. Moreover, the runtime complexity of O(l^2) has been improved to O(l^2 / log l) in [92, 36]. These improvements are, however, not practical.
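Levenshtein's O(l^2) dynamic program for the edit distance under the unit metric can be sketched as follows (an unoptimized, illustrative version):

```python
def edit_distance(s, t):
    """Least number of substitutions, insertions, and deletions
    turning s into t (the unit metric)."""
    m, n = len(s), len(t)
    # dp[i][j] = edit distance between the prefixes s[:i] and t[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                      # delete all of s[:i]
    for j in range(n + 1):
        dp[0][j] = j                      # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + 1,                           # deletion
                dp[i][j - 1] + 1,                           # insertion
                dp[i - 1][j - 1] + (s[i - 1] != t[j - 1]),  # (mis)match
            )
    return dp[m][n]
```

Tracing back through the dp table recovers an optimal pairwise alignment; replacing the unit costs with an arbitrary metric symbol distance gives the weighted variants discussed below.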

CHAPTER 2. INTRODUCTION TO COMPUTATIONAL PROBLEMS

When sequence similarity is weak, it may be difficult to capture the true evolutionary events with a pairwise alignment, because there is not enough information available to make a good assessment from only two related sequences. Therefore, more related sequences are included in the alignment, in what is naturally referred to as a multiple alignment. Since a multiple alignment communicates information about the common evolution of a set of sequences, it is both more accurate and more informative about the evolutionary process. As stated in [76], "pairwise alignment whispers... multiple alignment shouts out loud."

Multiple alignment is a core problem in computational biology [86] that has received much attention over the years, both in the line of heuristics and in that of hardness results. It is a prerequisite to many computational problems in which several related sequences are investigated, such as phylogenetic tree reconstruction, protein structure prediction, and localization of regulatory sequences.

Optimization Criteria, Complexity, and Approximation

In a multiple alignment there are several pairwise relations in each column, as opposed to pairwise alignment, where there is only one pairwise relation in each column. Therefore, it is not apparent what scoring function to use for columns in a multiple alignment. Over the years many schemes have been suggested; the most common variations are the SP-score (sum-of-pairs score), Star Alignment, and Tree Alignment (for a given phylogeny).

In the SP-score, the pairwise distances in each column are summed to give a total score for the alignment matrix (see page 37 for a formal definition). Over the years, the SP-score has received a lot of attention [9, 70]. However, no asymptotically faster algorithm has been proposed than the straightforward generalization [126] of the dynamic programming algorithm for two strings, which has a runtime complexity of O(2^n ∏_{i=1}^{n} l_i), where n is the number of sequences and l_i the length of the i'th sequence. In practice, there is a clever algorithm called MSA [26] which reduces the volume of the multidimensional dynamic programming matrix. Another very impressive algorithm is that of Althaus et al. [5], which solves an integer programming formulation of the problem.

A significant number of heuristics exist for the different multiple alignment formulations, [121, 62] to mention just a few. Most of them are based on progressive alignment, which was first proposed in 1987 by Feng and Doolittle [62]. In progressive alignment, a multiple alignment is built by successive pairwise alignments of sequences and groups of sequences. This approach has turned out to be useful and is, beyond any doubt, the most used approach for actually obtaining multiple alignments.

Gusfield [70] was the first to propose a polynomial-time approximation algorithm for the SP-score. The algorithm achieves a 2 − 2/n approximation ratio and relies on the symbol distance being a metric. Bafna and Pevzner [11] improved the ratio to 2 − k/n for any fixed constant k. Thereafter, in the seminal work of Wang and Jiang [123], a short NP-hardness proof was provided for the SP-score under a non-metric
symbol distance over a 4-symbol alphabet. This result was improved by Bonizzoni and Vedova [20], who showed that the problem is NP-hard for the binary alphabet and a specific metric. The result was extended further by Just [79] to cover many metrics, and the problem was also proved APX-complete under some non-metrics. However, not all metrics were covered, and in particular not the unit metric, the most elementary formulation of the problem. The intractability was settled in Elias [46], where the problem was shown intractable for all metrics over binary or larger alphabets.

In Star Alignment, the score is given by the distances from all symbols in the column to the consensus symbol for the column (see page 37 for a formal definition). The Star Alignment score admits the same 2-approximation as the SP-score does. In Wang and Jiang [123], Star Alignment was proved to be APX-complete over a 7-symbol alphabet; however, the symbol distance had neither the property of identity nor that of the triangle inequality. Li et al. [88] gave a PTAS and an NP-hardness result under the unit metric for a variant of Star Alignment in which there was a restriction on the number of gaps. Moreover, in Sim and Park [111] the problem was proved NP-hard for a 6-symbol metric. Finally, in Elias [46] the problem was shown intractable for all metrics over binary or larger alphabets.

In Tree Alignment, the input is a phylogenetic tree with sequences at the leaves (see page 37 for a formal definition). Here the score of a column is the least number of mutations needed to explain the symbols of the column when these label the associated leaves of the tree. When the symbol distance is a metric, this is a Steiner tree problem in a metric space. Wang and Jiang also proved that Tree Alignment is NP-hard for a specific metric over a 4-symbol alphabet. Later, in two companion papers, Wang and Jiang [125, 124] gave a couple of nice PTASs working for all metrics.
Finally, in Elias [46], the problem was shown intractable for all metrics over binary or larger alphabets. This result completely settles the complexity of the tree alignment score, which cannot have an FPTAS unless P=NP [8].

2.4 Genome Rearrangements - Global Mutations

In this section, we introduce the subject of distance estimation and tree reconstruction from gene-order data. After a general introduction, we discuss the problems of unsigned and signed sorting by reversals. Thereafter, we discuss sorting by transpositions and mention some other relevant genome rearrangement problems.

The occurrence of genome rearrangements was first observed in the 1930s by Dobzhansky and Sturtevant [42]. They studied the banded structure of the Drosophila chromosomes and observed that segments in chromosomes had been reversed. Later, in the 1980s, Palmer et al. [100, 101, 102, 75] observed that several species have essentially the same set of genes, but that their gene order differs. This suggests the use of genome rearrangements for tracing evolution.

When estimating the evolutionary distance between two organisms using genomic data, one wishes to reconstruct the sequence of evolutionary events that transformed one genome into the other. As opposed to local point mutations, which
are traced using alignments, global rearrangements are rare and may, therefore, provide more accurate and robust clues to the evolution of a set of species. Another advantage is that, compared to substitutions (page 19), the number of possible rearrangements is large, and it is therefore less likely that rearrangements reverse the effect of earlier rearrangements. Moreover, since genome rearrangements take the whole genome into account, gene-order data can be used for analyzing the evolutionary process of the whole organism rather than the evolution of a few genes.

In the remainder of this exposition, we make two common simplifying assumptions: we assume that genomes consist of a single linear chromosome and that there is a one-to-one mapping between the genes in the two genomes, i.e., genes are assumed not to be duplicated. At the very end of this section, we give references to several works related to these biologically relevant issues.

In the last decade, a large body of work was devoted to genome rearrangement problems. Genomes are represented by permutations, with the genes appearing as elements. The basic task is, given two permutations, to find a shortest sequence of rearrangement operations that transforms one permutation into the other. Assuming that one of the permutations is the identity permutation, the problem is to find the shortest way of sorting a permutation using a given rearrangement operation, or set of operations. There are three basic types of genome rearrangement events:

1. transpositions, where a segment is cut out and inserted in a different position, as in (1 2 3 4 5 6) → (1 4 5 2 3 6),

2. reversals, where a segment is reversed, as in (1 2 3 4 5) → (1 4 3 2 5),

3. and trans-reversals, where a segment is cut out, reversed, and inserted in a different position, as in (1 2 3 4 5 6) → (1 4 5 3 2 6).

Ideally, we seek the best rearrangement scenario using all the aforementioned operations.
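The three operations can be sketched as follows; this is a minimal illustration reproducing the examples above, and the index conventions (0-indexed, half-open segments) are my own, not the thesis's:

```python
# Three basic rearrangement operations on permutations, given as tuples.
def transposition(p, i, j, k):
    # Cut the segment p[i:j] and reinsert it before index k of the remainder.
    seg, rest = list(p[i:j]), list(p[:i]) + list(p[j:])
    return tuple(rest[:k] + seg + rest[k:])

def reversal(p, i, j):
    # Reverse the segment p[i:j] in place.
    return tuple(p[:i]) + tuple(reversed(p[i:j])) + tuple(p[j:])

def transreversal(p, i, j, k):
    # Cut the segment p[i:j], reverse it, and reinsert it before index k.
    seg, rest = list(reversed(p[i:j])), list(p[:i]) + list(p[j:])
    return tuple(rest[:k] + seg + rest[k:])
```

For example, `transposition((1,2,3,4,5,6), 3, 5, 1)` moves the segment (4 5), yielding (1 4 5 2 3 6) as in event 1 above.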
However, to date most works in the field have studied simplified versions of the problem in which only subsets of the operations are allowed (in addition to the simplifying assumptions above).

The basic task in genome rearrangements is to compute the rearrangement distance, i.e., the length of a shortest sequence of rearrangements needed to transform a permutation into the identity. With respect to tree reconstruction, page 2, this is equivalent to the parsimonious distance between two permutations. Furthermore, there is, as with genomic sequences, a large number of other computational problems related to gene-order data and evolution. Some other important problems include:

1. What is the largest possible rearrangement distance? This is also known as the diameter of the associated Cayley graph.

2. Given a phylogenetic tree with permutations at the leaves, compute the most parsimonious ancestral permutations.

3. Given a set of permutations, compute the most parsimonious phylogenetic tree.

As with genomic sequences, there are probabilistic models for genome rearrangements [96]. These give rise to likelihood-based questions, such as the expected distance, the most likely tree, etc. However, as mentioned, the number of possible global mutations is large, and it is therefore less likely that rearrangements reverse the effect of earlier rearrangements. For this reason, and also due to the complex analysis of rearrangements, parsimony-based questions are more studied than likelihood-based questions.

Next, we describe the (unsigned) Sorting By Reversals problem. Thereafter, we introduce the related problem of sorting signed permutations by reversals (Signed-SBR). We continue with Sorting By Transpositions and, finally, we mention some other related questions.

Unsigned Sorting By Reversals

Computational aspects of genome rearrangements were pioneered by Sankoff [108, 109] in the 1990s. In particular, he formalized the problem of Sorting By Reversals (SBR). Sorting a permutation π by reversals is the problem of finding the least number of reversals needed to sort π. Kececioglu and Sankoff [81] observed that the number of breakpoints, pairs of elements that are adjacent in π but not in the identity permutation, can be decreased by at most 2 with each reversal. For example, the permutation (1 3 2 4) contains two breakpoints, 1 3 and 2 4, and both can be removed by reversing the segment 3 2. This provided the first lower bound for the reversal distance,

drev(π) ≥ b(π)/2,

where drev(π) is the reversal distance and b(π) the number of breakpoints. Since it is always possible to find a reversal that decreases the number of breakpoints by at least one, Kececioglu and Sankoff's observation resulted in a 2-approximation.

In 1996, Bafna and Pevzner [12] introduced the notion of the breakpoint graph for representing permutations. The breakpoint graph of the permutation π = (π1 . . . πn) is an edge-colored graph with n + 2 vertices {π0, π1, . . . , πn+1}, where π0 = 0 and πn+1 = n + 1. For each i ∈ [0, n], a black edge is added between πi and πi+1, and a red arc is added between the elements πi and πi + 1; see Figure 2.6. The intuitive idea is that the black edges describe the order in the permutation and the red arcs describe the order in the identity permutation. A cycle in this edge-colored graph is called alternating if consecutive edges in the cycle are of different colors. Since each vertex in the graph is incident to an equal number of red arcs and black edges, there is an alternating Eulerian cycle in the breakpoint graph. Consequently, it is possible to decompose the graph into a maximal alternating cycle decomposition. We denote the number of cycles in the maximal decomposition by c(π).

Bafna and Pevzner observed that the identity permutation is the only permutation with n + 1 cycles in the maximal decomposition. Moreover, they observed that the number of cycles in the maximal decomposition can change by at most one with each reversal. Thereby, Bafna and Pevzner provided a new lower bound for the reversal distance,

drev(π) ≥ n + 1 − c(π),

from which they obtained a 1.75-approximation algorithm. The breakpoint graph has ever since been the classical tool for analyzing genome rearrangements.
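The breakpoint count b(π) and the Kececioglu-Sankoff bound can be sketched as follows; the helper names are mine, and the count follows the definition above (adjacency in the unsigned case is undirected, so elements are adjacent iff they differ by 1):

```python
import math

# Breakpoints of an unsigned permutation of {1..n}, extended with 0 and n+1.
def breakpoints_rev(perm):
    n = len(perm)
    ext = (0,) + tuple(perm) + (n + 1,)
    # A breakpoint is an adjacent pair that is not adjacent in the identity.
    return sum(1 for i in range(n + 1) if abs(ext[i + 1] - ext[i]) != 1)

def reversal_lower_bound(perm):
    # Each reversal removes at most 2 breakpoints: d_rev(pi) >= b(pi)/2.
    return math.ceil(breakpoints_rev(perm) / 2)
```

For the example above, `breakpoints_rev((1, 3, 2, 4))` returns 2, matching the two breakpoints 1 3 and 2 4.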

Figure 2.6: To the left, the breakpoint graph of the unsigned identity permutation, and to the right, the breakpoint graph of the permutation (5 4 3 1 7 2 6).

The approximation ratio of Bafna and Pevzner was improved to 1.5 by Christie [33] and further to 1.375 by Berman et al. [16]. In 1997, SBR and maximal alternating cycle decomposition were proved NP-hard by Caprara [24]. Furthermore, Berman and Karpinski [17] proved that it is NP-hard to approximate the reversal distance within a factor of 1.0008.

The reversal diameter, RD(n), of the symmetric group is the maximum value of drev(π) taken over all permutations of n elements, i.e.,

RD(n) = max_{π ∈ S_n} drev(π).

Bafna and Pevzner [12] proved that the diameter is RD(n) = n − 1.

Sorting Signed Permutations by Reversals

In reality, genes have a direction in which they are transcribed. If this information is available in addition to the gene order, genomes are represented by signed permutations: permutations in which each element is preceded by a + or a − indicating the direction of the corresponding gene. In this setting, when a segment is reversed, the signs of all the elements in the segment are changed, e.g., (+1 +2 +3 +4 +5) → (+1 −4 −3 −2 +5).

Bafna and Pevzner [12] were the first to study the problem of sorting signed permutations by reversals (Signed-SBR) and to extend the notion of the breakpoint graph. The breakpoint graph, Figure 2.7, of a signed permutation ~π = (π1 . . . πn) is built by first transforming it into an unsigned permutation, where each positive
element +x is replaced by (2x − 1, 2x) and each negative element −x by (2x, 2x − 1), e.g., (1 −2 4 3) ⇒ (1 2 4 3 7 8 5 6). Since each element of the signed permutation is modeled by two elements, it is not allowed to perform reversals that cut between any two elements 2i − 1 and 2i. Thereafter, the breakpoint graph is constructed for the unsigned permutation, and for each i the black edge and red arc between the elements 2i − 1 and 2i are removed. The resulting graph decomposes uniquely into alternating cycles. Moreover, as in the unsigned case, the number of cycles can increase by at most one with each reversal, yielding the lower bound

drev(~π) ≥ n + 1 − c(~π).

Furthermore, Bafna and Pevzner obtained a 1.5-approximation algorithm with respect to this lower bound.
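The transformation and the resulting (unique) cycle decomposition can be sketched as follows. This is my own illustration of the construction described above: after removing the forbidden edges, the remaining black edges pair consecutive positions (2i, 2i+1) of the extended sequence, and the remaining red arcs pair the values 2i and 2i+1. Note that this yields only the lower bound n + 1 − c(~π), not the exact distance:

```python
# Signed-to-unsigned transformation: +x -> (2x-1, 2x), -x -> (2x, 2x-1).
def unsign(sp):
    out = []
    for x in sp:
        a = abs(x)
        out += [2 * a - 1, 2 * a] if x > 0 else [2 * a, 2 * a - 1]
    return tuple(out)

# Number of alternating cycles c in the signed breakpoint graph.
def signed_cycles(sp):
    n = len(sp)
    ext = (0,) + unsign(sp) + (2 * n + 1,)
    black = {}
    for i in range(n + 1):                       # black edges by position pairs
        a, b = ext[2 * i], ext[2 * i + 1]
        black[a], black[b] = b, a
    red = lambda v: v + 1 if v % 2 == 0 else v - 1  # red arcs pair values 2i, 2i+1
    seen, cycles = set(), 0
    for v in range(2 * n + 2):
        if v in seen:
            continue
        cycles += 1
        w = v
        while w not in seen:                     # alternate black and red edges
            seen.add(w)
            u = black[w]
            seen.add(u)
            w = red(u)
    return cycles
```

For the example above, `signed_cycles((1, -2, 4, 3))` returns 2, giving the lower bound n + 1 − c = 4 + 1 − 2 = 3 reversals; the signed identity on n elements has the maximal n + 1 cycles.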

Figure 2.7: The breakpoint graph of the signed permutation (1 −2 4 3).

In their seminal work, Hannenhalli and Pevzner [71] analyzed the cycle structures of breakpoint graphs further. They observed that although it is possible to increase the number of cycles by at most 1 per reversal, certain substructures of cycles, referred to as hurdles and fortresses, require one additional reversal. Furthermore, Hannenhalli and Pevzner proved that the exact reversal distance is given by

drev(~π) = n + 1 − c(~π) + h(~π) + f(~π),

where h and f are the number of hurdles and fortresses, respectively. Based on this result, they gave an O(n⁴) algorithm for finding a shortest sorting sequence. Subsequent papers [80, 120, 14] have step by step simplified the analysis and improved the running time to O(n^{3/2} √(log n)). Furthermore, Bader et al. [10] provide a linear-time algorithm for computing the distance.

The signed reversal diameter, SRD(n), of the group of all signed permutations is the maximum value of drev(~π) taken over all signed permutations of n elements, i.e.,

SRD(n) = max_{~π : |~π| = n} drev(~π).

In [33], it is shown that the diameter is SRD(n) = n + 1.


Sorting By Transpositions

There has been significantly less progress on the problem of sorting by transpositions, Figure 2.8; the complexity of the problem is still open. In this setting, after adding the elements 0 and n + 1 to the beginning and end of the permutation, respectively, the set of breakpoints is defined by the pairs of elements πi πi+1 for which πi+1 − πi ≠ 1, Figure 2.8. Since a transposition can decrease the number of breakpoints by at most 3, the lower bound based on breakpoints is given by

dt(π) ≥ b(π)/3.

Moreover, it is always possible to perform a transposition which decreases the number of breakpoints by at least 1. This simple analysis yields a 3-approximation algorithm.

54321 → 32541 → 34125 → 12345

Figure 2.8: A shortest sequence of transpositions for sorting the permutation π = (5 4 3 2 1). Altogether there are 6 breakpoints in π: 0 5, 5 4, . . ., 1 6.

The breakpoint graph, see Figure 2.9, of the permutation π = (π1 . . . πn) consists of 2n + 2 vertices {r0, l1, r1, . . . , ln, rn, ln+1}, where li and ri represent the left and right side of element i. Let π0 = 0 and πn+1 = n + 1; further, for every 0 ≤ i ≤ n, connect ri and li+1 by a red arc, and for every πi, connect lπi and rπi−1 by a black edge. Similarly to Signed SBR, since each vertex has degree 2, the graph decomposes into a unique alternating cycle decomposition. We say that an alternating cycle is odd if it contains an odd number of red arcs, and denote the number of odd cycles by codd(π).

Bafna and Pevzner [13] observed that a transposition can change the number of odd cycles by at most two. Therefore, since the identity has n + 1 odd cycles, Bafna and Pevzner showed that the transposition distance is bounded by

dt(π) ≥ (n + 1 − codd(π))/2.

Furthermore, they gave a 1.5-approximation algorithm in which every three consecutive transpositions increase the number of odd cycles by 4.
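The transposition breakpoint count and the b(π)/3 bound can be sketched as follows; the helper names are mine, and the count follows the directed definition stated above (a transposition never reverses a segment, so the pair πi πi+1 must satisfy πi+1 − πi = 1 to be a non-breakpoint):

```python
import math

# Directed breakpoints of a permutation of {1..n}, extended with 0 and n+1.
def breakpoints_trans(perm):
    n = len(perm)
    ext = (0,) + tuple(perm) + (n + 1,)
    return sum(1 for i in range(n + 1) if ext[i + 1] - ext[i] != 1)

def transposition_bp_bound(perm):
    # Each transposition removes at most 3 breakpoints: d_t(pi) >= b(pi)/3.
    return math.ceil(breakpoints_trans(perm) / 3)
```

For π = (5 4 3 2 1), every one of the 6 adjacencies of the extended permutation is a breakpoint, so the bound is 2, while the shortest sorting sequence shown in Figure 2.8 uses 3 transpositions.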
Bafna and Pevzner's algorithm was simplified by Christie [33] and further by Hartman and Shamir [72], who also proved that the analogous problem for circular permutations is equivalent. Subsequently, Elias and Hartman [48] improved the approximation ratio to 1.375 by giving an algorithm in which every 11 consecutive moves increase the number of odd cycles by 16. Eriksson et al. [54] provided an algorithm that sorts any given permutation on n elements by at most 2n/3 transpositions, but it has no approximation guarantee.

Figure 2.9: The sorting by transposition breakpoint graph for the permutation (5 4 3 1 7 2 6). The graph has one cycle of length 5 and one of length 3.
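The cycle decomposition and the odd-cycle lower bound can be sketched as follows. This is my own encoding of the construction described above (red arcs ri -- li+1, black edges lπi -- rπi−1); each vertex has degree 2, so the decomposition is unique, and the sketch reproduces the cycle lengths claimed in Figure 2.9:

```python
import math

# Cycle lengths (counted in red arcs) of the transposition breakpoint graph.
def transposition_cycles(perm):
    n = len(perm)
    ext = (0,) + tuple(perm) + (n + 1,)
    # Encode l_i as ('l', i) and r_i as ('r', i).
    black = {}
    for i in range(1, n + 2):                # black edge l_{pi_i} -- r_{pi_{i-1}}
        black[('l', ext[i])] = ('r', ext[i - 1])
    red = lambda v: ('l', v[1] + 1)          # red arc r_i -- l_{i+1}
    lengths, seen = [], set()
    for i in range(n + 1):
        v = ('r', i)
        if v in seen:
            continue
        length, w = 0, v
        while True:
            seen.add(w)
            length += 1                      # traverse the red arc r_w -> l_{w+1}
            w = black[red(w)]                # then the black edge back to some r_j
            if w == v:
                break
        lengths.append(length)
    return lengths

def odd_cycle_bound(perm):
    # d_t(pi) >= (n + 1 - codd(pi)) / 2 (Bafna and Pevzner).
    codd = sum(1 for c in transposition_cycles(perm) if c % 2 == 1)
    return math.ceil((len(perm) + 1 - codd) / 2)
```

For (5 4 3 1 7 2 6) the decomposition has cycles of lengths 5 and 3, both odd, giving the lower bound ⌈(8 − 2)/2⌉ = 3; the identity decomposes into n + 1 odd cycles of length 1, giving bound 0.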

The transposition diameter, TD(n), of the symmetric group is the maximum value of dt(π) taken over all permutations of n elements, i.e.,

TD(n) = max_{π ∈ S_n} dt(π).

The transposition diameter of the symmetric group Sn is unknown. Bafna and Pevzner [13] proved an upper bound of 3n/4, which was improved to 2n/3 by the algorithm of Eriksson et al. [54]. A lower bound of ⌊n/2⌋ + 1 is given in [33, 54, 93], where it was conjectured to be the transposition diameter, except for n = 13 and n = 15. Subsequently, Elias and Hartman [48] proved a lower bound of ⌊(n + 1)/2⌋ + 1 on the transposition diameter of the symmetric group, thereby disproving the conjecture of [33, 54, 93]. Furthermore, Elias and Hartman provided bounds on two central subgroups of the symmetric group.

Other Genome Rearrangement Problems

Sorting by Trans-reversals (SBTR) is another biologically motivated problem. The complexity of this problem is still open, and any progress seems to be tightly linked to that of SBT. Gu et al. [69] have given a 2-approximation algorithm for the problem. Moreover, if an additional operation, called rev-rev, is allowed, which swaps two consecutive segments and thereafter reverses both of them, then a 1.5-approximation algorithm is provided by Hartman and Sharan [73]. Further, Eriksen [53] presents an approximation algorithm for a version in which transpositions have weight 2 and reversals weight 1.

In [25], Caprara considered the signed reversal median problem (Median-SBR) and showed that it is NP-hard. In this problem, three signed permutations ~a, ~b, and ~c are given as input, and the problem is to find a fourth permutation ~π minimizing the total distance to the other three, i.e., argmin_~π drev(~π, ~a) + drev(~π, ~b) + drev(~π, ~c). Caprara's result implies that it is NP-hard also to find the most parsimonious tree, and even the most parsimonious ancestral permutations in a given tree. These are
two of the problems mentioned in the beginning of this section. There is, however, a series of results with heuristics for phylogenetic tree reconstruction from gene-order data; see [95] for a review. Most of these heuristics are distance-based tree reconstruction algorithms, such as Neighbor-Joining and the Disk-Covering Method.

In reality, short reversals are more common than long reversals. Therefore, in [117] the problem of sorting by length-weighted reversals is studied. There are also several papers on generalizing the Hannenhalli-Pevzner algorithm to handle multichromosomal genomes [15]. Furthermore, there are a few works on genomes with duplicated genes [34, 67]. Table 2.2 summarizes relevant problems and complexity results.

Problem                        Approx.       Complexity
Unsigned-SBR                   1.5 [12]      NP-hard [24]
Signed-SBR                     1 [71]        O(n^{3/2} √(log n)) [120]
Median-SBR                     2             APX-hard [25]
Signed-SBR with duplications   k [67]        NP-hard [34]
SBT                            1.375 [48]    ?
SBTR                           2.0 [69]      ?
SBTR + rev-rev                 1.5 [73]      ?

Table 2.2: A summary of complexity results for central genome rearrangement problems.

Chapter 3

Present Investigation

In this chapter, we present the results and formalize the problems that this thesis is concerned with. The previous chapter provides an introduction that places the results in context. The aim is that this chapter should be self-contained. Therefore, each section begins with a summary of the main results, followed by formal definitions. The chapter ends with a section containing conclusions and open problems for each of the problem settings.

Common Definitions

Strings are sequences over the binary alphabet Σ = {0, 1} or the DNA alphabet {A, C, G, T}. A phylogenetic tree is an unrooted binary tree T, with vertex set V(T) and edge set E(T), together with a leaf labelling function λ. The leaf labelling function is a bijection between the leaf set L(T) and a set of taxa T. The taxa are represented either by a set of strings, by a set of permutations, or simply by a unique integer/identifier per taxon. A full labelling λ̂ of a phylogeny is a labelling of all nodes of the tree such that the labelling at the leaves is the same as in the leaf labelling, i.e., λ̂(l) = λ(l) for each l ∈ L(T). Further, we denote the edges on the path between two nodes a and b by PT(a, b).

3.1 Multiple Sequence Alignment is NP-hard

The paper Settling the Intractability of Multiple Alignment, Elias [46], is a contribution to the problem of Multiple Alignment; see Section 2.3 for an introduction to the field. The paper settles the intractability of Multiple Alignment, which is one of the most central problems in computational biology. Prior to this paper, not even the most elementary variation of the problem, multiple alignment under the unit metric, had been proved to be hard. The main results of the paper are:

• The SP-score is NP-hard for all metrics over binary or larger alphabets.

• Star Alignment is NP-hard for all metrics over binary or larger alphabets.

• Tree Alignment is NP-hard for all metrics over binary or larger alphabets.

• Consensus Patterns and Substring Parsimony are NP-hard.

Formal Definitions

A pairwise alignment of strings s1 and s2 is a 2 × l matrix A, where rows one and two contain s1 and s2, respectively, interleaved by spaces. Table 3.1 depicts a symmetric symbol distance for the binary alphabet and the space symbol '−'. Symbol distances for which identity, symmetry, and the triangle inequality hold are called symbol metrics. By the cost of A we mean dA(s1, s2) = Σ_{i=1}^{l} µ(r1[i], r2[i]), where r1 and r2 are the rows of A and µ the symbol metric. We call the least such cost, denoted d(s1, s2), the evolutionary distance.

    µ   0   1   −
    0   0   α   β
    1   α   0   γ
    −   β   γ   0

Table 3.1: A symbol distance for the binary alphabet.

The simplest of metrics, the unit metric, is the metric in which all non-zero distances are exactly 1. The cost of a minimum pairwise alignment under the unit metric is also referred to as the edit distance for strings. The edit distance is simply the minimum number of edit operations (substitutions, insertions, and deletions) needed to transform one of the strings into the other.

A multiple alignment of strings s1, . . . , sk is a k × l matrix A where row i contains string si interleaved by spaces. The score of the alignment is given by

dA(s1, . . . , sn) = Σ_{i=1}^{l} d(r1[i], . . . , rn[i]),

where d(r1 [i], . . . , rn [i]) represents the cost of the i’th column. It is not apparent what scoring function to use for columns. Over the years many schemes have been suggested, but the most common variations are the SP-score (sum-of-pairs score), Star Alignment, and Tree Alignment (for a given phylogeny). These schemes have in common that they depend on a symbol distance, but they differ in how the pairwise distances in a column are combined to give a total score for the column. Moreover, as is shown by Elias [46], they are NP-hard to compute for all metric symbol distances. In addition, the reductions can be extended to show that the problems are intractable for the most elementary similarity score.


Sum-of-Pairs Score

The SP-score is probably the most used scheme in practice. In this scoring scheme, each column is scored by the sum of all pairwise distances between the symbols in the column:

d(r1[i], . . . , rn[i]) = Σ_{x=1}^{n} Σ_{y=x}^{n} µ(rx[i], ry[i]).

Alternatively, the SP-score can be seen as summing all pairwise string distances in the alignment matrix:

Σ_{i=1}^{k} Σ_{j=i}^{k} dA(si, sj).

Star Alignment

The Star Alignment scheme scores each column by the distance of each symbol in the column to the consensus symbol of the column:

d(r1[i], . . . , rn[i]) = min_{c ∈ Σ} Σ_{x=1}^{n} µ(c, rx[i]).

Star Alignment can alternatively be formulated as computing a sequence s* minimizing the sum of pairwise distances between s* and the input sequences:

min_{s*} Σ_{i=1}^{k} d(s*, si).

It should be noted that whenever the symbol distance is a metric, the induced evolutionary distance also describes a metric. As a result, Star Alignment can be seen as a Steiner star problem in a metric space. Moreover, it should be mentioned that all approximation algorithms for Star Alignment are based on those for Steiner star problems.

Tree Alignment

In Tree Alignment, the input is a phylogenetic tree T with a leaf labelling function λT assigning sequences to the leaves. The score of a column is defined as the parsimony score, with respect to the symbol distance, in T for the symbols in the column. Alternatively, the objective can be formulated as finding a full labelling λ̂T minimizing the weighted number of edit operations needed to mutate labels at adjacent nodes of the tree:

min_{λ̂T} Σ_{(u,v) ∈ E(T)} d(λ̂T(u), λ̂T(v)).

As with Star Alignment, Tree Alignment can be seen as a Steiner tree problem with fixed topology in a metric space. In Elias [46], Tree Alignment is shown to be NP-hard for all metrics over binary or larger alphabets. Since Tree Alignment has a polynomially bounded optimal solution, the result implies that an FPTAS cannot exist unless P=NP. Thereby, the result closes the complexity of Tree Alignment (a PTAS is presented in [125, 124]).
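The per-column parsimony score can be computed by the classical Sankoff dynamic program on a rooted version of the tree. The nested-tuple tree encoding and helper names below are mine, not the thesis's, and this sketch handles a single column of symbols:

```python
# Sankoff dynamic program: for each vertex, the minimum cost of its subtree
# given each possible symbol at that vertex. A leaf is a single symbol; an
# internal node is a tuple of subtrees.
def sankoff(tree, alphabet, mu):
    if not isinstance(tree, tuple):
        return {c: (0 if c == tree else float('inf')) for c in alphabet}
    cost = {c: 0 for c in alphabet}
    for child in tree:
        child_cost = sankoff(child, alphabet, mu)
        for c in alphabet:
            cost[c] += min(child_cost[d] + mu(c, d) for d in alphabet)
    return cost

def column_score(tree, alphabet, mu=lambda x, y: 0 if x == y else 1):
    # Minimum total mutation cost over all full labelings of the column.
    return min(sankoff(tree, alphabet, mu).values())
```

For example, the tree (('0','0'), ('1','1')) needs exactly one mutation to explain its leaf symbols under the unit metric. Since µ is symmetric, the score does not depend on where the tree is rooted.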


Conserved Regions
The expression of a gene is regulated by transcription factors that bind to short regions on the DNA helix. These regions are called regulatory elements and appear in the vicinity of the gene. Since the regulatory elements are short, insertions and deletions easily disrupt the binding ability of the transcription factors. Therefore, in gene regulation studies, it is of interest to find conserved regions (not allowing insertions and deletions) that are common to a set of sequences. Consensus Patterns and Substring Parsimony, defined below, are natural formalizations of the problem of finding the most conserved region in a star and in a phylogeny, respectively. These problems were first studied in [103, 22, 19]. While the NP-hardness result for Substring Parsimony is new, Consensus Patterns has earlier been proved NP-hard [19, 3] and W[1]-hard [56]. However, my reductions for Star Alignment, Tree Alignment, Consensus Patterns, and Substring Parsimony are all very similar and give an overall insight into the intractability of these problems.

Given a set of strings $S = \{s_1, \dots, s_n\}$ and a number k, Consensus Patterns is the problem of finding a string c of length k and substrings $t_i$ of $s_i$, each of length k, such that the sum of Hamming distances $\sum_{i=1}^{n} d_H(c, t_i)$ is minimized. For a leaf-labeled tree of bounded degree and a number k, a k-substring labeling is a full labeling of the tree such that all internal vertices are labeled with strings of length k and each leaf with a length-k substring of its original leaf label. Substring Parsimony is the problem of finding such a labeling of minimum cost. The cost of the labeling is the sum of all edge lengths, where the length of an edge is the Hamming distance between its two labels.
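For intuition, Consensus Patterns can be solved by brute force over all $|\Sigma|^k$ candidate patterns — exponential in k, consistent with the hardness results above. This toy sketch (not an algorithm from the thesis) assumes the DNA alphabet:

```python
# Brute-force Consensus Patterns: try every length-k pattern c over the
# alphabet and, for each input string, pick its closest length-k substring.
from itertools import product

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def consensus_patterns(strings, k, alphabet="ACGT"):
    best = (float("inf"), None)
    for c in map("".join, product(alphabet, repeat=k)):
        cost = sum(min(hamming(c, s[i:i + k]) for i in range(len(s) - k + 1))
                   for s in strings)
        best = min(best, (cost, c))
    return best

print(consensus_patterns(["ACGTT", "TACGA", "GACGT"], 3))   # (0, 'ACG')
```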

3.2 Approximating Sorting By Transpositions

The paper A 1.375-Approximation Algorithm for Sorting by Transpositions by Elias and Hartman [48] is a contribution to the field of genome rearrangements and in particular to the problem of Sorting By Transpositions; for an introduction see Section 2.4. The paper was the first major contribution to the problem of Sorting By Transpositions in ten years. The main results of the paper are:

• A 1.375-approximation algorithm for Sorting By Transpositions, improving on the ten-year-old 1.5-approximation ratio [13].

• A lower bound for the transposition diameter, $TD(n) \geq \lfloor (n+1)/2 \rfloor + 1$, disproving the conjecture of [33, 54, 93].

• The transposition diameter of simple permutations: $TDS(n) = \lfloor (n+1)/2 \rfloor$.

• An upper bound on the transposition diameter of 3-permutations: $TD3(n) \leq \lfloor 11(n+1)/24 \rfloor + \lfloor \frac{3}{2} ((n+1)/3 \bmod 8) \rfloor + 1 \approx \frac{11}{24} n$.


Formal Definitions
Let $\pi = (\pi_1 \dots \pi_n)$ be a permutation on n elements. A transposition τ on π is an exchange of two disjoint contiguous segments. If the segments are $A = \pi_i, \dots, \pi_{j-1}$ and $B = \pi_j, \dots, \pi_{k-1}$, then the result of applying τ to π, denoted τ · π, is $(\pi_1 \dots \pi_{i-1} \pi_j \dots \pi_{k-1} \pi_i \dots \pi_{j-1} \pi_k \dots \pi_n)$. The problem of finding a shortest sequence of transpositions that transforms a permutation into the identity permutation is called Sorting By Transpositions. The transposition distance of a permutation π, denoted d(π), is the length of a shortest sorting sequence.

The breakpoint graph [13], see Figure 2.9 on page 33, is a graph representation of a permutation. The breakpoint graph of the permutation $\pi = (\pi_1 \dots \pi_n)$ consists of the $2n + 2$ vertices $\{r_0, l_1, r_1, \dots, l_n, r_n, l_{n+1}\}$. Let $\pi_0 = 0$ and $\pi_{n+1} = n + 1$; for every $0 \leq i \leq n$, connect $r_i$ and $l_{i+1}$ by a red arc, and for every $1 \leq i \leq n+1$, connect $l_{\pi_i}$ and $r_{\pi_{i-1}}$ by a black edge $b_i$. The intuitive idea is that the black edges describe the order in the permutation and the red arcs describe the order in the identity permutation. Since the degree of each vertex in the breakpoint graph is exactly 2, the graph decomposes uniquely into cycles. The length of a cycle is the number of black edges it contains. A k-cycle is a cycle of length k, and it is odd if k is odd. The number of odd cycles is denoted $c_{odd}(\pi)$, and $\Delta c_{odd}(\pi, \tau) = c_{odd}(\tau \cdot \pi) - c_{odd}(\pi)$, where τ is a transposition applied to π. Bafna and Pevzner [13] proved that for all permutations π and transpositions τ, $\Delta c_{odd}(\pi, \tau) \in \{-2, 0, 2\}$. Let n(π) denote the number of black edges in the breakpoint graph; a permutation with n elements has n + 1 black edges. The maximum number of cycles is obtained iff π is the identity permutation, whose breakpoint graph contains n(π) cycles, all of them odd (in particular, all of length 1).
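The cycle decomposition is easy to compute: every vertex lies on exactly one black edge and one red arc, so starting from each unvisited black edge and alternately following a black edge and a red arc traces out one cycle. The sketch below is illustrative (not the paper's code); it also includes the Bafna-Pevzner lower bound $d(\pi) \geq (n(\pi) - c_{odd}(\pi))/2$, which follows from $\Delta c_{odd} \in \{-2, 0, 2\}$, and a helper applying a transposition:

```python
def transpose(perm, i, j, k):
    """Exchange the segments perm[i:j] and perm[j:k] (0-based, i < j < k)."""
    return perm[:i] + perm[j:k] + perm[i:j] + perm[k:]

def breakpoint_cycles(perm):
    """Lengths (in black edges) of the cycles of the breakpoint graph."""
    n = len(perm)
    ext = [0] + list(perm) + [n + 1]          # pi_0 = 0, pi_{n+1} = n + 1
    pos = {v: i for i, v in enumerate(ext)}   # position of each value
    seen, lengths = set(), []
    for start in range(1, n + 2):             # black edge b_i at position i
        if start in seen:
            continue
        length, i = 0, start
        while i not in seen:
            seen.add(i)
            length += 1                        # black edge b_i goes from
            v = ext[i - 1]                     #   l_{pi_i} to r_{pi_{i-1}};
            i = pos[v + 1]                     # red arc r_v -- l_{v+1} leads
        lengths.append(length)                 #   to the next black edge
    return lengths

def c_odd(perm):
    return sum(1 for L in breakpoint_cycles(perm) if L % 2 == 1)

def bp_lower_bound(perm):
    """Bafna-Pevzner bound: d(pi) >= (n(pi) - c_odd(pi)) / 2."""
    return (len(perm) + 1 - c_odd(perm)) // 2

print(breakpoint_cycles((1, 2, 3)))   # [1, 1, 1, 1]  (identity: all 1-cycles)
print(bp_lower_bound((3, 2, 1)))      # 2
```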
A k-cycle in the breakpoint graph is called short if k ≤ 3; otherwise, it is called long. A breakpoint graph is simple if it contains only short cycles. A permutation π is simple if its breakpoint graph is simple, and it is a 2-permutation (resp. 3-permutation) if the breakpoint graph contains only 2-cycles (3-cycles). The transposition diameter, TD(n), of the symmetric group is the maximum value of d(π) over all permutations of n elements, i.e.,

$TD(n) \triangleq \max_{\pi : n(\pi) = n} d(\pi).$

Similarly, the transposition diameter of simple permutations (TDS), 2-permutations (TD2), and 3-permutations (TD3) is the longest distance from any such permutation to the identity.

A common technique in the genome rearrangement literature is to transform permutations with long cycles into simple permutations. This transformation consists of inserting new elements into the permutations and thereby splitting the long cycles (see [72] for a thorough description). If $\hat{\pi}$ is the permutation obtained by inserting elements into π, then $d(\pi) \leq d(\hat{\pi})$, since inserting new elements can only result in a permutation that requires more moves to be sorted. Clearly, any sorting


of $\hat{\pi}$ can be used to mimic a sorting of π by simply removing all new elements in $\hat{\pi}$. Such a transformation is called safe if it maintains the lower bound of Bafna and Pevzner, i.e., if $n(\pi) - c_{odd}(\pi) = n(\hat{\pi}) - c_{odd}(\hat{\pi})$. Lin and Xue [89] proved that every permutation can be transformed safely into a simple one. The approximation algorithm given in Elias and Hartman [48] first safely transforms the given permutation π into a simple permutation $\hat{\pi}$, then finds a sorting sequence for $\hat{\pi}$, and, finally, mimics the sorting of $\hat{\pi}$ on π. The sorting found for $\hat{\pi}$ is guaranteed, in every eleven consecutive moves, to use at least eight 2-moves and at most three 0-moves. Thus the algorithm is guaranteed to return a solution with approximation ratio 11/8 = 1.375.

The results in the paper, like many other results in genome rearrangements, are based on rigorous case analyses. However, for the results on 3-permutations, which are the basis of the approximation algorithm, the number of cases is huge. Therefore, we developed a computer program that systematically generates the proof. Each case in the proof is discrete and consists of a few elementary steps that can be verified by hand; thus it is a proof in the conventional mathematical sense. Since it is not practical to manually verify the proof as a whole, a verification program was written, which takes the proof as input and verifies that each elementary step is correct.

3.3 Fast Neighbor Joining

The paper Fast Neighbor Joining by Elias and Lagergren [49] is a contribution to the field of distance matrix methods for tree reconstruction; see Section 2.1 for an introduction. There are two major contributions in the paper:

• An algorithm called Fast Neighbor Joining (FastNJ) with optimal reconstruction radius and optimal runtime complexity O(n²).

• A greatly simplified proof that the Neighbor Joining (NJ) algorithm (see page 16) also has optimal reconstruction radius.

Initial experiments show that the FastNJ algorithm in practice has almost the same accuracy as the NJ algorithm; this indicates that it is the optimal reconstruction radius and other similarities with NJ that give FastNJ its good performance. The paper is one piece in the puzzle of constructing both fast and accurate tree reconstruction algorithms.

Formal Definitions
A distance function is an n × n matrix D of nonnegative real numbers; it is symmetric and for all x it satisfies D[x, x] = 0. A phylogenetic tree T together with an edge length function $w : E(T) \to \mathbb{R}^+$ induces a leaf-to-leaf distance function $D_T^w : L(T)^2 \to \mathbb{R}^+$, where $D_T^w(a, b) \triangleq \sum_{e \in P_T(a,b)} w(e)$.


A distance function D is additive if there is a phylogenetic tree T together with an edge length function w such that $D = D_T^w$; the tree is said to realize D, it is unique, and it is denoted T(D). A distance function D is nearly additive if there is a phylogenetic tree T together with an edge length function w such that $|D - D_T^w|_\infty < \mu(T, w)/2$, where $|\cdot|_\infty$ is the max norm and μ(T, w) the length of the shortest edge; again, the tree is said to realize D and it is unique with respect to the topology T.

The FastNJ algorithm of Elias and Lagergren is a variant of the regular NJ algorithm [107]. It takes a distance function as input and outputs an edge-weighted phylogenetic tree (equivalently, an additive distance function). FastNJ has running time O(n²), i.e., it is linear in the size of the input matrix, while the regular NJ algorithm has running time O(n³). In the paper, the FastNJ algorithm is shown to have optimal reconstruction radius in the sense that: (a) given a nearly additive distance function, it reconstructs the unique tree T, and (b) there can be more than one tree for which $|D - D_T^w|_\infty < \delta$ holds if δ ≥ μ(T, w)/2 (see Figure 2.2). The proof of Elias and Lagergren also provides a significantly simplified proof of the fact that NJ, too, has optimal reconstruction radius.
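For contrast with FastNJ, the classical NJ algorithm of Saitou and Nei [107] can be sketched compactly as below. This is an illustrative O(n³) implementation of the standard Q-criterion pair selection, not the paper's FastNJ; it returns only the topology as nested tuples and omits edge length estimation:

```python
# Classical neighbor joining (topology only), O(n^3): repeatedly merge the
# pair minimizing Q(a, b) = (n - 2) * D[a][b] - r_a - r_b.
def neighbor_joining(D, names):
    """D: symmetric distance matrix (list of lists); returns a nested tuple."""
    nodes = list(names)
    D = [row[:] for row in D]
    while len(nodes) > 2:
        n = len(nodes)
        r = [sum(row) for row in D]            # row sums
        a, b = min(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda p: (n - 2) * D[p[0]][p[1]] - r[p[0]] - r[p[1]])
        # distances from the new internal node u to all remaining nodes
        du = [0.5 * (D[a][k] + D[b][k] - D[a][b]) for k in range(n)]
        new_row = [du[k] for k in range(n) if k not in (a, b)] + [0.0]
        nodes = [x for k, x in enumerate(nodes) if k not in (a, b)] \
                + [(nodes[a], nodes[b])]
        D = [[row[k] for k in range(n) if k not in (a, b)]
             for i, row in enumerate(D) if i not in (a, b)]
        for i, row in enumerate(D):
            row.append(new_row[i])
        D.append(new_row)
    return (nodes[0], nodes[1])

D = [[0, 2, 3, 3], [2, 0, 3, 3], [3, 3, 0, 2], [3, 3, 2, 0]]
print(neighbor_joining(D, ["A", "B", "C", "D"]))   # (('A', 'B'), ('C', 'D'))
```

On an additive (or nearly additive) matrix such as the one above, the Q-criterion provably selects a cherry of the underlying tree in every iteration; FastNJ's contribution is achieving the same reconstruction radius while avoiding the full quadratic pair scan in each of the n iterations.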

3.4 Fast Distance Estimation from Aligned Sequences

The paper Fast Computation of Distance Estimators by Elias and Lagergren [50] is a contribution to the problem of computing estimated pairwise distances between aligned DNA sequences; see page 20 for an introduction. There are two contributions in the paper:

• A fast algorithm for computing the number of mutational events between DNA sequences.

• A new method for estimating pairwise distances when ambiguity symbols are present.

The new algorithm is several hundred times faster than the famous Phylip package. Since pairwise distance estimation is a bottleneck in distance-based phylogeny reconstruction, the new algorithm improves the overall running time of many distance-based methods by a factor of several hundred.

Formal Definitions
The algorithm presented in the paper takes two aligned DNA sequences of length l as input and counts the number of differences between the strings. In particular, it counts the number of purine-transitions, pyrimidine-transitions, and transversions; purine-transitions are differences between purines (A and G), pyrimidine-transitions are differences between pyrimidines (C and T), and transversions are differences between the two groups. The algorithm has the same asymptotic runtime complexity as the straightforward approach of looping through every position, i.e., O(l). However,


the novel algorithm presented in the paper adopts an advanced divide-and-conquer and bit-fiddling strategy that achieves a running time which in practice is several hundred times faster.

Distance methods, a major family of phylogenetic tree reconstruction methods, take as input a distance matrix containing estimated pairwise distances between all pairs of taxa. Distance methods themselves are often fast; e.g., the Fast Neighbor Joining algorithm reconstructs a phylogeny of n taxa in time O(n²). Unfortunately, computing the distance matrix from n sequences of length l can in practice only be done in time l·n². Since the sequence length typically is much larger than the number of taxa, distance estimation is the bottleneck in phylogeny reconstruction. The novel algorithm removes this bottleneck in practice.

Ambiguity symbols are bases of uncertain identity; e.g., the letter R represents either an A or a G. These symbols occur either due to polymorphism within a population or due to failures during sequencing. There is no obvious approach for including ambiguity symbols in distance estimation when the model of evolution is probabilistic. There are two existing techniques: that of Swofford [118] and that of Felsenstein [61]. In the paper by Elias and Lagergren, a new method is presented which is significantly more accurate and leads to faster computations than the previous methods.
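The bit-fiddling idea can be illustrated as follows (a simplified sketch, not the paper's algorithm): encode each base in 2 bits so that the high bit separates purines from pyrimidines. After XOR-ing two packed sequences, a set high bit marks a transversion, and a set low bit with an equal high bit marks a transition, so popcounts over masked words classify many sites at once instead of branching on every position:

```python
# Toy bit-parallel counting of purine-transitions, pyrimidine-transitions,
# and transversions. The 2-bit encoding (high bit = pyrimidine) is an
# assumption made for this sketch.
CODE = {"A": 0b00, "G": 0b01, "C": 0b10, "T": 0b11}

def pack(seq):
    word = 0
    for i, base in enumerate(seq):
        word |= CODE[base] << (2 * i)
    return word

def count_events(s1, s2):
    assert len(s1) == len(s2)
    HI = sum(0b10 << (2 * i) for i in range(len(s1)))   # high-bit mask
    LO = HI >> 1                                        # low-bit mask
    x, y = pack(s1), pack(s2)
    diff = x ^ y
    transversions = bin(diff & HI).count("1")           # group changed
    trans = diff & LO & ~((diff & HI) >> 1)             # low differs, high same
    purine_tr = bin(trans & ((~x & HI) >> 1)).count("1")
    pyrimidine_tr = bin(trans & ((x & HI) >> 1)).count("1")
    return purine_tr, pyrimidine_tr, transversions

print(count_events("AGCT", "GGTT"))   # (1, 1, 0)
```

These three counts are exactly the sufficient statistics needed by standard distance estimators such as Kimura's two-parameter model [83].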

3.5 Ancestral Sequence Reconstruction using Likelihood

The paper Reconstruction of Ancestral Genomic Sequences Using Likelihood by Elias and Tuller [51] is a contribution to the problem of computing ancestral sequences using the Ancestral Maximum Likelihood (AML) criterion; see page 23 for an introduction. The major contributions of the paper are:

• A fixed-parameter tractable (FPT) algorithm for both Small and Big AML.

• A 2-approximation for Big AML.

• A PTAS for Small AML.

• A proof that AML is an inconsistent estimator of the phylogenetic tree.

This work is a step toward developing approximation algorithms and investigating the properties of the AML criterion.

Formal Definitions
The simplest probabilistic model of sequence evolution assumes that sites evolve independently and identically from the root down toward the leafs. Each edge in the tree has an associated probability of change, which specifies with what probability a site changes. We denote the edge probability for the edge e by p(e) and make the natural restriction 0 ≤ p(e) < 1/2. In general, if two sequences of length m have


Hamming distance h and the change probability between these two sequences is p, then the likelihood of observing two sequences as divergent as these is given by $p^h \cdot (1-p)^{m-h}$. Hence, the likelihood of observing a fully labeled tree $(T, \hat{\lambda})$ with edge probability function p is

$L(T, \hat{\lambda}, p) = \prod_{e \in E(T)} p(e)^{h(e)} \cdot (1 - p(e))^{m - h(e)},$

where h(e) is the Hamming distance of the two labels incident with the edge. Moreover, since the most likely edge probabilities are given by the Hamming distances of the labels, we have

$\max_p L(T, \hat{\lambda}, p) = \prod_{e \in E(T)} \left( \frac{h(e)}{m} \right)^{h(e)} \cdot \left( 1 - \frac{h(e)}{m} \right)^{m - h(e)}.$

ˆ p) by As a consequence, we exclude the edge probabilities and denote maxp L(T, λ, ˆ and the likelihood of the edge e by L(h(e)). This leads us to the definitions L(T, λ) of the two problems we are concerned with: • Given a set of sequences T the (Big) Ancestral Maximum Likelihood (AML) problem is that of finding the tree T together with the most likely full labelling ˆ such that each leaf is uniquely labeled by one of the sequences in T , i.e., λ ˆ max ˆ L(T, λ). (T,λ)

• Given a phylogenetic tree (T, λ), the Small Ancestral Maximum Likelihood (Small AML) problem is that of finding the most likely full labeling, i.e., $\max_{\hat{\lambda}} L(T, \hat{\lambda})$.
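Given a candidate full labeling, the maximized likelihood above depends only on the per-edge Hamming distances, so it is cheap to evaluate. A small sketch (illustrative only; the edge-list and label-dictionary representations are assumptions, and the computation is done in log space for numerical stability):

```python
# Evaluate max_p L(T, lambda_hat, p) for a fully labeled tree via the closed
# form: each edge contributes (h/m)^h * (1 - h/m)^(m - h), with h the Hamming
# distance of its endpoint labels.
import math

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def log_aml_likelihood(edges, labels):
    """edges: list of (u, v) node ids; labels: node id -> sequence."""
    m = len(next(iter(labels.values())))
    total = 0.0
    for u, v in edges:
        h = hamming(labels[u], labels[v])
        if 0 < h < m:
            total += h * math.log(h / m) + (m - h) * math.log(1 - h / m)
        # h == 0 or h == m contributes log 1 = 0 (0^0 taken as 1)
    return total

demo = log_aml_likelihood([("r", "a"), ("r", "b")],
                          {"r": "AC", "a": "AC", "b": "AG"})
print(round(demo, 4))   # -1.3863  (i.e., -2 ln 2)
```

Small AML asks for the internal labels maximizing this quantity on a fixed tree; Big AML additionally searches over tree topologies.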

3.6 Conclusions and Open Problems

In Elias [46], multiple alignment with SP-score, Star Alignment, and Tree Alignment are shown to be NP-hard. Prior to the paper, the problems were referred to as NP-hard even though the most elementary variants had not been proved hard. The paper closes the complexity of Tree Alignment (a PTAS [125, 124] exists and an FPTAS cannot exist unless P=NP). However, the best approximation algorithms [11] for SP-score and Star Alignment have approximation ratio ≈ 2, and it is an open problem to improve this ratio and to give inapproximability bounds. Further, more accurate heuristics are desirable, as well as exact algorithms, as in Althaus et al. [5], that can handle bigger instances.

Elias and Hartman [48] present the first significant advancement on the Sorting By Transpositions (SBT) problem in ten years. The big open problem in the genome rearrangement field is to close the complexity of SBT. Moreover, there is the problem of determining the transposition diameter of the symmetric group. The complexity of Sorting By Trans-reversals (SBTR) is also open. It seems likely that


the approach used in Elias and Hartman to achieve the 1.375 approximation ratio for SBT should be applicable also to SBTR. Another interesting problem is to determine the expected reversal distance under the Nadeau-Taylor model [96]. Furthermore, good heuristics for reconstructing parsimonious trees when all different types of genome rearrangements are allowed would be interesting.

The big question with regard to distance methods is, as always, to find a fast and accurate reconstruction algorithm. In my opinion there is room for significantly more accurate heuristics. This does not only mean inventing new algorithms; it also includes understanding under what conditions current algorithms perform well and when they perform less well (or even badly). The papers [7, 94, 49, 68, 66, 18, 77, 85] (and many more) are all pieces in this puzzle.

There is a series of open problems for likelihood-based character methods for tree reconstruction. The most interesting result would be an approximation algorithm for the NP-hard ML criterion [32]. Another is to settle the tractability or intractability of Small ML. Similarly, Big AML is known to be NP-hard [1], while it remains to categorize the complexity of Small AML. Furthermore, branch-and-bound algorithms for finding optimal solutions would be interesting.

Bibliography

[1] L. Addario-Berry, B. Chor, M. Hallett, J. Lagergren, A. Panconesi, and T. Wareham. Ancestral maximum likelihood of evolutionary trees is hard. J. of Bioinf. and Comp. Biology, 2(2):257–271, 2004.

[2] R. Agarwala, V. Bafna, M. Farach, M. Paterson, and M. Thorup. On the approximability of numerical taxonomy (fitting distances by tree metrics). SIAM Journal on Computing, 28(3):1073–1085, 1999.

[3] T. Akutsu. Hardness results on gapless local multiple sequence alignment. Technical Report 98-MPS-24-2, 1998.

[4] B. Alberts, A. Johnson, J. Lewis, M. Raff, K. Roberts, and P. Walter. Molecular Biology of the Cell. Garland Science, 2002.

[5] E. Althaus, A. Caprara, H. Lenhof, and K. Reinert. A branch-and-cut algorithm for multiple sequence alignment. Math. Program., 105(2-3):387–425, 2006.

[6] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. J. Mol. Biol., 215:403–410, 1990.

[7] K. Atteson. The performance of neighbor-joining methods of phylogenetic reconstruction. Algorithmica, 25, 1999.

[8] G. Ausiello, P. Crescenzi, G. Gambosi, V. Kann, A. Marchetti-Spaccamela, and M. Protasi. Complexity and Approximation. Springer, 1999.

[9] D. Bacon and W. Anderson. Multiple sequence alignment. Journal of Mol. Bio., pages 153–161, 1986.

[10] D.A. Bader, B.M.E. Moret, and M. Yan. A linear-time algorithm for computing inversion distance between signed permutations with an experimental study. Journal of Computational Biology, 8(5):483–491, 2001.

[11] V. Bafna, E. L. Lawler, and P. Pevzner. Approximation algorithms for multiple sequence alignment. Theoretical Computer Science, 182, 1997.

[12] V. Bafna and P. A. Pevzner. Genome rearrangements and sorting by reversals. SIAM Journal on Computing, 25(2):272–289, 1996.

[13] V. Bafna and P. A. Pevzner. Sorting by transpositions. SIAM Journal on Discrete Mathematics, 11(2):224–240, May 1998. Preliminary version in the Proceedings of SODA, 1995.

[14] A. Bergeron. A very elementary presentation of the Hannenhalli-Pevzner theory. Discrete Applied Mathematics, 146:134–145, 2005.

[15] A. Bergeron, J. Mixtacki, and J. Stoye. On sorting by translocations. Journal of Computational Biology, 13(2):567–578, 2006.

[16] P. Berman, S. Hannenhalli, and M. Karpinski. 1.375-approximation algorithm for sorting by reversals. In Proc. of 10th European Symposium on Algorithms (ESA'02), pages 200–210. Springer, 2002. LNCS 2461.

[17] P. Berman and M. Karpinski. On some tighter inapproximability results. In Proceedings of the 26th ICALP. Springer, 1999.

[18] V. Berry and D. Bryant. Faster reliable phylogenetic analysis. In 3rd International Conference on Computational Molecular Biology (RECOMB), pages 59–68, 1999.

[19] M. Blanchette, B. Schwikowski, and M. Tompa. Algorithms for phylogenetic footprinting. Journal of Computational Biology, 9(2):211–223, 2002.

[20] P. Bonizzoni and G. D. Vedova. The complexity of multiple sequence alignment with SP-score that is a metric. Theoretical Computer Science, 259(1–2):63–79, 2001.

[21] W.J. Bruno, N.D. Socci, and A.L. Halpern. Weighted neighbor joining: A likelihood-based approach to distance-based phylogeny reconstruction. Molecular Biology and Evolution, 17(1):189–197, 2000.

[22] J. Buhler and M. Tompa. Finding motifs using random projections. Journal of Computational Biology, 9(2):225–242, 2002.

[23] P. Buneman. The recovery of trees from measurements of dissimilarity. Mathematics in the Archaeological and Historical Sciences, pages 387–395, 1971.

[24] A. Caprara. Sorting by reversals is difficult. In Proceedings of the First International Conference on Computational Molecular Biology, pages 75–83, New York, January 19–22, 1997. ACM Press.

[25] A. Caprara. The reversal median problem. INFORMS Journal on Computing, 15(1):93–113, 2003.

[26] H. Carrillo and D. Lipman. The multiple sequence alignment problem in biology. SIAM Journal on Applied Math., 48:1073–1082, 1988.

[27] L. L. Cavalli-Sforza and A. W. F. Edwards. Phylogenetic analysis: Models and estimation procedures. American J. of Human Genetics, 19:233–257, 1967.

[28] J.A. Cavender. Taxonomy with confidence. Math. Biosci., 40:271–280, 1978.

[29] R. Chakraborty. Estimation of time of divergence from phylogenetic studies. Canadian Journal of Genetics and Cytology, 19:217–223, 1977.

[30] J.T. Chang. Full reconstruction of Markov models on evolutionary trees: Identifiability and consistency. Mathematical Biosciences, 137:51–73, 1996.

[31] J.T. Chang. Inconsistency of evolutionary tree topology reconstruction methods when substitution rates vary across characters. Mathematical Biosciences, 134:189–215, 1996.

[32] B. Chor and T. Tuller. Maximum likelihood of evolutionary trees is hard. RECOMB, pages 296–310, 2005.

[33] D. A. Christie. Genome Rearrangement Problems. PhD thesis, University of Glasgow, 1999.

[34] D.A. Christie and R.W. Irving. Sorting strings by reversals and by transpositions. SIAM Journal on Discrete Mathematics, 14:193–206, 2001.

[35] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. MIT Press, 1990.

[36] M. Crochemore, G. Landau, and M. Ziv-Ukelson. A subquadratic sequence alignment algorithm for unrestricted scoring matrices. SIAM Journal on Computing, 32, 2003.

[37] M. Csűrös. Fast recovery of evolutionary trees with thousands of nodes. In RECOMB-01, pages 104–113, 2001.

[38] C. Daskalakis, E. Mossel, and S. Roch. Optimal phylogenetic reconstruction. In Annual ACM Symposium on Theory of Computing (STOC), pages 159–168, 2006.

[39] W.H.E. Day. Computational complexity of inferring phylogenies from dissimilarity matrices. Bulletin of Mathematical Biology, 49:461–467, 1986.

[40] M. Dayhoff, R. Schwartz, and B. Orcutt. A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure, 5:345–358, 1978.


[41] R. Desper and O. Gascuel. Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle. Journal of Computational Biology, 9(5):687–705, 2002.

[42] T. Dobzhansky and A. Sturtevant. Inversions in the chromosomes of Drosophila pseudoobscura. Genetics, 1938.

[43] A. W. F. Edwards and L. L. Cavalli-Sforza. The reconstruction of evolution. Annals of Human Genetics, 27:105–106, 1963.

[44] A. W. F. Edwards and L. L. Cavalli-Sforza. Reconstruction of evolutionary trees. Phenetic and Phylogenetic Classification, pages 67–76, 1964.

[45] I. Elias. Settling the intractability of multiple alignment. In Proc. of the 14th Ann. Int. Symp. on Algorithms and Computation (ISAAC'03), volume 2906 of Lecture Notes in Computer Science, pages 352–363. Springer-Verlag, November 2003.

[46] I. Elias. Settling the intractability of multiple alignment. Journal of Computational Biology, 13(7):1323–1339, September 2006.

[47] I. Elias and T. Hartman. A 1.375-approximation algorithm for sorting by transpositions. In Proc. of the 5th International Workshop on Algorithms in Bioinformatics (WABI'05), volume 3692 of Lecture Notes in Computer Science, pages 204–214. Springer-Verlag, October 2005.

[48] I. Elias and T. Hartman. A 1.375-approximation algorithm for sorting by transpositions. To appear in IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2006.

[49] I. Elias and J. Lagergren. Fast neighbor joining. In Proc. of the 32nd International Colloquium on Automata, Languages and Programming (ICALP'05), volume 3580 of Lecture Notes in Computer Science, pages 1263–1274. Springer-Verlag, July 2005.

[50] I. Elias and J. Lagergren. Fast computation of distance estimators. Submitted, 2006.

[51] I. Elias and T. Tuller. Reconstruction of ancestral genomic sequences using likelihood. Submitted, 2006.

[52] P.L. Erdös, M.A. Steel, L.A. Szekely, and T.J. Warnow. A few logs suffice to build (almost) all trees (I). Random Structures & Algorithms, 14:153–184, 1999.

[53] N. Eriksen. (1+ε)-approximation of sorting by reversals and transpositions. Theoretical Computer Science, 289:517–529, 2002.

[54] H. Eriksson, K. Eriksson, J. Karlander, L. Svensson, and J. Wästlund. Sorting a bridge hand. Discrete Mathematics, 241(1-3):289–300, 2001.

[55] J.S. Farris. A probability model for inferring evolutionary trees. Syst. Zool., 22:250–256, 1973.

[56] M. Fellows, J. Gramm, and R. Niedermeier. On the parameterized intractability of CLOSEST SUBSTRING and related problems. In Proc. of 19th STACS, volume 2285, pages 262–273, 2002.

[57] J. Felsenstein. Cases in which parsimony or compatibility methods will be positively misleading. Syst. Zool., 22:240–249, 1978.

[58] J. Felsenstein. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution, 17:368–376, 1981.

[59] J. Felsenstein. Phylip (phylogeny inference package) version 3.5c. Technical report, Department of Genetics, University of Washington, Seattle, 1993.

[60] J. Felsenstein. An alternating least squares approach to inferring phylogenies from pairwise distances. Systematic Biology, 46:101–111, 1997.

[61] J. Felsenstein. Inferring Phylogenies. Sinauer Associates, 2001.

[62] D. Feng and R. Doolittle. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol., 25:351–360, 1987.

[63] W.M. Fitch. Toward defining the course of evolution: minimum change for a specified tree topology. Systematic Zoology, 20:406–416, 1971.

[64] W.M. Fitch and E. Margoliash. Construction of phylogenetic trees. Science, 155:279–284, 1967.

[65] O. Gascuel. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Molecular Biology and Evolution, 14:685–695, 1997.

[66] O. Gascuel. Concerning the NJ algorithm and its unweighted version, UNJ. American Mathematical Society, pages 149–170, 1997.

[67] A. Goldstein, P. Kolman, and J. Zheng. Minimum common string partition problem: Hardness and approximations. In Algorithms and Computation, 15th International Symposium, ISAAC 2004, volume 3341 of Lecture Notes in Computer Science, pages 484–495. Springer, 2004.

[68] I. Gronau and S. Moran. Neighbor joining algorithms for inferring phylogenies via LCA-distances. Submitted, 2006.

[69] Q. P. Gu, S. Peng, and H. Sudborough. A 2-approximation algorithm for genome rearrangements by reversals and transpositions. Theoretical Computer Science, 210(2):327–339, 1999.


[70] D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, UK, 1997.

[71] S. Hannenhalli and P. Pevzner. Transforming cabbage into turnip: Polynomial algorithm for sorting signed permutations by reversals. Journal of the ACM, 46:1–27, 1999.

[72] T. Hartman and R. Shamir. A simpler and faster 1.5-approximation algorithm for sorting by transpositions. Information and Computation, 204(2):275–290, February 2006.

[73] T. Hartman and R. Sharan. A 1.5-approximation algorithm for sorting by transpositions and transreversals. Journal of Computer and System Sciences, 70:300–320, 2005.

[74] S. Henikoff and J. Henikoff. Amino acid substitution matrices from protein blocks. PNAS, 89:10915–10919, 1992.

[75] S. B. Hoot and J. D. Palmer. Structural rearrangements, including parallel inversions, within the chloroplast genome of Anemone and related genera. J. Molecular Evolution, 38:274–281, 1994.

[76] T. Hubbard, A. Lesk, and A. Tramontano. Gathering them into the fold. Nature Structural Biology, 4:313, 1996.

[77] D.H. Huson, S. Nettles, and T. Warnow. Disk-covering, a fast-converging method for phylogenetic tree reconstruction. Journal of Computational Biology, 6(3/4):369–386, 1999.

[78] T.H. Jukes and C.R. Cantor. Evolution of protein molecules. Mammalian Protein Metabolism, pages 21–132, 1969.

[79] W. Just. Computational complexity of multiple sequence alignment with SP-score. Journal of Computational Biology, 8(6):615–623, 2001.

[80] H. Kaplan, R. Shamir, and R. E. Tarjan. Faster and simpler algorithm for sorting signed permutations by reversals. SIAM Journal of Computing, 29(3):880–892, 2000.

[81] J. Kececioglu and D. Sankoff. Exact and approximation algorithms for sorting by reversals, with application to genome rearrangement. Algorithmica, 13(1/2):180–210, January 1995. A preliminary version appeared in Proc. CPM 93, Springer, Berlin, 1993, pages 87–105.

[82] K.K. Kidd and L.A. Sgaramella-Zonta. Phylogenetic analysis: Concepts and methods. American Journal of Human Genetics, 23:235–252, 1971.

[83] M. Kimura. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16:111–120, 1980.

[84] H. Kishino and M. Hasegawa. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. Journal of Molecular Evolution, 29:170–179, 1989.

[85] J. Lagergren. Combining polynomial running time and fast convergence for the disk-covering method. Journal of Computer and System Sciences, 65, 2002.

[86] E. S. Lander, R. Langridge, and D. M. Saccocio. Mapping and interpreting biological information. Commun. ACM, 34(11):32–39, 1991.

[87] V. Levenshtein. Binary codes capable of correcting spurious insertions and deletions of ones. Problems of Information Transmission, 1:8–17, 1965.

[88] M. Li, B. Ma, and L. Wang. Finding similar regions in many strings. In Proceedings of the Thirty-First Annual ACM Symp. on Theory of Computing, pages 473–482, May 1999.

[89] G. H. Lin and G. Xue. Signed genome rearrangements by reversals and transpositions: Models and approximations. Theoretical Computer Science, 259:513–531, 2001.

[90] D. J. Lipman and W. R. Pearson. Rapid and sensitive protein similarity searches. Science, 227:1435–1441, 1985.

[91] V. Makarenkov and B. Leclerc. The fitting of a tree metric to a given dissimilarity with the weighted least squares criterion. Journal of Classification, 16:3–26, 1999.

[92] W. Masek and M. Paterson. A faster algorithm computing string edit distances. Journal of Computer and System Sciences, 20, 1980.

[93] J. Meidanis, M. E. Walter, and Z. Dias. Transposition distance between a permutation and its reverse. In Proceedings of the 4th South American Workshop on String Processing, pages 70–79, 1997.

[94] R. Mihaescu, D. Levy, and L. Pachter. Why neighbor-joining works. Submitted, 2006.

[95] B.M.E. Moret, J. Tang, and T. Warnow. Reconstructing phylogenies from gene-content and gene-order data. In Mathematics of Evolution and Phylogeny. Oxford Univ. Press, 2005.


BIBLIOGRAPHY

[96] J. Nadeau and B. Taylor. Lengths of chromosomal segments conserved since divergence of man and mouse. Proceedings of the National Academy of Sciences USA, 1984.
[97] S. Needleman and C. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48:443–453, 1970.
[98] J. Neyman. Molecular studies of evolution: a source of novel statistical problems. In Statistical Decision Theory and Related Topics, pages 1–27, 1971.
[99] U.S. Department of Energy. Human Genome Program. http://www.ornl.gov/hgmis.
[100] J. D. Palmer and L. A. Herbon. Tricircular mitochondrial genomes of Brassica and Raphanus: reversal of repeat configurations by inversion. Nucleic Acids Research, 14:9755–9764, 1986.
[101] J. D. Palmer and L. A. Herbon. Unicircular structure of the Brassica hirta mitochondrial genome. Current Genetics, 11:565–570, 1987.
[102] J. D. Palmer and L. A. Herbon. Plant mitochondrial DNA evolves rapidly in structure, but slowly in sequence. Journal of Molecular Evolution, 28:87–97, 1988.
[103] P. A. Pevzner and S. Sze. Combinatorial approaches to finding subtle signals in DNA sequences. In The Annual Meeting of the International Society for Computational Biology (ISMB), pages 269–278, 2000.
[104] T. Pupko, I. Pe'er, R. Shamir, and D. Graur. A fast algorithm for joint reconstruction of ancestral amino acid sequences. Molecular Biology and Evolution, 17(6):890–896, 2000.
[105] A. Rzhetsky and M. Nei. Theoretical foundations of the minimum-evolution method of phylogenetic inference. Molecular Biology and Evolution, 10:1073–1095, 1993.
[106] S. Guindon and O. Gascuel. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Systematic Biology, 52(5):696–704, 2003.
[107] N. Saitou and M. Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4:406–425, 1987.
[108] D. Sankoff, R. Cedergren, and Y. Abel. Genomic divergence through gene rearrangement. Methods in Enzymology, 183:428–438, 1990.
[109] D. Sankoff. Edit distance for genome comparison based on non-local operations. Lecture Notes in Computer Science, 644:121–135, 1992.

[110] S. Sattath and A. Tversky. Additive similarity trees. Psychometrika, 42:319–345, 1977.
[111] J. S. Sim and K. Park. The consensus string problem for a metric is NP-complete. In Proceedings of the 10th Australasian Workshop on Combinatorial Algorithms, pages 107–113, 1999.
[112] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195–197, 1981.
[113] P. H. A. Sneath and R. R. Sokal. Numerical Taxonomy: The Principles and Practice of Numerical Classification. W. H. Freeman, San Francisco, 1973.
[114] R. R. Sokal and C. D. Michener. A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin, 38:1409–1438, 1958.
[115] M. Steel and D. Penny. Parsimony, likelihood, and the role of models in molecular phylogenetics. Molecular Biology and Evolution, 17(6):839–850, 2000.
[116] J. A. Studier and K. J. Keppler. A note on the neighbor-joining algorithm of Saitou and Nei. Molecular Biology and Evolution, 5:729–731, 1988.
[117] F. Swidan, M. A. Bender, D. Ge, S. He, H. Hu, and R. Pinter. Sorting by length-weighted reversals: Dealing with signs and circularity. In CPM: 15th Symposium on Combinatorial Pattern Matching, 2004.
[118] D. L. Swofford. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. Sinauer Associates, 2002.
[119] K. Tamura and M. Nei. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Molecular Biology and Evolution, 10:512–526, 1993.
[120] E. Tannier and M. Sagot. Sorting by reversals in subquadratic time. In Proceedings of the 15th Annual Symposium on Combinatorial Pattern Matching (CPM '04), pages 1–13. Springer, 2004.
[121] J. Thompson, D. Higgins, and T. Gibson. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673–4680, 1994.
[122] I. Ulitsky, D. Burstein, T. Tuller, and B. Chor. The average common substring approach to phylogenomic reconstruction. Journal of Computational Biology, 13(2):336–350, 2006.
[123] L. Wang and T. Jiang. On the complexity of multiple sequence alignment. Journal of Computational Biology, 1:337–348, 1994.


[124] L. Wang, T. Jiang, and D. Gusfield. A more efficient approximation scheme for tree alignment. SIAM Journal on Computing, 30(1):283–299, February 2001.
[125] L. Wang, T. Jiang, and E. L. Lawler. Approximation algorithms for tree alignment with a given phylogeny. Algorithmica, 16(3):302–315, September 1996.
[126] M. Waterman, editor. Mathematical Methods for DNA Sequences. CRC Press, 1989.