Phylogenetic Inference using Genetic Algorithm-based Least Squares Methods

Aarhus Universitet Datalogisk Institut Åbogade 34 8200 Århus N

December 20, 2006 Final Version

Phylogenetic Inference using Genetic Algorithm-based Least Squares Methods

Steffen Wang Fischer and Mads Peter Lindberg

Supervisor: Christian Nørgaard Storm Pedersen

There is no theory of evolution, just a list of creatures Chuck Norris allows to live. Chuck Norris Facts

In the beginning the Universe was created. This has made a lot of people very angry and has been widely regarded as a bad move. Douglas Adams

I like pigs. Dogs look up to us. Cats look down on us. Pigs treat us as equals. Sir Winston Churchill

© 2006 Steffen Wang Fischer & Mads Peter Lindberg

The University of Aarhus is authorized to publish this Master's thesis. The thesis has been typeset in 12pt Computer Modern. Layout and typography by the authors, assisted by LaTeX. Figures made with Xfig, and graphs with gnuplot and R.

Abstract

The only illustration in Darwin's On the Origin of Species shows a phylogenetic tree. This type of tree is used to represent the evolutionary relationship between different species. Phylogenetic inference is the construction of such a tree based on the measured pairwise distances between all species. There exist many methods for evaluating how well a particular tree fits the measurements. One of these is called the least squares method, in which we wish to minimize the squared error between the measured distance and the distance in the tree, summed over all pairs of species. Because the search space of trees is huge, it is not possible to examine all possibilities; it is necessary to utilize a search heuristic. We have chosen to use genetic algorithms (GAs), which maintain and modify a population of trees. GAs have been used in several articles on similar problems, but we find that they are often applied superficially. For instance, little time is usually spent on choosing the GA's operators, the initial population, or the parameters of the GA. In this thesis we have therefore implemented a genetic algorithm to solve the phylogeny problem with respect to least squares, and we have examined how the initial population and in particular the parameters influence the efficiency of the GA. Our experiments have shown that a self-adaptive approach gives good results without the need for time-consuming tuning of parameters, and that the GA can be sped up by basing the initial population on a faster heuristic. In addition, our GA has been able to find better results than the neighbor joining algorithm.

Resumé

The only illustration in Darwin's On the Origin of Species (in Danish, Om Arternes Oprindelse) shows a phylogenetic tree. This type of tree is used to represent the evolutionary relationship between different species. Phylogenetic inference is about constructing such a tree from the measured pairwise distances between the species. Many methods exist for judging how well a given tree fits the biological measurements. One of these is called the least squares method, in which one wishes to minimize the squared deviation, over all pairs of species, between the measured distance and the distance in the tree. Because the search space of trees is enormous, it is not possible to examine every possibility; it is necessary to use a search heuristic. We have chosen to study genetic algorithms (GAs), which maintain and manipulate a population of trees. GAs have been applied to similar problems in several articles, but in our eyes this application is often superficial. For example, little time is usually spent on the choice of the GA's operators, on the initial population, or on how effective parameters for the GA are chosen. In this thesis we have therefore implemented a genetic algorithm to solve the phylogeny problem with respect to the least squares method, and we have examined how the initial population and especially the choice of parameters affect the efficiency of the GA. Our experiments show that self-adaptive parameter control gives good results without a need for time-consuming tuning, and that the GA can be made faster by basing its initial population on a faster heuristic. Furthermore, our GA has been able to find better results than the neighbor joining algorithm.


Contents

1 Introduction
  1.1 Thesis Outline

2 Background and Definitions
  2.1 Phylogenetic Trees
  2.2 Distance Methods
  2.3 Optimization
  2.4 Evolutionary Algorithms

3 Representing Tree Topologies
  3.1 Data Structure
  3.2 Estimation of Branch-Lengths

4 Methods for Phylogenetic Inference
  4.1 Branch and Bound
  4.2 Simulated Annealing
  4.3 Genetic Algorithms
  4.4 Neighbor Joining Method

5 Implementing the Genetic Algorithm
  5.1 Initial Population
  5.2 Selection Operators
  5.3 Recombination Operators
  5.4 Mutation Operators
  5.5 Summary

6 Controlling the Parameters of the GA
  6.1 Parameters of a Genetic Algorithm
  6.2 Non-adaptive Parameter Control
  6.3 Adaptive Parameter Control

7 Experiments and Results
  7.1 Parameter Tuning
  7.2 Measure-based Parameter Control
  7.3 Self-adaptive Parameter Control
  7.4 Summary

8 Future Work
  8.1 Tree Topology Search Space
  8.2 Mutation Strength
  8.3 Parallel Processing and Multi-population GAs
  8.4 Quality Criterion

9 Conclusion

A Algorithms
B Evolutionary Framework
C Parameter Control Taxonomy
D Distance Matrices

Bibliography

Chapter 1

Introduction

Without any particular knowledge of biology, one would expect that humans are in some sense more closely related to gorillas than to, e.g., penguins. In biology, phylogenetics is the classification of organisms based on how closely related they are in terms of evolutionary differences. One way of showing this kind of evolutionary interrelationship between different species believed to have a common ancestor is with phylogenetic trees. A rooted phylogenetic tree describing the proposed relationship is shown in figure 1.1.


Figure 1.1: A rooted phylogenetic tree with no branch-lengths showing an evolutionary relationship between humans, gorillas and penguins. It is indicated that humans and gorillas are the most closely related of the three, since they are the immediate children of the same common ancestor in the tree.

The phylogeny problem is then formulated as finding the phylogenetic tree that, under a certain optimality criterion, best represents the evolutionary history of a collection of species [4]. One such criterion is least squares (sec. 2.2.1), which is a so-called distance-based approach. In these approaches, the evolutionary distances between all species are estimated, and the optimal phylogenetic tree with respect to the particular criterion is constructed. One major problem with the phylogeny problem is that in order to find the optimal solution, one effectively has to evaluate all possible tree topologies. However, even for a small number of species, the size of the tree space makes such an approach infeasible (sec. 2.1.1). To illustrate this, assume that the least squares score of


one hundred trees can be calculated in one second. It would then take more than two trillion years to test all possible trees with twenty species. Solving the phylogeny problem is NP-complete [12], and therefore requires some kind of search heuristic—a method which might not always find the best solution, but does find a good solution in reasonable time. In this thesis we describe and discuss the implementation of a genetic algorithm (abbreviated GA, see sec. 2.4) to solve the phylogeny problem with respect to least squares. The GA is a population-based search technique inspired by evolutionary biology, and it has been used in several articles on similar problems. In our opinion, however, GAs are often used superficially; e.g. little time is usually spent on the choice of the operators, the initial population or the parameters of the GA. In this thesis we therefore address some of these shortcomings, and examine how the initial population and in particular the parameters influence the efficiency in terms of finding a good solution compared to the neighbor joining algorithm (sec. 4.4). This includes comparing knowledge-based initial populations to the standard random-based approach, tuning of parameters, and examining other parameter control methods like self-adaptation. Originally our plan was to implement traditional methods and test these against a GA implementation. This would give us a good idea of the performance of our GA—both in terms of finding good solutions and running time. We abandoned this approach, however; primarily because we shifted our focus, but also because we were unable to find enough existing implementations to test against. The focus change was triggered by the lack of attention to parameter setting, as already mentioned. This thesis has been written at the Bioinformatics Research Center (BiRC), which explains the biological motivation of the project.
Note, however, that both authors have only a very limited background in biology, and that the emphasis will therefore be on computer science.

1.1 Thesis Outline

We start by introducing important terminology and motivating the use of search heuristics for the phylogeny problem (chap. 2). Various data structures used to represent phylogenetic trees are discussed (chap. 3), and we mention a few traditional approaches to inferring these trees (chap. 4). One method is presented in great detail, namely the genetic algorithm (chap. 5). We examine different ways of controlling the parameters of the genetic algorithm (chap. 6). Experiments based on these examinations are presented, and the results are discussed (chap. 7). We suggest some ideas for future work (chap. 8), and we finally conclude and reflect on the thesis (chap. 9).

Chapter 2

Background and Definitions

In this chapter we introduce phylogenetic trees and distance methods in sections 2.1 and 2.2, respectively. The focus of this thesis is solving the phylogeny problem with respect to one of these methods, namely least squares. In order to do this we present the terminology of general optimization in section 2.3 and of evolutionary algorithms in section 2.4.

2.1 Phylogenetic Trees

Some of the following definitions are inspired by Farach et al. [6].

Definition 2.1 (Topology). The branching pattern of a tree is called the topology of the tree.

In other words, it is required that the single path in the tree between two species corresponds to a unique set of internal nodes. The branching pattern of a tree is the same whether it is rooted or unrooted. As we will introduce later, the least squares method only considers the topology of a tree.

Definition 2.2 (Phylogenetic tree). A phylogenetic tree of a set of species S is an unrooted binary tree in which the leaves are labeled by the species in S, and the internal nodes represent the ancestors of the species.

In some texts phylogenetic trees are defined as multifurcating rooted trees. Our definition is more restrictive, but for the applications of this thesis it will suffice. In fact, it is possible to represent a multifurcating rooted tree by an unrooted binary tree, as demonstrated in figure 2.1. However, some information is lost in the conversion; namely, the root node disappears, and we have to make a choice regarding the topology when transforming multifurcating nodes.

Definition 2.3 (Additive matrix). A distance matrix M is called additive if it is possible to construct a branch-weighted tree T in which the sum of edge-lengths on the path between each pair of species matches M perfectly.

(a) Rooted trees can be thought of as unrooted trees by simply removing the root node. The length of the branch to leaf a is updated accordingly.

(b) Multifurcating trees can be thought of as binary trees with some branch-lengths set to zero (or some very small ε > 0).

Figure 2.1: Representing a multifurcating rooted tree as an unrooted binary tree preserves inter-species distances.

As we will discuss later in chapter 4, there exist methods for solving the phylogeny problem given an additive distance matrix. A more restrictive property is called ultrametric, and this type of matrix can be solved even faster.

Definition 2.4 (Ultrametric tree). An edge-weighted tree T is called ultrametric if it can be rooted in such a way that the sums of edge-lengths of all root-leaf paths in the tree are equal.

Definition 2.5 (Ultrametric matrix). An additive matrix M is said to be ultrametric if it represents the inter-leaf distances of an ultrametric tree.

Note that an ultrametric matrix is also additive. The notion of ultrametric trees and matrices is interesting from a biological point of view, as it is closely related to the molecular clock hypothesis (see [15, chap. 10] and [10, p. 453]). This hypothesis roughly states that the mutation rate is approximately constant over evolutionary time, and therefore that if all species evolved from a single common root node, the distances from each species to this node should be equal. The biological discussion concerning this idea is beyond the scope of this thesis, and we will therefore expect the input matrices to be non-additive (and thus not ultrametric).

2.1.1 Tree Space

For m species, the number of possible phylogenetic trees, S_m, follows easily from the counting argument illustrated in figure 2.2; see equation 2.1.

(a) From the initial 3-species star tree, there are 3 possible insertion points for the new node.

(b) Inserting in either of the previous 3 edges, we get 2 additional insertion points, now totaling 5.

(c) Once again, 2 more insertion points appear. In general, a tree with n species has 2n − 3 insertion points.

Figure 2.2: Counting the number of possible phylogenetic trees; at each step the number of possible choices increases by two.

\[ S_m = 3 \cdot 5 \cdot 7 \cdots (2m-3) = \frac{(2m-3)!}{2^{m-2}\,(m-2)!}, \qquad \text{for } m > 2 \tag{2.1} \]

For instance, the number of possible trees is 34 459 425 for m = 10, and we expect only a single one of these to minimize the least squares measure. Assuming that the score of a particular tree can be found in 0.01 seconds, it would take roughly four days to try all possible trees with ten species, and a staggering 2.6 trillion (10^12) years for a mere twenty species. The actual tree sizes we wish to examine depend on the particular application, but it should generally be possible to work on trees containing more than twenty species. In practice, it is therefore infeasible to examine all possible trees.
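The double-factorial product of equation 2.1 can be evaluated exactly with arbitrary-precision integers. The following sketch (class and method names are ours) reproduces the figures quoted above.

```java
import java.math.BigInteger;

// Evaluates the product S_m = 3 * 5 * 7 * ... * (2m - 3) from
// equation 2.1 exactly, using arbitrary-precision arithmetic.
public class TreeCount {
    public static BigInteger sm(int m) {
        BigInteger s = BigInteger.ONE;
        for (int k = 3; k <= 2 * m - 3; k += 2)
            s = s.multiply(BigInteger.valueOf(k));
        return s;
    }

    public static void main(String[] args) {
        System.out.println(sm(10)); // 34459425, as quoted in the text
        System.out.println(sm(20)); // 8200794532637891559375; at 100 scored
                                    // trees per second this is roughly the
                                    // 2.6 * 10^12 years mentioned above
    }
}
```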

2.2 Distance Methods

In distance matrix methods, the evolutionary distances are computed for all pairs of sequences. This produces a distance matrix D, where an entry D_ij is the measured distance between species i and j. How to construct these distance matrices is a study in itself. One way to do it is to align comparable DNA sequences from a number of species and use a gap scoring or substitution scoring system to determine an alignment score, which can be used in the distance matrix. For the purposes of this thesis, the exact method used to find the distances is not important.

In this section we present two distance methods [15]: the least squares method (LS) and the related minimum evolution method (ME). The idea is that all possible topologies are examined, such that the tree which yields the smallest value of the method is found. As mentioned, we cannot examine all trees, and in chapter 4 we will touch on a few search methods which have previously been used. Note that the best solution found with respect to LS is not necessarily the best solution with respect to ME, and vice versa.

2.2.1 Least Squares

In the least squares method of phylogenetic inference, we compute the so-called residual sum of squares (denoted Q_LS) for all plausible topologies. The topology with the smallest Q_LS-value is considered the best or optimal tree.

\[ Q_{LS} = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \left( D_{ij} - d_{ij} \right)^2 \tag{2.2} \]

The values w_ij are weights which differ among the various versions of the LS method; their discussion is beyond the scope of this thesis, and for computational purposes we assume w_ij = 1. The value D_ij denotes the distance between i and j provided in the distance matrix, and d_ij is the distance between the two species in the current tree, i.e. the sum of the lengths of the edges connecting the two leaves. If the input matrix is additive, we know from definition 2.3 that the best tree fits the distance matrix perfectly. As there is no error between the distances, we get Q_LS = 0. For non-additive distance matrices, Q_LS must therefore be larger than 0. Note that in the latter case we have no a priori knowledge of the optimal value of Q_LS, and generally we cannot say that a tree is optimal without checking a lot of solutions. Because the topology contains no information about the edge-lengths, we present in section 3.2 a method which estimates these lengths based on the distance matrix.
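With unit weights, equation 2.2 is only a few lines of code. The sketch below (names are ours) assumes the measured distances D and the tree path-length distances d are given as symmetric n × n arrays.

```java
// Residual sum of squares (equation 2.2) with unit weights w_ij = 1.
// D holds the measured distances, d the path-length distances in the
// current tree; both are symmetric n x n matrices.
public class LeastSquares {
    public static double qls(double[][] D, double[][] d) {
        double q = 0.0;
        for (int i = 0; i < D.length; i++) {
            for (int j = 0; j < D.length; j++) {
                double err = D[i][j] - d[i][j];
                q += err * err;
            }
        }
        return q;
    }

    public static void main(String[] args) {
        double[][] D = { { 0, 3 }, { 3, 0 } };
        double[][] d = { { 0, 2 }, { 2, 0 } };
        // Equation 2.2 sums over all ordered pairs, so the single error
        // (3 - 2)^2 is counted in both triangles of the matrix.
        System.out.println(qls(D, d)); // 2.0
    }
}
```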

2.2.2 Minimum Evolution

In the ME method, we seek to minimize the total amount of evolution suggested by the tree. In other words, the sum Q_ME of all estimated edge-lengths is computed for all possible tree topologies. The topology with the smallest Q_ME-value is considered best.

\[ Q_{ME} = \sum_i \hat{b}_i \]

Here, b̂_i denotes the estimated edge-length of the i'th edge in the tree. These estimates are found using the same branch-length estimation as in the LS method. It was originally the plan of the thesis to treat both of these distance methods. However, as the focus shifted to parameter control, we found that

LS was sufficient for our purposes. We mention ME here only for reference with respect to the neighbor joining algorithm in section 4.4.

2.3 Optimization

To introduce the concepts and terminology of problem optimization, we take a look at Rastrigin's 2-dimensional benchmark problem:

\[ \begin{aligned} \text{minimize} \quad & f(x_1, x_2) = 100 + \sum_{i=1}^{2} \left( x_i^2 - 10 \cos(2\pi x_i) \right) \\ \text{subject to} \quad & -5.12 \le x_1 \le 5.12 \\ & -5.12 \le x_2 \le 5.12 \end{aligned} \tag{2.3} \]

Figure 2.3: The fitness landscape of the 2-dimensional Rastrigin benchmark function for arguments x, y ∈ [−2.0, 2.0]. In this close-up view of the function, we notice quite a few valleys—these so-called local minima are what make this particular problem hard for most search heuristics.
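The function is easy to evaluate directly. Note that the widely used formulation of Rastrigin's function has the constant 10n (that is, 20 in two dimensions), which makes the global minimum value exactly 0 at the origin; this illustrative sketch (names are ours) uses that common form rather than the constant printed in equation 2.3.

```java
// Rastrigin's benchmark function in its common form
// f(x) = 10n + sum_i (x_i^2 - 10 cos(2 pi x_i)),
// which has global minimum f = 0 at the origin and a grid of
// local minima near the other integer points.
public class Rastrigin {
    public static double f(double[] x) {
        double sum = 10.0 * x.length;
        for (double xi : x)
            sum += xi * xi - 10.0 * Math.cos(2.0 * Math.PI * xi);
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(f(new double[] { 0.0, 0.0 })); // 0.0, the global minimum
        System.out.println(f(new double[] { 1.0, 1.0 })); // approximately 2.0, a local minimum
    }
}
```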

• The goal of an optimization problem is to find the optimal (i.e. best) of all possible solutions to a given model. More formally, we wish to find a solution in the feasible region which has the minimum value of the objective function; i.e. find x0 ∈ S such that ∀x ∈ S : f(x0) ≤ f(x). (For simplicity we only consider minimization; maximizing a function f is equivalent to minimizing the function −f.) The domain S of the objective function is called the search space, while the elements of S are called candidate solutions.

To illustrate these concepts, we have plotted the Rastrigin objective function in figure 2.3. The search space is S = [−5.12, 5.12] × [−5.12, 5.12], and we are trying to solve a minimization problem. We notice a series of valleys in the fitness landscape; these denote local minima of the function. The optimal solution—the global minimum—is easily determined algebraically to be (0, 0).

• Exploration and exploitation are two concepts used to describe the behavior of a search algorithm; see figure 2.4. In exploration the search is allowed to guide the candidate solutions in an inferior direction. Conversely, in exploitation the solutions must always be improved, i.e. be moving towards some optimum.

Figure 2.4: Illustration of the difference between (a) exploration and (b) exploitation. The large box shows some part of the search space, and the small white square indicates some optimum. The black circles symbolize the candidate solutions, and the arrowheads the direction the algorithm is leading them.

Hill-climbing techniques exploit the best available solution for possible improvement but neglect exploring a large portion of the search space. In contrast, a random search only explores the search space and does not exploit promising regions of the space. Effective search techniques provide a mechanism for balancing the two types of guiding [14, p. 44].

2.4 Evolutionary Algorithms

The terms Evolutionary Algorithm (abbreviated EA) and Evolutionary Computation were introduced to unify the set of optimization techniques inspired by biological evolution; algorithms that use a population of individuals which are selected and altered in an iterative process [22, p. 2]. This includes, but is not limited to, the algorithms listed in the box of figure 2.5. Because there is still no widespread consensus on the naming of EAs, we give a short introduction to those we will discuss later in the thesis:

Figure 2.5: A few relevant search heuristics with respect to phylogenetic inference. Standard branch and bound is an exact method, and neighbor joining is a deterministic heuristic.

Evolution Strategy (ES) In the late 1960s Ingo Rechenberg and Hans-Paul Schwefel developed the ES. Initially, it consisted of a single individual and a mutation operator. In each iteration, the parent and its offspring from the mutation are compared, and the best survives. This scheme is known as (1 + 1)-ES and is similar to hill climbing. The concept of a population was later introduced, and two main evolution strategies were suggested: (µ + λ)-ES and (µ, λ)-ES. In both cases a population of µ parents is used to create λ offspring. In the former both parents and offspring are exposed to selection, and in the latter we select µ individuals from the λ > µ offspring.

Genetic Algorithm (GA) Due to his work in the early 1970s, John Holland had a large role in making GAs a widely recognized optimization method. The GA is a generic optimization technique well suited for a wide range of problems. There are no requirements on the representation of individuals, although it is often recommended that it somehow reflects the problem being solved. Recombination of the solutions, as well as mutation and selection, is used. As the title of this thesis suggests, the primary focus is on using GAs—the details of which will be given in chapter 5.

Genetic Programming (GP) In 1985 Nichael Cramer introduced one of the first GPs by developing a tree-structured GA for basic symbolic regression. However, it was John Koza (with his LISP-coded GP algorithm from 1989) who was largely responsible for the popularity of GP within the field of computer science. The idea of GP is the following: given a problem, construct a computer program which solves the problem. The individuals are programs in the form of branching data structures, and the GP uses special mutation and recombination operators which expand or shrink these trees. Because of this, the tree structures can grow as large as necessary to solve whatever problem they are applied to.

2.4.1 Terminology

All evolutionary algorithms have an initialization phase followed by an iteration phase that evolves the initial population [22, p. 9]. This way, we hope to end up with a better set of solutions to our problem. Note that the execution of n repeated local searches is different from one execution of a population-based heuristic with n individuals. A common way of describing an EA is presented as pseudo-code in algorithm 2.1. The terminology used will be explained in this section.

procedure EvolutionaryAlgorithm
    t ← 0
    initialize population P(0)
    while not (termination condition) do
        t ← t + 1
        select population P′(t) from P(t − 1)
        create population P(t) from P′(t)
        evaluate population P(t)
    end while
end procedure

Algorithm 2.1: Pseudo-code describing the genetic algorithm.

• An individual or candidate solution is a representative within the genetic algorithm of a possible solution to the particular problem. The GA maintains a number of individuals in a population (denoted P in the pseudo-code). Each individual consists primarily of a genome and a fitness. The genome is comprised of a number of genes which altogether encode the data structure of a solution to the particular optimization problem.

• The fitness of an individual represents a measure of the quality of the encoded solution. Fitnesses are computed by a fitness function. For this thesis, however, the fitness will denote the Q_LS-value calculated by the least squares method on the tree topology of the individual being considered. When comparing the quality of individuals, we use the terms 'more fit' and 'less fit', meaning better and worse, respectively.

• Each iteration of the evolution cycle of the genetic algorithm is called a generation. In the above pseudo-code, t denotes the generation counter. This iterative process continues until a predefined termination condition is met. In the beginning of a new generation t, the selection operator is responsible for selecting P′(t) based on the population P(t − 1) from the previous generation. Often we are interested in removing less fit solutions while preserving the more fit. This idea is adopted from the notion of natural selection, which states that individuals with favorable traits are more likely to survive and reproduce than those with unfavorable traits.

• Creating a new population P(t) consists of two parts: first recombining the individuals of P′(t) and then mutating them. The idea behind recombination is to combine the genes of two or more candidate solutions, resulting in an entirely new genome. The idea is inspired by nature, where two parents create an offspring, and hopefully the offspring inherits the good characteristics of each parent. We will thus use the terms parent and offspring to denote this relationship. (Several authors use crossover as a synonym for recombination. The term originates from molecular genetics, where chromosomal crossover is when two strands of DNA recombine. Inspired by this, in GA terminology it often refers to the chromosome being represented as a bit vector. Because we are not doing this, we prefer the term recombination.) The second part is mutation, which changes a single individual independently of all other candidate solutions. Mutation is often accomplished by adding a bit of stochastic noise to the existing solution.

Chapter 3

Representing Tree Topologies

In this chapter we first examine various ways in which the topology of a tree can be represented. Next we will go through how the branch-lengths of a topology are calculated based on a distance matrix.

3.1 Data Structure

The primary data structure used to implement the trees is a straightforward combination of nodes and edges as separate objects. It is appropriately named SimpleTree, and we do not feel that it needs a detailed explanation. The Java interface of trees is presented in appendix B.2.1. In this section we mention a couple of other ways to represent a tree topology that we considered using, but found to be impractical: distance matrices and Prüfer sequences, illustrated in figure 3.1.

3.1.1 Distance Matrix

One way of representing a topology is by using a distance matrix. Note that we are not using the input distance matrix, but another matrix which we use to represent and indirectly alter the topology. With a deterministic algorithm, we then map the distance matrix to a topology—it is therefore possible to obtain different topologies by letting the genetic algorithm change the distance matrix. An example can be seen in figure 3.1, which also highlights a few problems with the distance matrix representation.

• The first obvious problem is redundancy. It is easy to imagine different distance matrices that would result in a tree with the same topology. Because n(n − 1)/2 entries of the matrix can take any value (the diagonal is fixed, and the matrix is symmetric), the search space is effectively made continuous and infinitely large. However, there is still the same number of tree topologies, so finding the right one has just become harder.


Figure 3.1: Two topologies and their corresponding distance matrix and Prüfer sequence representations [17]. The arrows indicate methods of transformation, which are included in appendix A.

• Another, less problematic issue is that we have to convert the distance matrix to a topology before it can be evaluated—e.g. one could imagine using the neighbor joining algorithm. Nei and Kumar [15, p. 108] note, however, that on rare occasions it can construct tie trees (different trees produced from the same data set). Therefore, the NJ method might not be as deterministic as needed for this mapping, so we would have to use some other tree construction algorithm.

• The third problem—which was the main reason we decided not to use distance matrices as representation—is the inability to change the topology directly. In a distance matrix, mutation can amount to things like altering entries, interchanging rows and so on. However, we prefer to be able to manipulate the topology more directly, for example by swapping leaves or by removing and reinserting subtrees. It is not obvious how this should be done on matrices, although we have not given it much thought.

3.1.2

Prüfer Sequence

Prüfer sequences or Prüfer numbers were introduced by Heinz Prüfer in 1918. For an unrooted binary tree with n labels we can construct a 2(n − 2) long sequence that uniquely represents the tree [17], as shown in figure 3.1. The Prüfer sequence provides an easy way to systematically enumerate all topologies. Translating the sequence to and from the corresponding topology can be done easily—see appendix A.3 for details. Because the Prüfer sequence is represented as a series of integers, it is straightforward to implement. Keeping the sequence legal is therefore the job of the mutation and recombination operators. Furthermore, Gottlieb et al. [9] have examined the locality of Prüfer sequences with respect to conventional mutation, and found it to be poor. A representation has high locality if a small change to the coding means a small change of the corresponding tree. They conclude that using Prüfer sequences in evolutionary algorithms should be avoided. As was the case with distance matrices, we want more direct control of the tree, and we therefore choose not to use Prüfer sequences as the primary representation. However, we have actually used them in our implementation; both for testing (doing an exhaustive search of small trees in a systematic fashion), and for generating random trees (converting the Prüfer tree to our ordinary data structure).
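The thesis's own conversion routines are given in appendix A.3. As a hedged illustration of why the translation is easy, here is the standard decoding of a Prüfer sequence into a labeled tree on len(seq) + 2 nodes, so a sequence of length 2(n − 2) decodes a binary unrooted tree whose 2n − 2 nodes are all numbered; the appendix's variant may differ in details such as node numbering:

```python
def prufer_to_edges(seq):
    """Standard decoding of a Prüfer sequence into the edge list of a
    labeled tree on n = len(seq) + 2 nodes (illustrative sketch; the
    thesis's appendix A.3 variant may differ in details)."""
    n = len(seq) + 2
    degree = [1] * n
    for x in seq:
        degree[x] += 1  # each occurrence adds one future neighbour
    edges = []
    for x in seq:
        # attach the smallest current leaf to the next sequence entry
        leaf = next(i for i in range(n) if degree[i] == 1)
        edges.append((leaf, x))
        degree[leaf] -= 1
        degree[x] -= 1
    u, v = (i for i in range(n) if degree[i] == 1)  # two nodes remain
    edges.append((u, v))
    return edges
```

For instance, the sequence [1, 2, 3] decodes to the path 0-1-2-3-4, and a constant sequence decodes to a star.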

3.2 Estimation of Branch-Lengths

Given a tree topology we wish to calculate the branch-lengths such that the least squares measure Q_LS in equation 2.2 on page 6 is minimized. This section presents the standard way of doing this; the calculations presented are primarily those of Felsenstein [7]. If we label all branches, we define the indicator variable x_{ij,k} by
\[
x_{ij,k} = \begin{cases} 1 & \text{if branch } k \text{ lies on the path from species } i \text{ to species } j \\ 0 & \text{otherwise} \end{cases}
\]
We can now express the branch-length sum d_{ij} by
\[
d_{ij} = \sum_k x_{ij,k} v_k
\]
where v_k is the length of branch k. The measure Q_LS is therefore expressed as
\[
Q_{LS} = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \Big( D_{ij} - \sum_k x_{ij,k} v_k \Big)^2
\]
If we differentiate this with respect to some v_k (say v_1), and equate the derivative to zero, we get
\[
\frac{dQ_{LS}}{dv_1} = -2 \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}\, x_{ij,1} \Big( D_{ij} - \sum_k x_{ij,k} v_k \Big) = 0
\]
The solution to this equation is
\[
\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}\, x_{ij,1} D_{ij} = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}\, x_{ij,1} \sum_k x_{ij,k} v_k \tag{3.1}
\]
Assume for the time being that all weights w_{ij} = 1. Considering the example in figure 3.2, we can express the vector d and matrix X corresponding to the topology as:

[Figure 3.2 omitted: a five-species unrooted tree with leaves A-E attached by branches 1-5 respectively; internal branch 6 separates {C, E} from the rest, and internal branch 7 separates {B, D} from the rest.]
Figure 3.2: The tree topology used in the branch-length estimation example.

\[
d = \begin{pmatrix} D_{AB}\\ D_{AC}\\ D_{AD}\\ D_{AE}\\ D_{BC}\\ D_{BD}\\ D_{BE}\\ D_{CD}\\ D_{CE}\\ D_{DE} \end{pmatrix}
\quad\text{and}\quad
X = \begin{pmatrix}
1&1&0&0&0&0&1\\
1&0&1&0&0&1&0\\
1&0&0&1&0&0&1\\
1&0&0&0&1&1&0\\
0&1&1&0&0&1&1\\
0&1&0&1&0&0&0\\
0&1&0&0&1&1&1\\
0&0&1&1&0&1&1\\
0&0&1&0&1&0&0\\
0&0&0&1&1&1&1
\end{pmatrix}
\]
Equation 3.1 can therefore be expressed compactly in matrix notation as
\[
X^T d = (X^T X)\, v \iff v = (X^T X)^{-1} X^T d
\]
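As a sketch of how these normal equations can be solved in practice (this is not the thesis's implementation; the helper names `solve` and `ls_branch_lengths` are ours, and plain Gaussian elimination stands in for a linear algebra library):

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting (avoids external libraries)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def ls_branch_lengths(X, d):
    """v = (X^T X)^{-1} X^T d: unweighted least squares branch lengths."""
    m, nb = len(X), len(X[0])
    XtX = [[sum(X[r][i] * X[r][j] for r in range(m)) for j in range(nb)]
           for i in range(nb)]
    Xtd = [sum(X[r][i] * d[r] for r in range(m)) for i in range(nb)]
    return solve(XtX, Xtd)

# The design matrix for the figure 3.2 topology: rows are the species pairs
# AB, AC, AD, AE, BC, BD, BE, CD, CE, DE; columns are the branches 1-7.
X = [[1, 1, 0, 0, 0, 0, 1], [1, 0, 1, 0, 0, 1, 0], [1, 0, 0, 1, 0, 0, 1],
     [1, 0, 0, 0, 1, 1, 0], [0, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 0, 0, 0],
     [0, 1, 0, 0, 1, 1, 1], [0, 0, 1, 1, 0, 1, 1], [0, 0, 1, 0, 1, 0, 0],
     [0, 0, 0, 1, 1, 1, 1]]
```

If d is generated from known branch lengths on this topology, the estimator recovers them exactly, since the distances are additive.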

3.2.1 Faster Branch-Length Prediction

Matrix operations are expensive; the calculations of the previous section exhibit a time complexity of about O(n^3). To overcome this, Bryant and Waddell [3] presented an algorithm running in O(n^2) time for the unweighted least squares method. We will not discuss the details of the algorithm in this thesis; instead the reader is referred to the cited article. We tried to implement the faster algorithm, hoping that the speed-up gained would outweigh the time used to implement it. However, we later learned that the faster algorithm was significantly slower than the matrix manipulations of the simpler tree data structure. The performance was compared using a single run of the genetic algorithm (see chapter 5) with 200 generations and population size 50, a 28-species distance matrix, and mutation and recombination probabilities set to 0.1 and 0.5, respectively. The genetic algorithm using the simple tree with matrix manipulations took 21 minutes, while using the faster algorithm suggested by Bryant and Waddell it took about three times as long, namely 63 minutes. One main reason for this lack of speed-up is probably that the number of species, n = 28, is too small to outweigh the hidden constants in the O(n^2) versus O(n^3) time complexities. Another disadvantage of the faster algorithm was that it required a more complex data structure; for instance, information about the number of species on each side of a particular edge had to be maintained. Changes to the tree (by mutation or recombination) required this information to be updated, which slowed the algorithm down further.

Chapter 4

Methods for Phylogenetic Inference

As mentioned in the introduction, it is simply infeasible to try all possible trees in order to solve the phylogeny problem. In this chapter we introduce some traditional search heuristics which have been used to solve it previously. Although they are used with respect to different optimality criteria, the search methods themselves are still applicable to e.g. least squares.

4.1 Branch and Bound

Branch and bound is an exact technique built on the idea of repeatedly partitioning the search space. Wu et al. [24, p. 207] have implemented a branch and bound algorithm for constructing minimum ultrametric trees (see def. 2.4), where the sum of edge-lengths is minimized. It works by searching for better solutions in the so-called branch and bound tree until the optimal solution has been found. The algorithm starts out by using UPGMA (a deterministic clustering algorithm) to find a feasible solution. It then goes back and tries different branches in the tree, and removes those which are found to be worse than the previous solution. In other words, the found solutions enforce a bound on the search. Cotta and Moscato [4] use this approach to find the minimum ultrametric tree for a distance matrix with 20 species in a couple of seconds, and one of size 34 in about 6.5 hours. It is worth emphasizing that it is due to the special properties of ultrametric trees that it is feasible to use an exact algorithm.

4.2 Simulated Annealing

Simulated annealing (in short SA) is a hill-climbing algorithm with the possibility to escape local optima without starting a new search. Compared to the ordinary hill-climber, SA has some extra features, primarily the concept of temperature. If the temperature is high, the algorithm has a good chance of accepting an inferior solution—when the temperature drops, so does the probability. The temperature starts high, and is slowly decreased throughout the run. Stamatakis [21] uses SA for inferring phylogenetic trees based on the maximum likelihood (ML) method, which attempts to find the evolutionary model which maximizes the likelihood—the probability that the particular model conforms to the data set. His algorithm uses a combination of SA and hill-climbing, and although a bit slower, it manages to find better trees compared to strict hill-climbing.
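The acceptance rule at the heart of SA can be sketched as follows; this is a minimal illustration of the temperature mechanism described above, not Stamatakis' implementation, and the function name `sa_accept` is our own:

```python
import math
import random

def sa_accept(delta, temperature, rng=random):
    """Metropolis-style acceptance rule of simulated annealing: always
    accept improvements; accept a worsening of size delta with probability
    exp(-delta / T). As T drops, inferior moves become unlikely."""
    if delta <= 0:  # minimization: non-positive delta is no worse
        return True
    return rng.random() < math.exp(-delta / temperature)
```

A full SA run would call this inside a loop while multiplying the temperature by a cooling factor (e.g. 0.99) each iteration.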

4.3 Genetic Algorithms

Cotta and Moscato use an evolutionary algorithm¹ to find the minimum phylogenetic ultrametric tree (as mentioned in sec. 4.1). The reason is that they believe ultrametric trees provide a "very good approximation to the optimal solution under more relaxed assumptions" [4, p. 3]. It also allows a convenient way to measure the quality of their results, since the branch and bound method gives an exact solution, see section 4.1. They claim to get quite good results with traditional mutation and recombination operators. Furthermore, they try using a decoder-based EA—this type of EA utilizes a decoder and an auxiliary search space instead of searching the normal search space. Apparently this approach yields impressive results for trees up to 34 species compared to altering the tree representations directly [4, p. 9]. In related problems, several articles describe the reconstruction of phylogenetic trees with respect to maximum likelihood using GAs, e.g. Matsuda [13] and Poladian and Jermiin [16]. Shorikhine goes a step further, and has developed a self-adaptive algorithm, again for maximum likelihood reconstruction of phylogenetic trees [20]. As mentioned in the introduction, we generally feel that there has been too little focus on identifying good parameters for the GAs. To compensate for these shortcomings, we have in this thesis placed great emphasis on finding and tuning parameters (see chapter 6).

4.4 Neighbor Joining Method

Saitou and Nei [19] developed an efficient tree-building method based on the minimum evolution principle called the neighbor joining (NJ) method. It was developed in part because of the amount of computer time required by the ME method for a large number of species. Therefore, the NJ method does not examine all possible topologies; instead, at each stage of the clustering, it uses the minimum evolution principle (minimizing the total branch-lengths), as illustrated in figure 4.1. Pseudo code is available in appendix A.2.

¹The authors say that they use genetic programming—however we would probably classify it as a genetic algorithm according to the description in section 2.4.

[Figure 4.1 panels (a)-(d) omitted; their captions read:]
(a) The NJ method starts with a star tree. Then clustering of the species is performed, such that we end up with a phylogenetic tree as defined earlier.
(b) Assume that species 1 and 2 exhibit the smallest Q_ME value. We then create a new node 'A' connecting these two leaves.
(c) In the next step, species 1 and 2 are no longer considered (hence the dashed lines). Assume species 5 and 6 have the smallest Q_ME value, and thus we create node 'B'.
(d) The procedure is repeated until all species are clustered in a single unrooted tree. This tree is the NJ tree.

Figure 4.1: Illustration of the neighbor joining method. Recall that the value Q_ME denotes the sum of all branch-lengths, see section 2.2.2. This example is explained in greater detail by Nei and Kumar [15, p. 105].

The advantages of NJ are that it is fast, and that it has been widely used in the literature. A weakness is that it is not guaranteed to find the optimal solution with respect to ME unless the distance matrix is additive. In practice, however, it still manages to find reasonable solutions on non-additive matrices. We use the NJ method to create the initial population of the genetic algorithm, as we explain in section 5.1.3.
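The clustering loop described above can be sketched roughly as follows; the thesis's pseudo code is in appendix A.2, and this is our own illustrative version using the common Q-criterion formulation of NJ with a `frozenset`-keyed distance map:

```python
import itertools

def neighbor_joining(taxa, D):
    """Sketch of the NJ clustering loop (illustrative only). D maps
    frozenset({a, b}) -> distance and is updated in place; the list of
    joined pairs is returned."""
    nodes = list(taxa)
    joins, counter = [], 0
    while len(nodes) > 2:
        n = len(nodes)
        r = {i: sum(D[frozenset({i, k})] for k in nodes if k != i)
             for i in nodes}
        # Q-criterion: join the pair minimizing (n-2)*d(i,j) - r_i - r_j
        i, j = min(itertools.combinations(nodes, 2),
                   key=lambda p: (n - 2) * D[frozenset(p)] - r[p[0]] - r[p[1]])
        new = "node%d" % counter
        counter += 1
        joins.append((i, j))
        for k in nodes:
            if k not in (i, j):  # reduction formula for the merged cluster
                D[frozenset({new, k})] = 0.5 * (D[frozenset({i, k})]
                                                + D[frozenset({j, k})]
                                                - D[frozenset({i, j})])
        nodes.remove(i)
        nodes.remove(j)
        nodes.append(new)
    return joins
```

On an additive matrix the first join is a true cherry of the underlying tree, consistent with the guarantee discussed above.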

Chapter 5

Implementing the Genetic Algorithm

In this chapter we present the details of a genetic algorithm which we will use to search for the optimal phylogenetic tree with respect to least squares. The GA is often visualized as in figure 5.1, and at any point of the iterative cycle, the population consists of valid candidate solutions. This means that the algorithm can be stopped before the termination criterion is met. In the following sections we will present and discuss various operators for the different phases: initialization (sec. 5.1), selection (sec. 5.2), recombination (sec. 5.3) and mutation (sec. 5.4). An overview of the implemented GA framework is available in appendix B. Note that for simplicity the fitnesses used in this chapter are abstract fitness measures; good solutions will have a high fitness value, and conversely a less fit solution will have a lower fitness. In the actual implementation we use the least squares score of the individual. The sections use a uniform format presenting each operator: first a brief explanation, then a walk-through of how it actually works (scheme), and finally a discussion of the pros and cons of the particular operator (discussion).

5.1 Initial Population

The genetic algorithm needs an initial population of candidate solutions; i.e. a set of phylogenetic trees to start the search from. A main objective for the initial population is to help the genetic algorithm eventually find the optimal solution. This of course cannot be guaranteed, but to this end the candidate solutions must to some extent be spread throughout the solution space.

5.1.1 Randomly Generated Individuals

The first method for initializing the population that springs to mind is to make it consist entirely of random trees.

[Figure 5.1 omitted: the usual GA cycle of initialization and evaluation, a population with associated fitnesses, selection, recombination, mutation, and re-evaluation.]
Figure 5.1: The typical illustration of the genetic algorithm [22].

Scheme

Select three species uniformly at random from all species. These nodes form the smallest possible binary unrooted tree. From the remaining species, select one at random, and insert it at a random insertion point, see figure 2.2 on page 5. Repeat until all species have been inserted.
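The random-insertion scheme can be sketched as follows, using a hypothetical edge-list representation in which leaves are strings and internal nodes are integers; this is not the thesis's actual data structure:

```python
import random

def random_topology(species, seed=None):
    """Sketch of the random-insertion scheme: start from the three-leaf
    star and attach each remaining species at a uniformly chosen edge."""
    rng = random.Random(seed)
    species = list(species)
    rng.shuffle(species)                      # pick the first three at random
    a, b, c = species[:3]
    edges = [(a, 0), (b, 0), (c, 0)]          # smallest binary unrooted tree
    next_internal = 1
    for leaf in species[3:]:
        u, v = edges.pop(rng.randrange(len(edges)))  # random insertion point
        w = next_internal
        next_internal += 1
        edges += [(u, w), (v, w), (leaf, w)]  # subdivide the edge, graft leaf
    return edges
```

Each insertion removes one edge and adds three, so n species always yield 2n − 3 edges and n − 2 internal nodes, as required for a binary unrooted tree.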

Discussion

The idea of using random individuals is illustrated in figure 5.2(a). The large rectangle is an abstract representation of the search space, and the black dots are the individuals chosen for the initial population. An obvious drawback is that we have no control over the distribution of solutions in the search space. Therefore we risk obtaining an initial population where every individual is located in some inferior part of the search space. The main advantage of using a random population is that it is easy to implement, and it is widely used because it does not 'cheat'—we will discuss this in greater detail in the summary (sec. 5.1.4). It gives no guarantees regarding the outcome, but over the course of several runs, there is a good chance of getting satisfactory results.

Figure 5.2: Abstract illustration of the tree search space. The set of black dots represents the initial population of the genetic algorithm. (a) Random initialization. (b) Grid initialization.

5.1.2 Grid Initialization

Grid initialization is another widely used method for generating the initial population. The idea is to sample the individuals in such a way that they are scattered evenly throughout the entire search space. In figure 5.2(b) this is illustrated as a grid. Note that we have left out the 'scheme' section, because we currently do not know how to implement it, see section 8.1.

Discussion

Currently, the problem is that there is no structure to the search space; consequently we cannot sample the trees evenly. For continuous search spaces it is often quite obvious how to do this, but for combinatorial problems it is often non-trivial; we discuss some ideas in future work. A problem with a deterministic initialization is that multiple runs will start in the same area of the search space. This works against diversity in our population and thereby increases the risk of missing a global optimum. For grid initialization this could easily be countered in most cases by moving the start grid a bit for each run.

5.1.3 Knowledge-based Initialization

The idea of basing the population on some expert knowledge has gained support recently [22, p. 20]. Instead of letting the GA start from scratch, we give it a head start by providing it with candidate solutions that we already know are good, as shown in figure 5.3(a). We have decided to go with a mix, combining random individuals with some knowledge-based ones, as in figure 5.3(b).

Figure 5.3: Using a knowledge-based approach to creating the initial population. The small white square represents some good solution; either the most fit individual from a previous run, or the result of a fast heuristic. (a) Knowledge-based initialization: basing the population on previous good solutions or on the result of other search heuristics. (b) A combination of knowledge-based and random initialization: a small part of the population is based on some heuristic, and the rest is random in order to disperse the population.

Scheme

Rzhetsky and Nei [18] use the NJ method to obtain a good tree topology. They then examine trees whose topological distance is 2 or 4 from the NJ tree, looking for a better minimum evolution tree. Inspired by this, we have chosen to include the NJ tree, and additionally let one tenth of the population be in the neighborhood (i.e. obtained through a number of mutations) of this NJ tree. The initial population, P_0, of the NJ-based population generator is given by the following expression, where P[k] denotes the k'th individual of the population P:
\[
P_0[i] = \begin{cases} \text{the original NJ tree} & i = 0 \\ \text{a mutation of the NJ tree} & i \in \{10j \mid j \in \mathbb{N}\} \\ \text{a random tree} & \text{otherwise} \end{cases}
\]

Discussion

We have chosen to base one tenth of the population on the result of the NJ algorithm. This ratio is a fine balancing act: if it is too high, we to some extent restrict the GA to searching only the neighborhood of the NJ tree. On the other hand, if the ratio is too small, we risk accidentally eliminating the NJ-based solutions, and thus being in the same situation as with an all-random population. Also, the smaller the number of NJ-based solutions, the smaller the probability of them being selected for mutation and recombination.

A good ratio is also deeply influenced by the selection scheme as well as the probabilities of mutation and recombination. To determine the optimal ratio of NJ-based solutions, we would have to test many different settings. We therefore decided to use the aforementioned scheme with a ratio of one tenth of the population.
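The initializer itself is then a short loop. A sketch, under the assumption that `mutate` and `random_tree` helper functions exist (hypothetical names, not the thesis's API):

```python
def nj_based_population(nj_tree, pop_size, mutate, random_tree):
    """NJ-based hybrid initializer: slot 0 holds the NJ tree itself, every
    tenth slot a mutated copy of it, and all remaining slots random trees.
    `mutate` and `random_tree` are assumed helper functions."""
    population = []
    for i in range(pop_size):
        if i == 0:
            population.append(nj_tree)          # the original NJ tree
        elif i % 10 == 0:                       # i in {10j | j in N}
            population.append(mutate(nj_tree))  # a neighbour of the NJ tree
        else:
            population.append(random_tree())    # disperse the rest
    return population
```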

5.1.4 Summary

Using randomly generated trees as the initial population of the genetic algorithm might seem like a bad idea compared to deterministically generated populations made by NJ. However, knowledge-based approaches like NJ guide the genetic algorithm into a specific part of the solution space. Random initialization, on the other hand, gives a more distributed initial population, and thus a more 'open-minded' start. Population initialization, in particular knowledge-based, has not received much attention previously. Ursem [22, p. 25] offers an explanation, namely that "most EA has been used on artificial benchmark problems where the global optimum is known in advance and any application of 'expert-knowledge' would be considered cheating". For instance, consider Rastrigin's benchmark function given in equation 2.3 on page 7. The optimum for this function is found at (0, 0)—an argument value that most grid initialization implementations would probably include. The EA would thus find the optimal solution immediately, and we would be left with little clue as to how good the EA really is. Conversely, we have no exact knowledge of the optimal solution to our problems, so we can try schemes like NJ.

5.2 Selection Operators

The selection operator is responsible for choosing which candidate solutions are eligible to survive to the mutation and recombination phases—and, on the other hand, which are to be eliminated. We use the term p_survival to denote the probability of a particular individual being selected.

5.2.1 Tournament selection

Tournament selection is a widely used selection operator. The idea is that a number of candidate solutions are picked out to fight each other in an arena of death—only one can make it out alive.

Scheme

The n-way tournament selection is illustrated in figure 5.4 (in this case n = 3). From the population, n candidate solutions are picked out at random and put into a tournament pool. It is in this pool that the chosen individuals fight, and a single 'winner' survives to the next phase of the evolution. This procedure is repeated until the population of winners is full. The same individual can participate in several battles, increasing the chance that the best individuals are carried on.


Figure 5.4: Example of 3-way tournament selection.

In the general case the winning candidate solution is selected from the tournament pool according to some probability p—the most fit candidate solution is chosen with probability p, the second most fit with probability p(1 − p), the third best with p(1 − p)^2, and so on. In the special case p = 1, called deterministic tournament selection, the most fit candidate solution in the tournament pool is always selected.

Discussion

Because of time constraints we have chosen not to experiment with the probability p and the size of the pool. Instead we have opted to use deterministic tournament selection with a small pool; namely p = 1 and n = 2. The pool must be kept small in order to keep the new population diverse. The larger the pool, the more likely it is that the best solution will account for the majority of the new population. Another problem is that it becomes increasingly unlikely that some less fit candidate solution survives. This might not seem like a huge problem—after all, we are looking for candidate solutions which are as fit as possible—but we risk that the population converges too fast on some local optimum.
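A sketch of the operator as described (the function name is our own; with p = 1 and n = 2 it reduces to the deterministic variant used in the experiments):

```python
import random

def tournament_select(population, fitness, n=2, p=1.0, rng=random):
    """n-way tournament: the fittest pool member wins with probability p,
    the second fittest with p(1 - p), and so on. Higher fitness is better
    here, matching the chapter's abstract fitness convention."""
    pool = sorted(rng.sample(population, n), key=fitness, reverse=True)
    for candidate in pool[:-1]:
        if rng.random() < p:
            return candidate
    return pool[-1]  # every earlier coin flip failed
```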

5.2.2 Roulette-wheel selection

In roulette-wheel selection (also known as fitness proportionate selection), the fitness of an individual is used to determine the probability of selecting it.


Scheme

Figure 5.5 illustrates the analogy to a roulette wheel; each candidate solution is represented by a pocket of varying size on the wheel. The size of the pocket is chosen such that good solutions are more likely to be selected, and bad solutions less likely. This also means that solutions with good fitnesses are less likely to be eliminated—though there remains a chance that they will be.

[Figure 5.5 omitted: a wheel with pockets of sizes 10%, 30%, 7%, 13%, 17% and 23% for the fitnesses 3, 9, 2, 4, 5 and 7.]

Figure 5.5: Example of roulette-wheel selection. The total fitness is 3 + 9 + 2 + 4 + 5 + 7 = 30, so the candidate solution with fitness 9 accounts for 9/30 = 30% of the entire roulette wheel.

When the wheel is spun, the winner is copied to the next generation. This is repeated enough times for the surviving population to be selected.

Discussion

In our case we attempt to solve a minimization problem. This means that candidate solutions which are close to the optimal have lower fitness values (due to the lower least squares score), and conversely bad solutions have higher fitness values. Therefore we must first map the fitness of each candidate solution to a suitable value in order for the scheme to work. To illustrate how this can be a problem, we take a look at the following two examples. In the first table, 5.1(a), assume that we have fitness values (calculated least squares scores of three candidate solutions) 1.0, 2.0 and 4.0. As mentioned, we are trying to solve a minimization problem, so we want 1.0 to be selected more often than 2.0, which in turn should be selected more often than 4.0. If we map the fitness values with the function m : x ↦ 1/x, we get survival rates 0.57, 0.29 and 0.14, respectively. The requirement that good solutions be more likely and bad solutions less likely is clearly satisfied. In the other table, 5.1(b), we notice that the three candidate solutions are about equally likely to be selected. The problem is that we do not know the optimal value; if it is 0.0 the probabilities seem fine. However, let us

(a)
Fitness x | m(x) | p_survival
1.0       | 1.00 | 0.57
2.0       | 0.50 | 0.29
4.0       | 0.25 | 0.14

(b)
Fitness x | m(x)   | p_survival
43.0      | 0.0232 | 0.343
44.0      | 0.0227 | 0.336
46.0      | 0.0217 | 0.321

(a) The optimal tree has fitness 0.0; the candidate solution with fitness 1.0 is relatively twice as good as the one with fitness 2.0. In this case the mapping function depicts this relationship.

(b) In this data the best achievable tree has fitness 42.0 (which is assumed to be unknown at the time of testing). This means that, similarly to the other example, the candidate solution with fitness 43.0 is in fact twice as close to the best tree as the solution with fitness 44.0.

Table 5.1: Example of using roulette wheel selection. The table illustrates using the mapping function m : x ↦ 1/x. When the best achievable fitness value is unknown this mapping might not be sufficient.

suppose that the best achievable tree for this particular data set has fitness 42.0 (assumed to be unknown at the time of testing). This would mean that—similarly to the other example—the candidate solution with fitness 43.0 is in some sense twice as close to the best tree as the solution with fitness 44.0. To sum up, the drawback of roulette wheel selection is that the selection pressure depends on the relative fitness of the individuals. Therefore a few very good candidate solutions can quickly take over the entire population [22, p. 23]. Because of this we prefer to use ranking selection instead.
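For completeness, a sketch of the wheel spin for a minimization problem using the mapping m(x) = 1/x from table 5.1 (illustrative only; `roulette_select` is our own name):

```python
import random

def roulette_select(population, fitness, rng=random):
    """Fitness-proportionate selection for a minimization problem, using
    the mapping m(x) = 1/x to turn least squares scores into pocket sizes."""
    weights = [1.0 / fitness(ind) for ind in population]  # pocket sizes
    spin = rng.uniform(0, sum(weights))
    acc = 0.0
    for ind, w in zip(population, weights):
        acc += w
        if spin <= acc:
            return ind
    return population[-1]  # guard against floating point round-off
```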

5.2.3 Ranking Selection

Ranking selection assigns a chance to pick a certain individual based on a ranking of the individuals according to their fitness. The best individual has a certain chance to be chosen, the second best a smaller chance and so on. The point is that the actual fitness does not influence the chance to be chosen.

Scheme

We have chosen to implement a ranking selection where the individual with the worst fitness is assigned one share, the next two shares, and the best n shares (where n is the population size). To find the surviving individual, all the shares are summed and a random number between one and the sum is drawn. The individual corresponding to this share is then selected. An example can be seen in figure 5.6. The procedure is repeated until the new population is full.


Figure 5.6: Example of ranking selection. The individual with the best fitness gets n shares, in this case six, while the worst gets one.

Rank | Fitness | p_survival
1    | 9.0     | 0.286
2    | 7.0     | 0.238
3    | 5.0     | 0.190
4    | 4.0     | 0.143
5    | 3.0     | 0.095
6    | 2.0     | 0.048

Table 5.2: The selection probabilities of the example in figure 5.6.

Discussion

This selection scheme is not flawless; a primary shortcoming is that a large set of candidate solutions with nearly identical fitnesses will not be equally likely to be selected. Unlike roulette-wheel selection, however, ranking does not need to know the fitness of the optimal solution. Because of this we opt to use ranking selection instead of the other selection operators in our tests.
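The share-based scheme above can be sketched as follows (our own illustrative code, using the chapter's abstract higher-is-better fitness as in figure 5.6; for the actual least squares scores the sort order would be reversed):

```python
import random

def ranking_select(population, fitness, rng=random):
    """Linear ranking: the worst individual gets 1 share, the next 2, the
    best n shares. A ticket between 1 and the share sum picks the winner."""
    ranked = sorted(population, key=fitness)          # worst first
    total = len(ranked) * (len(ranked) + 1) // 2      # 1 + 2 + ... + n
    ticket = rng.randint(1, total)
    acc = 0
    for shares, individual in enumerate(ranked, start=1):
        acc += shares
        if ticket <= acc:
            return individual
```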

5.2.4 Elitism

Elitism is not a selection operator per se; it is the idea of making sure that the best candidate solution in the beginning of a generation is also available in the start of the next generation.

Scheme

Find and keep a copy of the best candidate solution at the beginning of the generation. When the generation cycle is completed, the saved solution is reinserted into the population. We have used an elitism of one individual in all experiments.

Discussion

Because mutation can make the previous best candidate solution less fit, the elitism operator ensures that the best candidate solution bypasses mutation and is saved during the selection phase. The advantages are quite obvious: the best candidate solution does not deteriorate. Therefore, if we suddenly choose to stop the evolution, we can be certain that no candidate solution with better fitness than the best available solution has been observed.

5.3 Recombination Operators

Recombination is our first means of altering the individuals in the GA’s population. The idea is borrowed from the biological concept of sexual reproduction; that two individuals mate and create an offspring, which ideally exhibits the best traits from each parent. Therefore recombination is an operator that combines two individuals to generate a new individual.

5.3.1 Prune-Delete-Graft Recombination

We have chosen the most popular method of recombining two trees; namely the Prune-Delete-Graft (in short PDG). The idea is that the offspring gets a subtree from one parent, while the rest comes from the other parent.

Scheme

Cotta and Moscato [4] describe a version of this method working on rooted trees—because our candidate solutions are unrooted trees, we made small changes to the presented algorithm. PDG receives the two trees, T1 and T2, to recombine. The method starts by creating a copy of T1 called T_offspring, and then selects a subtree T_subtree in T2 at random. All species in this subtree are removed from T_offspring, and T_subtree is then attached to a random edge in the remainder of T_offspring. This procedure is illustrated in figure 5.7, and detailed pseudo-code is provided in algorithm A.1 on page 73.

Discussion

Since there are several requirements for phylogenetic trees, plain removal and reattaching of subtrees will not work. The recombination method must make sure that the created tree has n uniquely named leaves—representing the n species. Also, since the trees are binary, the number of internal nodes is fixed. PDG is the most popular recombination method for phylogenetic trees. The reason is quite clear, as it is hard to come up with recombination types that are as conceptually simple and actually transfer features from the parents.

[Figure 5.7 panels omitted; their captions read:]
(a) The first parent, T1.
(b) The second parent, T2. A subtree T_subtree containing the species a and b is marked.
(c) The copy T_offspring with the species in T_subtree removed. An edge is selected as insertion point.
(d) The result of the recombination, which is finally found by attaching the chosen subtree to the selected edge.

Figure 5.7: Illustration of PDG recombination.

If we want more control over how much the offspring differs from the parents, it is possible to select the subtree depending on its size. Both very small and very large subtrees will generally have a smaller impact than medium-sized subtrees. We will touch on this a bit more in section 8.2.

5.4 Mutation Operators

Mutation is the second way to alter a candidate solution in the genetic algorithm. It has been inspired by biological mutation, and attempts to maintain some genetic diversity in the population. One purpose of mutation is to allow the algorithm to escape being trapped in local minima, or to avoid them altogether by preventing individuals from becoming too similar to each other.

5.4.1 Swap

Swap is a mutation operator which alters a tree by swapping two leaf-nodes.

Scheme

The swap operator is fairly simple; it picks two different leaf-nodes at random and swaps them, as illustrated in figure 5.8. There is no swapping of two leaves connected to the same internal node, since that does not change the topology or the fitness of the tree.

[Figure 5.8 panels omitted.]
Figure 5.8: The mutation operator Swap. (a) Swapping species 'c' and 'e'. (b) The result.

Discussion

The change to the fitness score is hard to predict. If the two leaves are located close to each other, the fitness is generally not affected much. On the other hand, if the leaves are far apart, the swap may have a larger effect on the score.
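A sketch of Swap, assuming a hypothetical edge-list representation in which leaves are strings attached to integer internal nodes (not the thesis's data structure):

```python
import random

def swap_mutation(edges, rng=random):
    """Sketch of Swap: two random leaves with different neighbouring
    internal nodes exchange their positions in the edge list."""
    leaf_idx = [i for i, (u, _) in enumerate(edges) if isinstance(u, str)]
    while True:
        i, j = rng.sample(leaf_idx, 2)
        (a, pa), (b, pb) = edges[i], edges[j]
        if pa != pb:  # same internal node: swap would not change topology
            new = list(edges)
            new[i], new[j] = (b, pa), (a, pb)
            return new
```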

5.4.2 NNI

Nearest neighbor interchange (in short NNI) is a widely used method to mutate a tree. It does so by swapping two neighboring subtrees.

Scheme

NNI works by first randomly selecting an edge of the tree, which induces four subtrees as illustrated in figure 5.9(a). Two of the non-sibling subtrees are swapped, meaning that the method picks one of the two possible outcomes in fig. 5.9(b) at random.

[Figure 5.9 panels omitted.]
Figure 5.9: Illustration of the NNI method. (a) NNI is performed on the specified edge. (b) The two possible outcomes.

Discussion

Again, the change to the fitness is a bit difficult to predict, as it depends on the length of the selected edge between the two subtrees. We have favored NNI since its topological change is more local than that of the other methods. An alternative would be to pick the one of the three possible trees which yields the lowest least squares score. We have chosen not to use this approach because we wish to allow solutions to become worse, and let the selection operator decide which solutions survive to the next generation. Cotta and Moscato [4, p. 4] describe a version of NNI in which only two neighboring leaves are swapped. Yet another idea is to use a concept of mutation strength, as proposed in section 8.2.

5.4.3 SPR

Subtree pruning and regrafting (SPR) moves a subtree in the tree.

Scheme

The SPR operator chooses a random subtree, removes it from the tree and inserts it on a random edge in the remaining tree, as shown in figure 5.10. The method ensures that the tree remains a valid binary tree.

c b

c

a

e d

h

a d

h

g f

e

(a) Moving the subtree containing species ‘e’ and ‘f’ as indicated. . .

f

g (b) . . . and the result.

Figure 5.10: The mutation operator SPR

Discussion There is a risk that the subtree is inserted at the same place it was removed from, resulting in the same topology and therefore the same fitness. On the other hand, half the tree could be moved, potentially representing a huge change in both topology and fitness.

As with NNI, it is possible to examine all possible outcomes (trying all edges as insertion point) and then choose the best topology with regard to the least squares score.

5.5 Summary

In this chapter we have discussed a wide range of operators for the GA; to present them more clearly, we have summarized them in table 5.3.

Operator        Methods
--------------  --------------------------------------------------
Initialization  ✓ Random, Grid, Knowledge-based, ✓ NJ-based hybrid
Selection       ✓ Tournament, Roulette-wheel, ✓ Ranking, ✓ Elitism
Recombination   ✓ PDG
Mutation        Swap, ✓ NNI, SPR

Table 5.3: The various GA operators presented in this chapter. A check mark denotes that the particular method has been used extensively in our experiments.

Chapter 6

Controlling the Parameters of the GA

The connecting thread of this chapter (theory) is closely intertwined with that of the next (practice). On the one hand, the theory presents ideas, and it is the goal of the experiments to either support or reject them. On the other hand, it is the outcome of the tests which guides the theory forward. We found this to be the least disruptive way to present the material, as it also allows theory and practice to be read independently. We start by introducing the parameters of the genetic algorithm (sec. 6.1). Then, we examine non-adaptive parameter control, where parameters are either fixed or changed in some fixed, predetermined way (sec. 6.2). We attempt to find an optimal set of parameters through tuning on some sample data, but we realize that such a set cannot be obtained feasibly. In an effort to overcome these shortcomings, we then look at adaptive parameter control, where parameters change in reaction to the progress of the search (sec. 6.3). Figure 6.1 shows a taxonomy of the various parameter control techniques.

6.1 Parameters of a Genetic Algorithm

An important part of developing an evolutionary algorithm is to find the optimal values for the parameters of the algorithm. This section lists the common parameters of a genetic algorithm. In addition to these, each used operator might introduce a number of new parameters; the size of the pool in tournament selection, the ratio of random and NJ-based individuals, etc.

6.1.1 Population size

The population size, popsize, determines how many candidate solutions are allowed in the genetic algorithm. Because the recombination operator creates new individuals, it is the job of the selection operator to make sure that only popsize solutions survive to the next generation.

[Figure omitted: taxonomy tree. Parameter control splits into Non-adaptive (Constants, Functions) and Adaptive control (Measure-based, Self-adaptive, PSB).]

Figure 6.1: Parameter control of a genetic algorithm occurs either non-adaptively or adaptively; i.e. in some predetermined way or based on the progress of the evolution, respectively. PSB is an acronym for Population-structure-based, which we will not discuss in this thesis. This figure was originally presented by Ursem [22], and differs from the one suggested by Michalewicz and Fogel [14, p. 286] and Eiben et al. [5] (see appendix C). The main difference is that the latter focus on when the parameters are set, namely before or during the run. Ursem argues that "parameters are seldom determined without performing a few test runs, which makes the criterion when inadequate for distinguishing between techniques".

In his study of genetic algorithm parameters, De Jong found that "a small population size improved initial performance while large population size improved long-term performance" [11]. Population size obviously has a linear impact on the running time, so we do not want the population to be too large. But if the population is too small, we risk searching too little of the search space. We therefore need to balance these concerns.

6.1.2 Number of generations

Denoted ngen, the number of generations limits how long the genetic algorithm is to run. Generally, the longer we let the GA work, the more fit the best solution will become. One will rarely have enough time to try all solutions, so it is necessary to find a suitable value. Note that the genetic algorithm is not guaranteed to find the optimal solution even if it is given enough time to potentially try out all topologies. For this very reason, we cannot just let the GA run until the optimal solution is found—in contrast to brute force and branch and bound approaches, which examine the entire search space systematically.


6.1.3 Mutation rate

The mutation rate, pm, denotes the probability of a particular individual being mutated. Note that in the literature, pm is sometimes used to represent the probability of flipping a particular bit, if the individuals are encoded using a bit vector.¹ This makes it hard for us to translate the recommended parameters to our implementation. The mutation rate is an important parameter of the GA and to some extent depends on the recombination rate and the selection scheme. Since the effect of a mutation is random, it is not possible to know how often individuals are improved. There is no general guideline for the value of pm, but the higher it is, the slower the GA will be. From a performance point of view, we are therefore interested in keeping pm low. On the other hand, if the mutation rate reaches zero, the individuals will not be improved at all.

6.1.4 Recombination rate

Because the term crossover is traditionally used in preference to the term recombination, the recombination rate is most often referred to as pc in the literature. As in the case of mutation, pc indicates the probability of a particular individual being subjected to recombination. Similarly to the mutation rate, we cannot give a priori good values for the rate. The literature states that the recombination rate generally should not be less than 0.6 [5, p. 133].

6.2 Non-adaptive Parameter Control

In their study, Eiben et al. [5] list constant parameters used in the literature, e.g. pm = 0.001 and pc = 0.6 (De Jong), and pm = 0.01 and pc = 0.95 (Grefenstette). Cotta and Moscato [4] use pm = 0.01 and pc = 0.9 in their experiments. As these examples demonstrate, there is no single setup for all problems: "the scope of 'optimal' parameter setting is necessarily narrow. Any quest for generally (near-)optimal parameter settings is lost a priori" [5, p. 125]. Note that the listed mutation probabilities refer to the chance of flipping a bit in a bit-vector chromosome, whereas in our implementation the parameter denotes the chance of the entire individual being mutated.
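Assuming independent bit flips, a per-bit rate of the kind quoted above can be translated into a per-individual rate via 1 − (1 − p_bit)^L, which gives a feel for how the two conventions relate; the chromosome length L = 100 used in the comment below is a hypothetical example, not taken from the cited studies.

```python
def per_individual_rate(p_bit, length):
    """Probability that at least one of `length` independent bits flips,
    given a per-bit flip probability `p_bit`."""
    return 1.0 - (1.0 - p_bit) ** length

# De Jong's p_bit = 0.001 on a hypothetical 100-bit chromosome gives
# roughly a 9.5 percent chance that a given individual is mutated at all.
```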

6.2.1 Constants

We start by turning our attention to the first type of parameter setting: non-adaptive control with constants, where the parameters are fixed throughout the run. The appeal of this approach is its simplicity; we only have to find a single set of optimal parameters. The hope is that the mantra "parameter setting by analogy" is valid; that good parameters on one distance matrix are also good on other distance matrices. To this end, we examine the effect of changing the mutation and recombination rates. How this so-called tuning is performed is presented in section 7.1.

¹ In these cases pm is often on the order of 0.005, depending on the size of the encoding.

6.2.2 Functions

In section 7.1.3, we conclude that parameter tuning is infeasible; it is simply too time consuming and no single set of good parameters can be found. Therefore we will briefly mention the next type of non-adaptive parameter control, which uses functions. The idea is that the parameters change in some predetermined fashion, often based on the generation counter. A popular use of this idea is simulated annealing, where the algorithm in the beginning emphasizes exploration, but as time passes, the focus is gradually shifted to exploitation. Eiben et al. [5] mention an experiment by Fogarty in which he used a function scheme to decrease the mutation rate over time. An improvement over constant parameters was observed, emphasizing the opportunities of this approach. However, as the function scheme is a generalization of constant parameter control, it still requires extensive tuning. We therefore decided not to experiment much with this type of parameter control, and instead look at adaptive parameter control approaches.

6.3 Adaptive Parameter Control

Adaptive parameter control goes a step further than using functions to control the parameters: feedback received from the GA during the run is used to change the parameters dynamically. In their paper, Eiben et al. [5] list examples of adaptive parameter control used previously. They mention that Bäck has tried decreasing the mutation rate depending on the distance to the optimal solution (as opposed to decreasing it based on time, as under function-based parameter control). This is very nice if you actually know the optimal solution already—which you typically do not. Other examples include the widely used Rechenberg "1/5 success rule", which we will go in depth with in this section. Bäck [1] uses self-adaptive parameter control by adding the mutation rate as a part of the individuals being evolved. The expectation is that bad rates die, while the good rates survive. His idea was used by Fogarty and Smith, who implemented a steady-state GA (a special type of selection) and also used Rechenberg's 1/5 success rule for the mutation [5, p. 133].


6.3.1 Measure-based

Measure-based parameter control means that the parameters change based on feedback from the genetic algorithm. This could be the number of generations since the last improvement, the ratio of beneficial mutations, and so on. This approach is unlike the deterministic one, in which you simply start the run and hope that it works; here the algorithm takes the current status of the search into account to decide how to proceed.

Impatient Rule The impatient rule is a scheme we have developed to keep the solutions improving. It increases the rate of mutation and/or recombination if no good results have been found for some time. If there are several improvements within a short period of time, the algorithm decreases the rates again. Also, if the rates reach 0.0 or 1.0, the algorithm resets them. Obviously, this idea assumes that increased rates will lead to better results, and it might sound convincing if we consider the following: if more individuals are changed per generation, there is a greater chance that some of them are better than the current best. However, there are two problems. First, increased rates do not actually increase the chance of getting better results relative to the time used; if that were the case, rates of 1.0 would be preferred at all times, since we always want better results. Secondly, the increased rates are not 'free'. It takes time to recombine or mutate an individual, so even if higher rates generally resulted in better results, it would take longer than achieving the same results with smaller rates.

A variation could be that the algorithm, after a significant number of generations without improvements, decides to restart to see if a different initial population gives better results. That would be a more radical change, but the idea could just as well be used as a general termination condition.

Rechenberg's One-fifth rule A good example of adaptive parameter tuning is Rechenberg's "1/5 success rule".
This rule states that one fifth of all mutations should be successful—if the result of a mutation is better than its parent, it is considered successful. If more than one fifth of the mutations are successful, the step size should be increased, meaning that the changes become bigger and the chance of a worse result slightly larger. On the other hand, if less than one fifth of the mutations are successful, the step size is decreased. Rechenberg's 1/5 rule has been used, with modifications, several times. Julstrom varied the ratio between mutation and recombination based on their performance [5, p. 133]; each operator is used separately and rewarded based on its recent contributions.
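The generic step-size form of the rule can be sketched as follows. Note that this is the numeric-step-size version familiar from evolution strategies, not our mutation-type-switching variant described below; the multiplicative factor 0.85 is a commonly cited choice, but its exact value is an assumption here.

```python
def one_fifth_rule(step_size, success_ratio, factor=0.85):
    """Rechenberg's 1/5 success rule: grow the mutation step size when
    more than one fifth of recent mutations improved on their parent,
    shrink it when fewer did, and leave it unchanged at exactly 1/5."""
    if success_ratio > 0.2:
        return step_size / factor   # larger steps: explore more
    if success_ratio < 0.2:
        return step_size * factor   # smaller steps: exploit locally
    return step_size
```

Applied once per generation with the observed ratio of successful mutations, this keeps the search oscillating around the 1/5 target.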

We have used a variation of Rechenberg's rule: we try to adjust the "step size" every generation, so that every generation has 1/5 successful mutations. Our way of changing the step size is to change the mutation type. Since we have three different mutation types, we simply switch between them, regarding NNI as the smallest change, Swap as the medium change and SPR as the largest change. A better solution would probably be to implement mutation strength (as described in sec. 8.2).

Discussion Perhaps because we lacked proper control of the size of our steps, our tests with the impatient rule and the one-fifth rule (described in sec. 7.2.3) were not very good. The results showed that the impatient rule was the best, but it used a lot more time than the one-fifth rule. The results did not convince us that we had found the best scheme; the main problem was that the parameters of the two schemes still required tuning. Because we continued to struggle with tuning the parameters, we decided to try something completely different, namely self-adaptation.

6.3.2 Self-adaptive

Finding suitable parameters manually, in both deterministic and adaptive parameter control, is a lot of work. However, if a GA can evolve good solutions, it should also be capable of evolving good parameters. This idea leads us to self-adaptive GAs, which, while running, determine the best parameters for the current problem. Ideally, a self-adaptive algorithm explores not only the search space of the problem at hand, but also the 'parameter' search space. In other words, it evolves the parameters, which in turn are used to find the best solution. Another advantage is that it does not assume anything about how the parameters work. Figure 6.2 illustrates that the individuals of the self-adaptive genetic algorithm have been extended with the parameters: when a candidate solution is altered, the parameters pm and pc are changed in addition to the tree encoding.

[Figure omitted: an individual consisting of the tree encoding plus the parameters pc and pm.]

Figure 6.2: The individual used in self-adaptation. Instead of only containing the tree, it now also contains its own parameters.

The initial values of these parameters are currently assigned by the following scheme: Let P = {(0.1, 0.1), (0.2, 0.2), . . . , (1.0, 1.0)}, and let P[k] denote the k'th member of this set. The parameters of the i'th individual in the population are then (pm^i, pc^i) = P[i mod 10 + 1]. In the recombination phase, we use the pc value to determine whether the corresponding individual is subject to recombination. When two individuals are recombined, the recombination and mutation rates of the offspring are determined using arithmetic crossover:

p′ = r · p1 + (1 − r) · p2

The values p1 and p2 represent the rates of the parents, and r ∈ [0, 1] is a random number. Mutation is the next phase of the GA and starts out by mutating the rates themselves. It adds a Gaussian (normally) distributed value to each parameter:

p′ = p + N(0, 1) · 0.05

Here N(0, 1) denotes the standard normal distribution, i.e. the normal distribution with µ = 0 and σ = 1. The new value p′m is then used to determine whether the corresponding candidate solution should be mutated.

Discussion The results from our experiments with self-adaptation in section 7.3.2 were quite good. Compared to our previous ideas, self-adaptation was definitely better and overall less time consuming. Not all of the runs were equally good; e.g. some results would improve very slowly, so ideas for improving the scheme are still needed. We have mentioned some suggestions for improving our results in different ways in chapter 8.
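The rate inheritance described above can be sketched as follows. The arithmetic crossover and the Gaussian perturbation with a factor of 0.05 follow the formulas in the text; clamping the perturbed rates to [0, 1] is our assumption, as the text does not specify what happens at the boundaries.

```python
import random

def crossover_rates(p1, p2):
    """Arithmetic crossover of a parental rate pair:
    p' = r * p1 + (1 - r) * p2 for a uniform random r in [0, 1]."""
    r = random.random()
    return r * p1 + (1.0 - r) * p2

def mutate_rate(p, scale=0.05):
    """Perturb a rate with N(0, 1) * scale.  Clamping to [0, 1] is our
    addition; without it a rate could drift outside the valid range."""
    return min(1.0, max(0.0, p + random.gauss(0.0, 1.0) * scale))
```

Since arithmetic crossover is a convex combination, the offspring's rate always lies between the two parental rates; only the Gaussian mutation can push it outside that interval.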


Chapter 7

Experiments and Results

In this chapter we present the experiments and results of the parameter control techniques mentioned in the previous chapter: tuning of the constant non-adaptive parameters (sec. 7.1), measure-based adaptive methods (sec. 7.2) and finally self-adaptation (sec. 7.3). The main test bed for the experiments was a machine pool of eight computers running Red Hat Fedora Core 5 Linux—each with a hyper-threaded 3.20 GHz Intel Xeon processor, 2048 kB cache and 1024 MB RAM—assigned to BiRC.¹ A typical memory usage of the implementation was less than 25 MB, and if no other processes were running, it used 100 percent of the available processing time. The input used in this chapter is presented in appendix D, and in case the reader would like to try the GA implementation, we have made all our test results available on-line.² Note that due to the random behavior of the genetic algorithm, our results are most likely not perfectly reproducible. However, taking the average of a number of runs, the results should exhibit some similarity.

7.1 Parameter Tuning

As mentioned in section 6.2.1, we have designed a series of tests in an attempt to find some good parameters for the genetic algorithm. We say that parameters are good if the GA, by using these parameters, produces the most fit individuals. In this section we will first explain how the experiments are conducted, and then present and discuss the results.

¹ The machines birc01–birc08 seemed slightly unstable; during the experiments, we would often find that the computers had been rebooted. Although this did not directly influence our results (the tests could to some degree be resumed), the time wasted denied us some additional testing as the deadline drew nearer.
² Available at http://www.daimi.au.dk/˜fischer/speciale/testresults/.

7.1.1 Set-up

Considering the GA as a black box, we cannot make assumptions as to which parameters are good—had we been able to reason in advance about which parameters are good, there would be no need for tuning. Therefore, we resort to the customary practice of tuning, which is as follows:

1. Identify the parameters pi that we wish to tune, and a number ngen denoting how many generations we allow the algorithm to run.

2. Construct sampling sets Pi with numbers selected evenly distributed from the possible values of pi. For instance, suppose that pk is some probability measure, i.e. pk ∈ [0; 1]. If we decide to sample |Pk| = 11 parameter values, we would take Pk = {0.0, 0.1, . . . , 1.0}.

3. Let P = P0 × P1 × · · · be the Cartesian product of the samplings. Run the algorithm using each parameter tuple in P.

As mentioned, we wish to tune the genetic algorithm with regard to the mutation and recombination probabilities. Recall that these are denoted pm and pc, respectively.

Number of generations Before choosing ngen there are some issues we need to consider:

• If ngen is too low, we deny the GA the opportunity to evolve properly. This means that the initial population, and the luck of a few recombinations and mutations, would determine the outcome. The results would be mostly random, and we would have little idea about the quality of the tested parameters.

• On the other hand, if ngen is too large, the amount of computation needed would make the tuning infeasible. Additionally, if we already know that the GA will never be allowed to run more than e.g. 400 generations, there would be little point in setting ngen > 400.

We feel that ngen = 150 balances these two concerns, and it has been used in all experiments listed in this section. However, it should be noted that although we may find some parameter pair to be good after 150 generations, we really have no idea whether it is still good after additional iterations of the GA.
The reason for this is that an increased number of generations might represent a change in the problem we are trying to solve. In other words, the good parameters of a long-term strategy might not be applicable to a more short-sighted approach.

Sampling size Step two of the tuning process involves deciding the number of samples which go into Pi. This also takes some consideration:

• If |Pi| is large, we get a very detailed overview of which parameters are good and which are not. The problem is that by increasing the number of samplings, the computation time rises significantly.

• On the other hand, if |Pi| is too small, the result will be a crude overview of the parameters. Due to the low level of detail, we can expect many good parameters to be hidden, and thus go unnoticed.

• In practice, one would often start out with a small number of samplings. When a promising area of pairs is found, one would then focus on this region and perform a more finely sampled tuning.

Because pm and pc are probability measures, we have chosen to sample them at |Pm| = |Pc| = 11 parameter values (as in the example). We will not be using a multi-staged tuning, because at this point we are mainly interested in the distribution of the quality of the parameters.

Repeating the runs Each experiment is repeated five times and averaged, to lessen the effect of lucky random events. Although five runs are perhaps a bit too few, the computational burden of additional runs would reduce the number of experiments we could conduct. We are of course interested in good results, but if one of the five runs is noticeably better than the remaining four (and the remaining four are equally fit), we cannot claim the GA to perform well using the tested parameters—it was simply lucky. Note that we have not performed any statistical analysis on the results in order to decide whether to remove a single result; we present the average of the five runs, regardless of any obviously deviating results.

Test Following the discussed approach to tuning, for each problem we run the GA with parameters (pm, pc) ∈ {(0.0, 0.0), (0.0, 0.1), . . . , (1.0, 1.0)}.
By problem we mean which distance matrix (see appendix D), selection operators (ranking/tournament) and initial population (random/NJ-based) we test. With NJ-based initial population we refer to the generator in section 5.1.3, where every tenth individual is a mutation of the NJ tree. The GA is run for 150 generations, and is repeated five times.
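The grid enumeration and averaging described above can be sketched as follows. Here `run_ga` is a hypothetical stand-in for one 150-generation run of the GA on a given problem, returning its best least squares score; the function names are ours.

```python
import itertools

def tuning_grid(samples=11):
    """All (p_m, p_c) pairs sampled evenly from [0, 1], as in the tuning
    experiments: {0.0, 0.1, ..., 1.0} x {0.0, 0.1, ..., 1.0}."""
    values = [round(i / (samples - 1), 10) for i in range(samples)]
    return list(itertools.product(values, values))

def tune(run_ga, repeats=5):
    """For every parameter pair, average the best least squares score of
    `repeats` independent runs; `run_ga(p_m, p_c)` is the hypothetical
    callable performing a single run."""
    return {(pm, pc): sum(run_ga(pm, pc) for _ in range(repeats)) / repeats
            for pm, pc in tuning_grid()}
```

With 11 × 11 = 121 pairs and five repeats, one tuning experiment amounts to 605 GA runs, which illustrates why multi-stage refinement quickly becomes expensive.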


7.1.2 Results

Graphs in this section are presented in figures 7.1–7.7. To interpret what each of these tells us, we first examine the results presented in figure 7.1 in detail.

[Figure omitted: (a) three-dimensional graph; (b) contour map.]

Figure 7.1: Tuning the mutation- and recombination-rates on the ABA_WDS distance matrix, with 2-tournament selection and random initial population.

The figure consists of two sub-figures: to the left a three-dimensional graph (a) and to the right a contour map (b). Both present the same information: the average best least squares score after 150 generations of the genetic algorithm. As mentioned, each run is conducted five times and the average best is plotted in the figures. From the caption of the figure, we learn that the GA gets the ABA_WDS distance matrix as input, uses 2-tournament selection, and that the initial population is random. The two horizontal axes of the three-dimensional graph are the pc- and pm-rates, and the vertical axis represents the fitness. Note that the fitness of the neighbor-joining method on the various distance matrices can be seen in table D.1 on page 81. The graph clearly shows how the fitness changes as a result of changing parameters. It is a little hard, however, to see which combination of rates is associated with the smallest least squares scores. Therefore we also present the results using the contour map, which uses a nonlinear mapping of the fitness to a red–blueish color range. The range can be observed on the bar to the right, and is designed such that the top 15% of fitness values are clearly distinguishable (colored red) from the rest (colored in different shades of white to blue). As we expect, when there is no mutation or recombination—i.e. when (pc, pm) = (0, 0)—we end up with the fitness of the best individual of the

initial population. In the cases where we use a NJ-based initialization, it is unlikely that the randomly created trees will produce a better tree than those based on the NJ algorithm—and even more improbable that it would happen in more than one of the five runs. In some tests we observe that the (0, 0)-entry is notably less fit than all other entries, and in these cases we might opt to omit the (0, 0)-entry of the data set in the contour mapping. Otherwise it would not really be possible to differentiate the good fitness values from the less fit, because everything would be colored red, while the (0, 0)-entry would be blue. Returning to the figure, we notice that some good parameter pairs are located in the neighborhood of (0.1, 0.5), i.e. a recombination rate of 0.1 and a mutation rate of 0.5. Other good combinations are (0.1, 0.8), (0.4, 0.7) and (0.6, 0.8). We suggest that these are the parameters most likely to produce good results after a larger number of generations.

Identifying the Good Parameters In this section we will do a pair-wise comparison of the graphs in the order in which they are presented. We eventually hope to find some overall good parameters.

Comparing figs. 7.1 and 7.2 In these two experiments the GA uses 2-way tournament selection and a randomly generated population; the only difference is the two distance matrices. We observe fewer good parameters in 7.2 compared to 7.1, and only (0.0, 0.4) and (0.1, 0.4) appear to be overlapping, i.e. good parameters in both experiments. The goal of this comparison is to obtain some parameters which we can use with the genetic algorithm. The hope is that by using these good parameters on some unknown input data, the GA will be able to find the best solution efficiently. Most often we would keep the operators the same, tune the parameters on a few known distance matrices, and then run the genetic algorithm using the tuned parameters on some new distance matrices.

Comparing figs.
7.2 and 7.3 Contrary to the previous comparison, we now keep the distance matrix the same and change the selection operator: 2-way tournament and ranking selection, respectively. In both cases only a few good parameters are overlapping, and (0.3, 0.7) appears to be a good choice. As we mentioned in the last comparison, we would seldom try to find good overlapping parameters in experiments where operators are changed. The potential difference in the behavior of various operators is so great that it would make little sense to try to find such parameters. For instance, we would expect a completely random selection to behave quite dissimilarly to a selection scheme which only takes the top-2 candidate solutions.


[Figure omitted: (a) three-dimensional graph; (b) contour map.]

Figure 7.2: Tuning the mutation- and recombination-rates on the A4_EXTRA distance matrix. 2-tournament selection and random initial population.

[Figure omitted: (a) three-dimensional graph; (b) contour map.]

Figure 7.3: Tuning the mutation- and recombination-probabilities on the A4_EXTRA distance matrix. Ranking selection and random initial population.


[Figure omitted: (a) three-dimensional graph; (b) contour map.]

Figure 7.4: Tuning the mutation- and recombination-probabilities on the Adeno_E1A distance matrix. Ranking selection and random initial population.

Comparing figs. 7.3 and 7.4 Same operators again, but this time compared on a new matrix. Unlike the previous graphs, in figure 7.4 we observe a region of good parameters in the neighborhood of (1.0, 0.7). These parameters are entirely bad in 7.3. Therefore, we once again find very few good overlapping parameters. So far we have looked exclusively at experiments with GAs using a random initial population. We have noticed that changing either the distance matrix or the selection operator greatly influences the choice of good parameters. In general, the experiments so far have shown no good parameters in the lower right half of the graph: recombination rates less than 0.3 in combination with mutation rates higher than 0.4 seem to limit the area of good solutions.

Comparing figs. 7.4 and 7.5 This time the genetic algorithm uses the same distance matrix and operators. The difference is the initial population: random and NJ-based, respectively. We notice that in 7.5 all good parameters are in the lower half of the map. This is of course quite contrary to what we just discussed in the previous paragraph regarding the experiments using random initialization. However, like the selection operator, the initial population is seldom something one would wish to change; it corresponds to a very large shift in the problem that the genetic algorithm is trying to solve. Starting with a NJ-based population, the easiest way of enhancing the solution is probably to exploit the good solutions—the trees which are based on the NJ method. On the other hand, if the initial population is random, the GA would perhaps


[Figure omitted: (a) three-dimensional graph; (b) contour map.]

Figure 7.5: Tuning the mutation- and recombination-rates on the Adeno_E1A distance matrix. Ranking selection with the NJ-based initial population.

be allowed a higher degree of exploration because a better solution is more easily found.

[Figure omitted: (a) three-dimensional graph; (b) contour map.]

Figure 7.6: Tuning the mutation- and recombination-rates on the A4_EXTRA distance matrix. Ranking selection with the NJ-based initial population.


Comparing figs. 7.5 and 7.6 We use the same GA settings as before, having changed only the distance matrix. It is immediately noticed that in 7.6 the parameter choice appears to be completely indifferent to the outcome. We also notice that there is in fact an improvement over the (0, 0)-entry. Therefore the best solutions are most likely not found in the initial population, but should be found in the immediate neighborhood of the NJ solutions. By immediate neighborhood we mean that they are found through a small number of mutations and/or recombinations. If the genetic algorithm is allowed to run for 150 generations, it would thus be likely to come across the solution regardless of the mutation and recombination rates. We repeated the test shown in figure 7.6, but found the graphs to be strikingly similar, so the repetition has been omitted.

Comparing figs. 7.3 and 7.7 In this final comparison, we see two different outcomes of the same experimental set-up. They both have some good parameters in the upper left quadrant, but the characteristic feature of figure 7.3—the good parameters in the neighborhood of (1.0, 0.7)—is no longer present.

[Graphic omitted: (a) three-dimensional graph and (b) contour map of the LS score over the mutation and recombination rates.]

Figure 7.7: Making another tuning of the mutation- and recombination-probabilities on the A4_EXTRA distance matrix. Ranking selection with the random initial population.

This is discouraging at best; either tuning is entirely pointless, or we have just been very unfortunate in some run. It is conceivable that in trying to solve the phylogeny problem with our implemented genetic algorithm, the operator parameters have little or no effect on the outcome. Instead it might be the random initial population and the random behavior of the operators, combined with plain luck, that are the main determinants of the result. As mentioned, each tuning

was repeated a number of times in order to cancel out any deviating outcomes. Luck should therefore be a non-issue, but five runs might not be enough. Another explanation is that some part of the implementation had changed between the two runs. The two experiments were conducted September 26 and November 6, respectively. We found that the code had in fact been slightly rewritten, but we find it unlikely that this influenced the results in such a way.

7.1.3 Summary

The main motivation of parameter tuning is to find a single pair of parameters for the GA, with which the GA should perform optimally on a new, unknown distance matrix. However, as was most noticeable in figures 7.3 and 7.4, good parameter pairs in one experiment might be entirely bad in another. We observed that changing the selection operator, the initial population, and the distance matrix being worked on all had a prominent influence on what was considered good parameters. We expect similar observations when changing the other operators, and therefore we cannot expect good parameters to carry over to new operators.

    pc     Adeno     A4           pc      Adeno     A4
    0.0    22.6 h    11.2 h       0.6     49.8 h    26.4 h
    0.1    27.3 h    16.7 h       0.7     55.1 h    27.8 h
    0.2    32.2 h    19.2 h       0.8     67.5 h    29.9 h
    0.3    36.3 h    21.0 h       0.9     64.7 h    38.1 h
    0.4    41.5 h    -            1.0     69.2 h    42.5 h
    0.5    45.6 h    -            Total   21.3 d    11.3 d

Table 7.1: Time used for tuning the GA in the experiments of figures 7.5 and 7.6. As mentioned, for each of the listed pc entries, we need to perform five runs with eleven pm-values each. The two missing time measurements were lost because the computer was restarted towards the end of the experiment.

As we have stated a few times already, the tuning is very computationally demanding. Table 7.1 shows the actual time needed for performing the tuning. There are of course several ways of speeding things up, but the times listed illustrate just how cumbersome the tuning process really is. In short, we have not been able to find any overall good parameters, and the amount of computation time required to perform the tuning is huge. With this in mind we conclude that parameter tuning is not really a feasible approach.3

3 The hidden agenda of section 7.1 has been to discourage the use of tuning. This is the main reason why we felt it unnecessary to use a two-stage tuning, and why we refrained from extensive tuning of more distance matrices with the same GA settings.

7.2 Measure-based Parameter Control

In this section we present the experiments of the measure-based parameter control described in section 6.3.1.

7.2.1 Impatient

We conducted a few experiments using the impatient rule. The idea of the impatient rule is to increase the rates of mutation and recombination if a better result is not found within a specified number of generations.

Setup of test  Runs with and without an NJ-based initial population were made, and in both cases pm and pc were set to 0.5. To make the results less influenced by randomness, we performed ten runs and averaged them. We also chose to do 1000 generations with a population of 100 to increase the chance that the GAs without NJ can catch up with the NJ-based GAs; we did this because we realized that the NJ-based runs would otherwise enjoy a major head start. Ranking selection was chosen as the selection method due to the discussion in section 5.2.3. Finally, the impatient rule was set up to wait ten generations before increasing the probabilities of both mutation and recombination by 0.05. If a better individual is found two generations in a row, it decreases the mutation rate by 0.05. In case the rates exceed 0.95 or go below 0.06, they are reset to 0.5.
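The rule as described can be expressed as a small per-generation update step. The following is an illustrative sketch with names of our own invention, not the thesis implementation:

```python
def impatient_update(pm, pc, stagnant, improved_streak,
                     wait=10, step=0.05, low=0.06, high=0.95, reset=0.5):
    """One generation of the impatient rule (illustrative sketch).

    pm, pc          -- current mutation and recombination probabilities
    stagnant        -- generations since the best fitness last improved
    improved_streak -- consecutive generations with an improvement
    """
    if stagnant >= wait:          # impatient: crank up exploration
        pm += step
        pc += step
    if improved_streak >= 2:      # progress two generations in a row: calm down
        pm -= step
    if not (low <= pm <= high):   # out-of-range rates are reset
        pm = reset
    if not (low <= pc <= high):
        pc = reset
    return pm, pc
```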

7.2.2 One-fifth Rule

To make a direct comparison, we decided to make some runs with the one-fifth rule. It basically follows Rechenberg's rule, by which 1/5 of the mutations should be successful, i.e. result in a better individual. Depending on the number of successful mutations, the type of mutation is changed. Again, ranking selection was chosen as the selection method.

Setup of test  Since mutation is the most important operator in this scheme, we decided to set pc = 0.2 and pm = 0.5. As in the previous test, the GA ran for 1000 generations with a population of 100. We did ten runs with NJ-based and ten runs with random initial populations, and calculated the averaged best results.
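Rechenberg's original rule adjusts a numeric step size; a minimal sketch of that logic is given below. Note that our variant switches the mutation *type* rather than scaling a strength, so this is an illustration of the underlying rule, not of our implementation:

```python
def one_fifth_update(successes, attempts, strength, factor=1.5):
    """Rechenberg's 1/5 success rule (illustrative sketch).

    If more than a fifth of recent mutations were successful (produced a
    better individual), larger steps are affordable; if fewer, the search
    should become more local.
    """
    ratio = successes / attempts
    if ratio > 0.2:
        strength *= factor    # widen the search
    elif ratio < 0.2:
        strength /= factor    # narrow the search
    return strength
```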

7.2.3 Results for Impatient and One-fifth

In figure 7.8 the results of both the impatient rule and the one-fifth rule are presented. The experiments using the one-fifth rule show that the results found using

[Graphic omitted: LS versus generations for the impatient and one-fifth rules, each with and without NJ-based initialization.]

Figure 7.8: Graph showing the averaged results from 10 runs with the one-fifth rule and the impatient rule. This is done on the A4_EXTRA distance matrix with size 34.

NJ start very close to the best result of the run, and it can hardly be determined whether they even improve from the first generation to the last. A closer examination shows that the average overall best of the runs is 3.8966 while the first generation starts at 3.955, so there is a very small improvement. The results with random initialization improve slowly, but the rate of improvement seems to decline, and the results do not get close to the results from NJ.

Looking at the graph for the impatient rule, it is clear that the runs with NJ are very fast to find a good result, and only very small improvements are made during the 1000 generations. Without NJ the results slowly improve and end up close to the results from the runs with NJ.

Looking at figure 7.9, the results obtained using the one-fifth rule without NJ again show a slow improvement which seems to stagnate over time. The results using NJ start very low and keep improving a little. On the graphs showing the experiments with the impatient rule we notice that the results using NJ are again very good and within about 200 generations come very close to the best result. The results without NJ improve very fast in the beginning and continue to improve. They do not come close to the results with NJ, but it looks like 500 to 1000 generations more would improve them further.

[Graphic omitted: LS versus generations for the impatient and one-fifth rules, each with and without NJ-based initialization.]

Figure 7.9: Graph showing the averaged results from 10 runs with the one-fifth rule and the impatient rule. This is done on the Adeno_E1A distance matrix with size 41.

7.2.4 Discussion

Using both rules, it is clear that NJ is very efficient at finding good results. Especially the results using the impatient rule are very good when NJ is used. The results for the one-fifth rule without NJ are improving, but only very slowly. Possible reasons for this are a poor choice of mutation and recombination rates, or that the scheme of changing the mutation type simply does not work. Because we have no direct control of the mutation strength (see sec. 8.2), it is likely that Rechenberg's one-fifth rule does not translate well to our version. In addition, we could also have tried using different recombination schemes.

The impatient rule seems to ensure a slow and steady improvement of the results, and was noticeably faster than one-fifth. A reason for its good results could be that it tries many different individuals due to its potentially high mutation and recombination rates. We have no exact data about the rates, but we know indirectly that they all lie in the range from 0.5 to 0.95.

To explain the big difference between the results from the impatient rule and the one-fifth rule, it helps to examine the running times listed in table 7.2. It is clear that the runs with the impatient rule generally take significantly longer than those with the one-fifth rule. This is at least part of the reason for the difference in results, but we cannot be sure that one-fifth would give better results even with more time.

    Experiment               Impatient rule   One-fifth rule
    A4_EXTRA     random      106 h            49.7 h
                 NJ-based    142 h            48.0 h
    Adeno_E1A    random      36.4 h†          92.7 h
                 NJ-based    218 h            90.3 h

Table 7.2: Running times for all ten runs of the experiments with the impatient and one-fifth rules. The measurement marked with † seems out of place; we have no clue why it ran so much faster.

We can think of three possible reasons for one-fifth giving consistently worse results. The first is that the start parameters are not good enough. The second is that we have too little control of the mutation, i.e. we need better ways to change the step size. The third is that Rechenberg's rule is simply futile in our case. The first problem can be overcome with tuning, but that would be time consuming. The second can possibly be solved by introducing a concept of mutation strength (described in sec. 8.2). The third can only be solved by using another scheme.

Although the results made with the impatient rule were quite good, we still believe that the rule can be improved upon. Perhaps the start rates were bad, and different parameters are needed for mutation and recombination. The only solution is again to perform extensive tuning of the parameters. So a general problem is that without tuning of the parameters we cannot conclude whether the schemes are bad or just not tuned correctly. The one-fifth rule needs big changes to become better, while impatient seems better; but to improve it we would need to tune even more parameters than in the non-adaptive approach.

7.3 Self-adaptive Parameter Control

We now present the experiments using self-adaptation from section 6.3.2.
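As a reminder of the scheme (sec. 6.3.2 and the conclusion), the mutation and recombination rates travel with each individual and are themselves varied when offspring are created. A minimal sketch, with names and the noise model being our own illustration:

```python
import random

def make_offspring(parent, sigma=0.1, rng=random):
    """Self-adaptation sketch: each individual carries its own rates.

    `parent` is a dict holding a tree plus its personal pm and pc.  The
    rates are perturbed first; the variation operators would then be
    applied with the child's own rates, so good rate settings hitch-hike
    along with good trees through selection.
    """
    def perturb(rate):
        # Gaussian noise, clamped so the rates stay usable probabilities
        return min(0.95, max(0.05, rate + rng.gauss(0.0, sigma)))

    return {"tree": parent["tree"],
            "pm": perturb(parent["pm"]),
            "pc": perturb(parent["pc"])}
```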

7.3.1 Setup of Test

To more clearly see the difference from one run to the next, it was decided to conduct four runs and plot the results individually. We increased the number of generations to 1500 to give each test without NJ time to "catch up" with the NJ-based runs. In order to compare these results with the previous tests, we kept the population at 100 and used ranking selection.

Results  In figure 7.10 we notice a big difference between the runs. Run 1 improves very slowly until the 800th generation, where it suddenly improves swiftly and ends up below the NJ result after 1100 generations. Runs 2–4 improve more rapidly and pass beneath NJ after only 500–700 generations. The best fitness obtained is 60.65 and the worst is 65.22.

[Graphic omitted: best-fitness curves for runs 1–4 and the NJ result.]

Figure 7.10: Graph showing the best results from four runs with self-adaptation. This is done on distance matrix Adeno_E1A of size 41 with a random initial population.

The experiments depicted in figure 7.11 use an NJ-based initialization. All four runs follow each other closely, and within 200 generations they all have a fitness below 61; they then stagnate completely, with the best individuals ending up with fitnesses of 60.65 (with small variances in the following digits).

7.3.2 Discussion

The self-adaptive algorithm with a random population generally yields good results, but run 1 of figure 7.10 clearly shows that it might not give good results fast every time. The experiments using NJ start with excellent results and could actually have been stopped after a few hundred generations.

It is natural to compare these results with those from the runs with the impatient and one-fifth rules in figure 7.9. In this comparison self-adaptation has much better results than both impatient and one-fifth when no NJ is used. After 1000 generations the self-adaptive runs are all below 75, while impatient ends at 75. Impatient runs with NJ give some nice results, but take longer to stop improving and still end at 60.68, which is marginally more than the 60.65 that all the self-adaptive runs have already reached after 1000 generations. The average running time for the 1500 generations was 20 hours per run; compared to the impatient and one-fifth rules, 1000 generations would take 13 hours, and with 10 runs that would be 130 hours, where the impatient runs on Adeno_E1A of size 41 take 218 hours and one-fifth takes 90.3 hours. So self-adaptation turns out to give nice results without being too slow. It generally yields good results and is fast to improve on results from NJ.

[Graphic omitted: best-fitness curves for runs 1–4 and the NJ result.]

Figure 7.11: Graph showing the best results from four runs with self-adaptation. This is done on distance matrix Adeno_E1A of size 41 with NJ-based population.

7.3.3 Comparing Self-adaptive with Constant

To determine how useful self-adaptation is, we decided to compare it with runs using constant rates.

Setup of Test  Four new distance matrices were selected to determine how good the results of the self-adaptive approach are on different distance matrices compared to using constant rates. The GAs were set to run 500 generations five times with a population size of 50. This size is perhaps a bit low, especially for the self-adaptive setting, since it leaves only about five individuals with each probability, making it less robust to randomness. All runs use random initial populations. Three configurations were tried on each distance matrix:


• CR1 is a test with fixed values of the mutation and recombination rates, i.e. constant rate. It uses the overall best parameters from the parameter tuning tests (see sec. 7.1): pc = 0.1 and pm = 0.4.

• SA is a test with self-adaptation. The parameters are initialized in the same way as in the previous test runs.

• CR2 is also a test with constant rate, and uses parameters based on the average rates from all SA runs. These are set to pc = 0.8 and pm = 0.1. Note that this knowledge is not available before the experiments with SA have been concluded.

Note that the parameters found by the self-adaptive GA are not necessarily good when used with constant rate parameter control. The reason is the same as mentioned for the number of generations (see sec. 7.1.1); namely that they are essentially two different problems.

Results

[Graphic omitted: LS versus generations for CR1, SA, CR2 and the NJ result.]

Figure 7.12: Graph comparing the results of a run with self-adaptation and two different runs with constant rate parameters. The distance matrix is Adeno_E1B_55K of size 28.

In figure 7.12, CR1 performs considerably worse than both of the others, even if it seems to close the gap a bit when the others stagnate around 350 generations. SA and CR2 are very close to each other, but CR2 gives slightly better results.

[Graphic omitted: LS versus generations for CR1, SA, CR2 and the NJ result.]

Figure 7.13: Graph comparing the results of a run with self-adaptation and two different runs with constant rate parameters. The distance matrix is Adeno_E3_14_5 of size 31.

Figure 7.13 shows that CR1 again gives worse results than SA and CR2; it is consistently much worse. SA performs quite well, but CR2 is clearly better for most of the run: SA is behind CR2 from generation 50 until it catches up around generation 470. SA does not catch up in figure 7.14, and CR1 once again consistently gets bad results, with CR2 being the best. SA and CR2 improve quite well, while CR1 improves rather slowly. In figure 7.15 we see CR1 performing clearly worse than SA, which again is only slightly worse than CR2 through most of the run.

7.3.4 Discussion

It is clear that the parameters in constant rate are very important; bad parameters are likely to produce bad results, while the self-adaptive runs gave consistently good results. That being said, the good parameters in CR2 gave better results than the self-adaptive runs, but of course they have to be found first. The self-adaptive runs would probably benefit more from an increase in population size than constant rate, but the fact that the self-adaptive scheme does not need any tuning and requires no knowledge of the problem makes it a convenient alternative to non-adaptive parameter control.

[Graphic omitted: LS versus generations for CR1, SA, CR2 and the NJ result.]

Figure 7.14: Graph comparing the results of a run with self-adaptation and two different runs with constant rate parameters. The distance matrix is AAL_decarboxy of size 37.

[Graphic omitted: LS versus generations for CR1, SA, CR2 and the NJ result.]

Figure 7.15: Graph comparing the results of a run with self-adaptation and two different runs with constant rate parameters. The distance matrix is ABG_transport of size 39.

Looking at measures besides the results, it is worth noting the running times as presented in table 7.3. It is clearly noticed that the CR1 runs are much faster than the test with self-adaptation, and that CR2 is slightly faster than SA while having better results. The reason why self-adaptation is slower is partly the time it takes to alter the parameters, and to a lesser extent the extra statistical computations it performs.

    Size   CR1      SA       CR2      NJ
    28     2.83 h   5.28 h   4.53 h   0.71 s
    31     4.00 h   8.33 h   6.35 h   1.09 s
    37     7.25 h   12.6 h   11.5 h   1.33 s
    39     8.62 h   16.9 h   13.7 h   1.70 s

Table 7.3: Running times for the test comparing CR1, CR2, SA and the NJ algorithm. Note that the GAs ran for 1500 generations.

Another value recorded by all the runs is the average number of unique individuals per generation. For performance reasons, this is measured every ten generations, and the individuals are only compared to the other individuals in the current population (not to individuals from previous generations). The selection type will obviously have a big impact on this measure, so it is important that it is the same for the different runs. The measure also shows whether the algorithm is truly testing many different individuals or only a few.

    Size   CR1      SA       CR2
    28     41.2 %   58.4 %   55.4 %
    31     42.2 %   62.8 %   56.0 %
    37     41.0 %   54.4 %   55.0 %
    39     41.6 %   60.8 %   56.2 %

Table 7.4: Average number of unique individuals for the test comparing SA with CR1 and CR2. The number of unique individuals is found every 10 generations.

In table 7.4 we see that the self-adaptive runs are generally a bit more diverse than CR2 and much more diverse than CR1. More unique individuals could indicate that a larger part of the search space is being covered, making good results more likely. SA has more unique individuals, but CR2 has almost as many, so we cannot say that SA is better in this respect. Self-adaptation is still a good scheme: even if it runs slightly slower and its results are slightly worse than a well-tuned constant rate implementation, it gives good results and does not demand any prior knowledge of the problem.

7.4 Summary

In this chapter we have examined how different parameter control techniques influence the genetic algorithm. We are especially interested in the cases where the GA is able to find better solutions than the neighbor joining algorithm; it is obviously unable to compete on running time. It was found that both constant non-adaptive and measure-based adaptive parameter control require extensive tuning before they can be used properly in practice on a particular distance matrix. For the former, it was furthermore shown that the same parameters cannot be expected to work on other input. Then self-adaptation was examined. This approach removes any need for tuning, and we found it to work quite well compared to the previous methods and to neighbor joining. In addition, we noticed that the end-parameters evolved with SA were greatly superior to those found by the tuning process. The NJ-based GAs gave good results quickly, and were in all experiments able to improve on the solution found by the NJ algorithm. Without this head start, the GA using the self-adaptive approach proved to be the most successful on new input data.


Chapter 8

Future Work

We did not put much effort into making the implementation fast and efficient. At one point we tried implementing the faster LS algorithm by Bryant and Waddell [3], as mentioned in section 3.2.1. We failed to gain any performance, and we therefore decided to focus on other areas. As a result, good candidates for code optimization include the data structures and operators of our GA. In this chapter we suggest future work in addition to improving the code: representing the search space of tree topologies, the concept of mutation strength, speeding things up by utilizing parallel processing, and finally a quality criterion for genetic algorithms.

8.1 Tree Topology Search Space

In the beginning of this thesis we explained how big the search space of tree topologies is. The definition of this space is quite clear; however, as we have mentioned a few times already, it lacks any notion of locality. One idea, which we mention here, is to use a small number of metrics and use these to calculate a set of distances to some origin-tree. These distances can then be used to plot the tree in a multi-dimensional coordinate system, as illustrated in figure 8.1. In order for this scheme to work, we would at least need it to meet the following criteria:

Injective  The mapping from actual topologies to the search space needs to be injective (a one-to-one function). In part because no two trees should have the same coordinates; more importantly because we should be able to perform the inverse mapping: given a set of coordinates, it should be possible to obtain the topology.

Locality  If two trees are nearly identical, they should be close in the search space. Conversely, if they are very different, they should be far apart.

[Graphic omitted: three axes labelled Quartet distance, NNI distance and Similarity distance.]

Figure 8.1: Using various metrics to represent topologies in a three-dimensional coordinate system. It must be designed such that there is a one-to-one relationship between topology and coordinates. Furthermore, if two topologies are significantly different, they should not be neighbors; i.e. locality should be preserved.

Gottlieb et al. [9] also list heritability as a crucial part of a coding implementation; solutions generated by recombination should combine features of their parents.
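The coordinate idea can be sketched abstractly. The metrics below are toy stand-ins on Newick-like strings, purely for illustration; real axes would use quartet, NNI and similarity distances:

```python
def embed(tree, origin_trees, metrics):
    """Sketch of the section-8.1 idea: one coordinate axis per metric,
    each coordinate being the distance from `tree` to an origin tree
    under that metric.  Any callables work as metrics here."""
    return tuple(metric(tree, origin)
                 for metric, origin in zip(metrics, origin_trees))

# Toy stand-in metrics, purely for illustration:
hamming = lambda a, b: sum(x != y for x, y in zip(a, b))
size_gap = lambda a, b: abs(len(a) - len(b))
```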

8.2 Mutation Strength

When trying to solve a continuous problem (as opposed to a discrete one, as in our case) using genetic algorithms, mutation is often accomplished by adding Gaussian/normal-distributed noise to the solution. This means that small changes are more likely to be applied than large changes. How can we do this in our combinatorial setting, when we have no search space of the kind suggested in the previous section? The solution is to be able to change the mutation strength; a problem with our current mutation is that we have no control over how strong the mutation actually is. Mutation strength corresponds closely to Rechenberg's notion of step size in evolutionary strategies (sec. 6.3.1).

Assume that we are looking at NNI, where we need to find an edge to mutate. One idea is to choose this edge according to its length; mutating across short edges makes only small changes to the LS score, while mutating across very long edges makes larger changes. Therefore, if the mutation strength is low, short edges are favored; conversely, long edges are preferred if the strength is high. We have illustrated this idea in figure 8.2, where the current strength is 0.45. The Gaussian-like distribution indicates the probability of a particular edge being selected for mutation. In this case we are most likely to choose


either edge 4 or 5 for mutation. There is a small chance of selecting edges 3 or 6, while it is very unlikely for the remaining edges to be mutated.

[Graphic omitted.]

(a) A tree with nine edges; names ordered according to length. (b) Probability distribution of the edges. Lower axis is edge name, upper axis is strength.

Figure 8.2: Example illustrating the concept of mutation strength.

A similar distribution could be made for the other mutation and recombination operators. In general, we feel that these methods should be fairly easy to implement. They would then allow the one-fifth rule (see sec. 6.3.1) to work using mutation strength, as was originally intended by Rechenberg. Currently our implementation works by changing the mutation type, which does not really ensure that mutations become consistently stronger or weaker.
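The strength-biased edge choice of figure 8.2 can be sketched as a Gaussian-like bump over the length-ranked edges. All names and the width constant below are our own choices, not the thesis implementation:

```python
import math

def edge_probabilities(n_edges, strength, width=0.12):
    """Mutation-strength sketch for NNI edge selection.

    Edges are assumed sorted by increasing length; rank i maps to
    position i/(n_edges-1) in [0, 1].  A Gaussian-like bump centred at
    `strength` then favours short edges for low strengths and long
    edges for high strengths (our illustration of figure 8.2).
    """
    positions = [i / (n_edges - 1) for i in range(n_edges)]
    weights = [math.exp(-((p - strength) ** 2) / (2 * width ** 2))
               for p in positions]
    total = sum(weights)
    return [w / total for w in weights]
```

For nine edges and strength 0.45, the distribution peaks around edges 4–5, matching the example in the figure.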

8.3 Parallel Processing and Multi-population GAs

Brauer et al. [2] implemented a parallel version of GAML (a genetic algorithm for solving maximum likelihood). It works by having a master processor create a population of individuals and then distribute each individual to separate processors to be scored. A slightly less than linear speed-up is observed. This performance gain, however, decreases as the number of processors increases; the reason is mainly the increased communication overhead involved, with data being sent to and from the various processors.

The idea of a multi-population GA is presented in the same article. When maintaining several populations, it appears as though each population explores different parts of the solution space. The suggestion in the article is therefore to evolve each population until the fitness converges, and then to mix and recombine some of the individuals in them. We previously stated that one run of a GA with ten individuals has more possibilities than ten runs of a local search algorithm. Similarly, a multi-population GA with ten GAs has a wider range of possibilities than a single GA run independently ten times.

Taking things a step further, the final idea is to combine parallel processing with the multi-population GA, as illustrated in figure 8.3. Each processor

[Graphic omitted: a master node connected to slave nodes 0 through m−1.]

Figure 8.3: The concept of a parallel multi-population genetic algorithm.

maintains its own population, and when the fitness converges the master processor takes care of mixing the individuals. The speed-up is expected to be very close to linear, as there would be little communication between the processors.
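The master's mixing step could look roughly like the following sketch, which migrates each island's best individuals to its ring neighbour. The ring topology and migrant counts are our own assumptions, not taken from Brauer et al.:

```python
def mix_islands(islands, n_migrants=2):
    """Multi-population sketch: migrate each island's best individuals.

    Each island is a list of fitness values (lower = better, as with our
    LS score).  When all islands have converged, the master would call
    something like this to send each island's best into its ring
    neighbour, replacing the neighbour's worst.
    """
    best = [sorted(pop)[:n_migrants] for pop in islands]
    mixed = []
    for i, pop in enumerate(islands):
        incoming = best[(i - 1) % len(islands)]          # ring neighbour's best
        survivors = sorted(pop)[:len(pop) - n_migrants]  # drop the worst
        mixed.append(survivors + incoming)
    return mixed
```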

8.4 Quality Criterion

How do we convince ourselves that the genetic algorithm is any good? Currently, we look at the least squares score of the single best individual of a particular run, and this score is the primary measure of how good the GA is. Obviously, based on this score alone, we cannot tell whether the entire population has been trapped in some local optimum, whether the GA searches completely disjoint regions on different runs, and so on. Therefore we must consider how the search space is covered and to what extent this coverage is reproduced on various runs of the algorithm. Wehrens et al. [23] and Reijmers et al. [17] present four quality criteria for comparing genetic optimization algorithms:

• Coverage of the relevant search space.

• Reproducibility of the coverage of the relevant search space.

• Coverage of the total search space.

• Reproducibility of the coverage of the total search space.

Because our goal is to minimize the LS measure, we could define a part of the search space as being 'relevant' if the solutions in it are near-optimal. However, since we do not know the optimal solution (let alone its LS score), we have to make do with near-best, i.e. close to the best solution found in some run. The relevant search space is thus subject to change if a later run uncovers a better solution.

The coverage of the two search spaces is determined by how the candidate solutions of the GA are spread throughout the search space. However, the lack of a search space representation, as mentioned in section 8.1, makes this hard to measure. If all the individuals quickly converge on a near-optimal solution, we are not really happy, because this might indicate that the search is trapped in a local minimum.

Determining the reproducibility of the coverages is also important, as it gives some idea of the predictability of the genetic algorithm. It is accomplished by performing a series of runs and keeping information on how the search space is explored. If the observed coverages are consistent in all runs, we are assured that the GA is not exhibiting overly random behavior. On the other hand, we might experience that in half the runs the search space is nicely explored, and in the other half the search is trapped in some optimum. In situations like these we would have little confidence in how the GA will perform in a single independent run.

Currently, we only have the score of the best solution, the average score of the population, and finally an average percentage of unique individuals. This percentage is calculated every tenth generation by doing a pairwise comparison of the candidate solutions. If a particular solution does not match another in the population, we say that it is unique; note that we do not compare the individuals of different generations. These measurements give us scarce information about the coverage of the search space, and more work is needed to properly evaluate the four criteria.
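The uniqueness percentage described above can be sketched as follows. Individuals here are hashable stand-ins (e.g. canonical topology strings); the thesis compares the trees pairwise instead:

```python
def unique_fraction(population):
    """Sketch of the diversity measure: an individual is unique when no
    other member of the current population equals it."""
    counts = {}
    for individual in population:
        counts[individual] = counts.get(individual, 0) + 1
    unique = sum(1 for individual in population if counts[individual] == 1)
    return unique / len(population)
```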


Chapter 9

Conclusion

To solve the phylogeny problem with respect to least squares, we have implemented a genetic algorithm. Starting by tuning the rates of mutation and recombination, we conducted experiments to find good parameters for the GA. However, we quickly realized that poor choices of these parameters cause the GA to end up with bad solutions. It was also clear that even though good constant parameters give good results, the process of parameter tuning is a very time consuming task, and the parameters found are not guaranteed to be good for unknown input.

The logical answer to these shortcomings was to examine adaptive parameter control mechanisms. Based on theory and several small tests, we decided to experiment with an implementation of Rechenberg's "1/5 success" rule and of our own impatient rule. The results looked reasonable, but because of the new possibilities in the two schemes, we now needed to tune even more parameters. We therefore needed a method that would work for several problem instances without too much tuning of parameters. The solution was self-adaptation, in which the mutation and recombination rates are included in the individuals. This approach showed good results compared to runs of the impatient and one-fifth rules in our first tests. In addition, it was far superior to the constant rate approach, because it was able to find good solutions on new data sets.

In addition to controlling the parameters, we looked at ways to give the GA better start conditions. We tried a knowledge-based initialization, in which the result of neighbor joining was included in the population. Tests showed that on the distance matrices used, we could quickly find an improvement over the NJ algorithm. Additional experiments showed that even without a known heuristic for the particular problem, the GA still managed to come up with good solutions. Of the tested parameter control mechanisms, we found the self-adaptive GA to be the best search method.
An obvious question for many problems is whether the NJ solution by itself is good enough. We have shown that the NJ-based self-adaptive GA improves upon the solutions found by NJ. The cost of this improvement is time. It is therefore up to the user whether the need for better solutions outweighs the additional time needed to achieve them.

We have presented some ideas for future work in chapter 8. We especially find the concept of mutation strength promising and not too time consuming to implement. An issue which has influenced our work is that our tests took a very long time to conduct. The elaborate solution would be to optimize our code or use parallel processing; the simpler solution would have been to use smaller distance matrices. We did, however, finish all the tests we had planned, so we suggest these ideas as future work.

In closing, we have in this thesis demonstrated that by using a genetic algorithm-based approach to solve the phylogeny problem, it is possible to achieve better results than the neighbor joining algorithm. Therefore, if an improvement to the solution is considered worth the increased running time, the method presented here is a sound choice.

A

Algorithms

A.1

Prune-Delete-Graft algorithm

The PDG algorithm which we have implemented is based on the one by Cotta and Moscato [4]. The pseudo code is presented in algorithm A.1, and an example is illustrated in figure 5.7.

procedure PDG-Recombination(tree T1, tree T2)
    select a subtree T from T2
    for every leaf-node l in T do
        remove l from T1 while keeping the tree legal
    end for
    select a random edge e in T1
    insert T into T1 by an edge to the middle of e
end procedure

Algorithm A.1: Prune-Delete-Graft recombination.


A.2

Neighbor Joining

We use a slightly modified version of the neighbor joining algorithm presented by [7, pp. 166–168]. A description of our implementation is given in algorithm A.2, and the helper function, Join, is given in algorithm A.3. The algorithm is illustrated in figure 4.1.

procedure NeighborJoining(distance matrix D, set S of species)
    while |S| > 2 do
        for all s ∈ S, compute u_s ← Σ_{t≠s} D_{st} / (n − 2)
        choose i and j for which D_{ij} − u_i − u_j is smallest
        Join(i, j, D, S)    ▷ Join species i and j using helper function
    end while
    connect the two remaining species l, m by a branch of length D_{lm}
end procedure

Algorithm A.2: Pseudo code describing the neighbor joining algorithm.
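As a concrete illustration of algorithms A.2 and A.3, a minimal Java version might look as follows. This is our own standalone sketch, not the thesis implementation: leaves are numbered 0..n−1, internal nodes are numbered n, n+1, ..., and the resulting tree is returned as an edge list of {node a, node b, branch length} triples.

```java
import java.util.*;

// Standalone sketch of neighbor joining (algorithms A.2 and A.3).
class NJ {
    static List<double[]> neighborJoining(double[][] dist) {
        int n = dist.length;
        List<Integer> nodes = new ArrayList<>();
        for (int i = 0; i < n; i++) nodes.add(i);
        Map<Long, Double> D = new HashMap<>();        // distances between live nodes
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) D.put(key(i, j), dist[i][j]);
        int next = n;
        List<double[]> edges = new ArrayList<>();
        while (nodes.size() > 2) {
            int m = nodes.size();
            Map<Integer, Double> u = new HashMap<>(); // u_s = sum_t D(s,t) / (m - 2)
            for (int a : nodes) {
                double s = 0;
                for (int b : nodes) if (a != b) s += D.get(key(a, b));
                u.put(a, s / (m - 2));
            }
            int bi = -1, bj = -1;
            double best = Double.POSITIVE_INFINITY;   // minimize D(i,j) - u_i - u_j
            for (int x = 0; x < m; x++)
                for (int y = x + 1; y < m; y++) {
                    int a = nodes.get(x), b = nodes.get(y);
                    double q = D.get(key(a, b)) - u.get(a) - u.get(b);
                    if (q < best) { best = q; bi = a; bj = b; }
                }
            double dij = D.get(key(bi, bj));
            int v = next++;                           // the new super species
            edges.add(new double[]{bi, v, dij / 2 + (u.get(bi) - u.get(bj)) / 2});
            edges.add(new double[]{bj, v, dij / 2 + (u.get(bj) - u.get(bi)) / 2});
            for (int k : nodes)
                if (k != bi && k != bj)
                    D.put(key(v, k), (D.get(key(bi, k)) + D.get(key(bj, k)) - dij) / 2);
            nodes.remove(Integer.valueOf(bi));
            nodes.remove(Integer.valueOf(bj));
            nodes.add(v);
        }
        int a = nodes.get(0), b = nodes.get(1);       // connect the last two
        edges.add(new double[]{a, b, D.get(key(a, b))});
        return edges;
    }

    private static long key(int a, int b) {           // order-independent pair key
        return a < b ? ((long) a << 32) | b : ((long) b << 32) | a;
    }
}
```

On an additive matrix this sketch, like the real algorithm, recovers the generating tree exactly; on real data it produces the NJ tree that our GA then tries to improve.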

function Join(i, j, D, S)
    construct a node v_ij and make branches with lengths:
        |(i, v_ij)| = D_{ij}/2 + (u_i − u_j)/2
        |(j, v_ij)| = D_{ij}/2 + (u_j − u_i)/2
    compute the distance between v_ij and each remaining species k as
        D_{v_ij,k} = (D_{ik} + D_{jk} − D_{ij})/2
    remove i and j from D and replace them by v_ij
end function

Algorithm A.3: Helper function for the neighbor joining algorithm; joins two nodes i, j ∈ S and updates the given distance matrix D and the set of species S accordingly. The node v_ij is afterwards considered a super species.


A.3

Construction of the Prüfer Sequence

Given a tree, we calculate the Prüfer sequence by the method of Reijmers et al. [17] as presented in algorithm A.4. Note that in the article it says rightmost instead of leftmost.

procedure Prüfer-Encoding(tree T)
    label all leaves and nodes
    while more than two leaves remain do
        i ← the lowest numbered leaf; j ← the node that is connected to i
        j is the leftmost digit of the Prüfer sequence
        remove leaf i from the tree and, if necessary, update the set of leaves
    end while
end procedure

Algorithm A.4: Encoding a Prüfer sequence.

Conversely, given a Prüfer sequence a_1 a_2 ··· a_{n−2} ∈ {1, ..., n}^{n−2}, we wish to find the corresponding topology. Each node's degree is one more than the number of times the node's label appears; because we assume binary trees, each internal node should appear twice in the sequence. To identify the edges of the topology, we use the algorithm listed in algorithm A.5. This version differs slightly from the one presented by Gottlieb et al. [9], which does not assume that the trees are binary.

procedure Prüfer-Decoding(Prüfer sequence a_1 a_2 ··· a_{n−2})
    i ← 1
    repeat
        find the node v of degree 1 with the smallest label
        the edge (v, a_i) is in the tree topology
        decrement the degrees of v and a_i; increment i
    until all nodes have degree 0, except two with degree 1
    the edge between the remaining two nodes is in the topology
end procedure

Algorithm A.5: Decoding a Prüfer sequence.
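The two Prüfer algorithms can be sketched in Java as follows. This is our own standalone illustration, not the framework's PruferTree: it uses 0-based labels and plain adjacency sets, and digits are recorded in removal order, which is the same order the decoder consumes them in, so the pair round-trips.

```java
import java.util.*;

// Standalone sketch of Prüfer encoding (algorithm A.4) and decoding (A.5).
class Prufer {
    /** Encode a labeled tree, given as adjacency sets for nodes 0..n-1. */
    static int[] encode(List<Set<Integer>> adj) {
        int n = adj.size();
        List<Set<Integer>> g = new ArrayList<>();        // work on a copy
        for (Set<Integer> s : adj) g.add(new TreeSet<>(s));
        int[] seq = new int[n - 2];
        for (int k = 0; k < n - 2; k++) {
            int leaf = -1;
            for (int v = 0; v < n; v++)                  // lowest numbered leaf
                if (g.get(v).size() == 1) { leaf = v; break; }
            int j = g.get(leaf).iterator().next();       // the node connected to it
            seq[k] = j;
            g.get(leaf).clear();                         // remove the leaf
            g.get(j).remove(leaf);
        }
        return seq;
    }

    /** Decode a Prüfer sequence into the tree's edge list. */
    static List<int[]> decode(int[] seq) {
        int n = seq.length + 2;
        int[] degree = new int[n];
        Arrays.fill(degree, 1);
        for (int a : seq) degree[a]++;                   // each appearance adds one
        List<int[]> edges = new ArrayList<>();
        for (int a : seq)
            for (int v = 0; v < n; v++)
                if (degree[v] == 1) {                    // smallest current leaf
                    edges.add(new int[]{v, a});
                    degree[v]--; degree[a]--;
                    break;
                }
        int u = -1;                                      // connect the last two
        for (int v = 0; v < n; v++)
            if (degree[v] == 1) { if (u < 0) u = v; else edges.add(new int[]{u, v}); }
        return edges;
    }
}
```

Note that the degree bookkeeping in decode automatically prevents a node from being picked as a leaf while its label still occurs later in the sequence, since such a node still has degree greater than one.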


B

Evolutionary Framework

This chapter presents the evolutionary framework which we developed for the project [1] and gives the reader a notion of how to extend it with his own ideas.

B.1

Overview

The framework is implemented entirely in Java, and utilizes some features introduced in Java 5.0 (Tiger), namely generics, the for-each loop and the binary search methods in java.util.Arrays. It is intended to be easily extended with additional tree data structures as well as new mutation, recombination and selection operators. With the introduction of adaptive parameter control, however, the framework became slightly more complex.

B.2

Class Interfaces

Ideally, if you wish to try out your own mutation operator with some new feature, you only need to implement one or more of the interfaces listed in this section. We have listed the existing implementations of the various interfaces so that the reader does not need to build the methods from scratch. Note that the names of the implementations have the prefixes Array, Prufer and Simple, referring to the type of data structure they are used with: ArrayTree (the one used by the faster algorithm, sec. 3.2.1), PruferTree and SimpleTree, respectively.

B.2.1

Data Structures

Tree

The Tree.java interface lists the methods required for any tree data structure.

getSize(); returns the number of leaves in the tree.
getSpeciesId(int i); returns the id of the i'th leaf in the tree.
traverse(E from_node, E to_node); returns a stack containing the ids of the edges on the path from from_node to to_node.

[1] Available at http://www.daimi.au.dk/~fischer/speciale/implementation/.


All known implementing classes:
- ArrayTree
- PruferTree
- SimpleTree

Individual

The Individual.java interface represents the actual individuals of the genetic algorithm. Note that there are a few more methods which are needed for adaptive parameter control.

compare(Individual I); compares the trees of the two individuals.
copy(); makes a copy of this individual.
getFitness(); evaluates this individual using the least squares method. For performance reasons, the fitness should not be recalculated unless the tree has been changed.
getTree(); returns the tree contained by this individual.
mutate(); mutates the individual.
recombine(Individual other); recombines the trees of the two individuals.
setMutator(Operator m); sets the mutate operator.
setRecombiner(Operator m); sets the recombine operator.

This interface is implemented by:
- ArrayTreeIndividual
- PruferIndividual
- RandomPruferIndividual
- SimpleTreeIndividual
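The caching requirement on getFitness() can be sketched like this. It is an illustrative standalone class, not one of the framework's Individual implementations, and the expensive least squares evaluation is replaced by a placeholder:

```java
// Sketch of the fitness-caching pattern: getFitness() only recomputes
// after the tree has been modified. Names are illustrative.
class CachedFitnessIndividual {
    private double fitness;
    private boolean dirty = true;   // tree changed since last evaluation
    private int treeState = 0;      // stand-in for the actual tree
    int evaluations = 0;            // exposed so the caching can be observed

    double getFitness() {
        if (dirty) {                // recompute only when necessary
            fitness = evaluate();
            dirty = false;
        }
        return fitness;
    }

    void mutate() {                 // any change to the tree invalidates the cache
        treeState++;
        dirty = true;
    }

    private double evaluate() {     // placeholder for the least squares computation
        evaluations++;
        return treeState;
    }
}
```

Since the least squares evaluation dominates the running time of the GA, avoiding redundant recomputation in this way matters for overall performance.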

B.2.2

IndividualFactory

The main function of classes implementing the IndividualFactory.java interface is to generate the initial population. The methods are:

getLS(int species, DistanceMatrix dm); returns the least squares method.
make(int species, thesis.LeastSquares ls); constructs a new individual.

Implementing classes of the interface:
- ArrayTreeFactory
- PruferFactory
- RandomPruferFactory
- SimpleTreeFactory
- SimpleTreeNjFactory


B.2.3

Operators

Mutator

The Mutator.java interface represents the mutation operator in the framework. It contains a single method:

mutate(Tree tree); performs the mutation of the specified tree.

Actual implementations of Mutator:
- ArrayMutateNNI
- ArrayMutateSubtree
- ArrayMutateSwap
- PruferDisplacement
- PruferInversion
- SimpleNNI
- SimpleSPR
- SimpleSwap
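A new mutation operator is written against these interfaces. The sketch below shows a leaf-swap mutation in the spirit of SimpleSwap; the Tree and Mutator interfaces here are minimal stand-ins so the example is self-contained, and swapLeaves is a hypothetical mutation hook, not a method of the framework's actual Tree interface.

```java
import java.util.Random;

// Minimal stand-ins for the framework's interfaces; swapLeaves is a
// hypothetical hook added for this illustration.
interface Tree { int getSize(); int getSpeciesId(int i); void swapLeaves(int i, int j); }
interface Mutator { void mutate(Tree tree); }

/** A leaf-swap mutation: exchange the positions of two random leaves. */
class SwapMutator implements Mutator {
    private final Random rng = new Random();
    public void mutate(Tree tree) {
        int n = tree.getSize();
        int i = rng.nextInt(n), j = rng.nextInt(n);
        while (j == i) j = rng.nextInt(n);   // ensure two distinct leaves
        tree.swapLeaves(i, j);
    }
}

/** A toy array-backed tree, just enough to exercise the mutator. */
class ToyTree implements Tree {
    private final int[] ids;
    ToyTree(int[] ids) { this.ids = ids.clone(); }
    public int getSize() { return ids.length; }
    public int getSpeciesId(int i) { return ids[i]; }
    public void swapLeaves(int i, int j) { int t = ids[i]; ids[i] = ids[j]; ids[j] = t; }
}
```

The point of the design is that a mutator like this can be plugged into any Individual via setMutator without the individual knowing which mutation it performs.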

Recombiner

The Recombiner.java is an interface for the recombination operator. It too requires only the implementation of a single method:

recombine(Tree thisTree, Tree otherTree); recombines the two trees.

This interface has been implemented in:
- SimplePDG
- ArrayPDG (contained in ArrayTree.java)

Selection

Selection.java is the interface for the selection operator.

selectNewGeneration(List oldgen, int popsize); takes the current population oldgen and the requested population size popsize, and returns the new population as a List.

Implementations:
- RankingSelection
- RouletteSelection
- SteadyStateSelection
- TournamentSelection
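As an example of the selection contract, tournament selection can be sketched as follows. This is a standalone illustration, not the framework's TournamentSelection: an individual is reduced to its fitness value, and lower is better since the GA minimizes the least squares score.

```java
import java.util.*;

// Sketch of tournament selection: to fill the new generation, repeatedly
// draw k random individuals (with replacement) and keep the fittest.
class Tournament {
    static List<Double> selectNewGeneration(List<Double> oldgen, int popsize,
                                            int k, Random rng) {
        List<Double> next = new ArrayList<>();
        while (next.size() < popsize) {
            double best = Double.POSITIVE_INFINITY;
            for (int t = 0; t < k; t++) {
                double cand = oldgen.get(rng.nextInt(oldgen.size()));
                if (cand < best) best = cand;    // lower least squares score wins
            }
            next.add(best);
        }
        return next;
    }
}
```

The tournament size k controls the selection pressure: k = 1 is uniform random selection, while large k makes the selection close to always picking the best individual.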


B.2.4

Parameter Control

ParameterControl

The interface ParameterControl lists the following methods:

feedBack(Feedback fb); associates a so-called Feedback object with the parameter control. This object reports the progress of the genetic algorithm, and can be extended with additional metrics.
getOperator(); returns the operator being controlled.
getRate(Individual tree); returns the current value of the controlled parameter.

Implementations:
- ChangeMutateTypeBasedOnGens
- ConstantRate
- ImpatientRule
- One_FifthRuleMutate
- One_FifthRuleSPRMutate
- SelfAdaptionCrossover
- SelfAdaptionMutate
- SteadyStateCrossover
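To give an idea of what such a controller does, the core of a "1/5 success" rule applied to a mutation rate can be sketched as below. This is one possible reading of Rechenberg's rule in standalone form; the framework's One_FifthRuleMutate may differ in its details, and the window size and damping factor here are illustrative choices.

```java
// Sketch of a "1/5 success" rule for a mutation rate: after a window of
// trials, raise the rate if more than one fifth of the mutations improved
// fitness, and lower it otherwise.
class OneFifthRule {
    private double rate;
    private final double factor;   // damping factor in (0, 1), e.g. 0.85
    private final int window;      // number of trials between adjustments
    private int trials = 0, successes = 0;

    OneFifthRule(double initialRate, double factor, int window) {
        this.rate = initialRate;
        this.factor = factor;
        this.window = window;
    }

    double getRate() { return rate; }

    /** Report whether one application of the operator improved the fitness. */
    void report(boolean improved) {
        trials++;
        if (improved) successes++;
        if (trials == window) {
            if (successes > window / 5.0) rate /= factor;      // succeeding often: be bolder
            else if (successes < window / 5.0) rate *= factor; // failing often: be careful
            trials = 0;
            successes = 0;
        }
    }
}
```

In the framework, the Feedback object plays the role of the report calls here, and getRate exposes the current value to the operator being controlled.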


C

Parameter Control Taxonomy

Figure C.1 shows the taxonomy for parameter setting used by Michalewicz and Fogel [14, p. 286] and Eiben et al. [5]. It differs from that suggested by Ursem [22] as presented in figure 6.1.

[Figure C.1 is a tree: "Parameter setting" branches into "Before the run" (parameter tuning) and "During the run" (parameter control); parameter control is further divided into deterministic, adaptive and self-adaptive.]

Figure C.1: Setting the parameters of a genetic algorithm can be done either before or during the evolutionary process.


D

Distance Matrices

To compare the various least squares methods, we have used the distance matrices included in this chapter. [1] The first matrices (tables D.2–D.8) have been found on-line [2], and contain actual evolutionary distances; the optimal least squares score is not known. The last matrix, in table D.9, has been created by our matrix generator and is additive; i.e. the neighbor joining method is guaranteed to find the optimal tree (see fig. D.1), and the least squares score of this tree is 0.0.

Table D.1 shows the results and running times of the neighbor joining method on the matrices included in this section. Note that the least squares scores listed here are known not to be optimal (except for the auto-generated matrix). In all cases our genetic algorithm has been able to improve the solutions found by the NJ method.

Matrix           Size   LS score    Time/s
A4_EXTRA          34    3.9835      1.24
AAL_decarboxy     37    2.8662      1.33
ABA_WDS           28    0.92360     0.729
ABG_transport     39    19.178      1.70
Adeno_E1A         41    68.597      2.01
Adeno_E1B_55K     28    18.469      0.707
Adeno_E3_14_5     31    0.41160     1.09
Generated         30    5.1986e-9   0.817

Table D.1: Results and time usage for the neighbor joining method.
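The least squares score reported in the table can be computed for a given tree roughly as follows. This is our own sketch, not the thesis implementation: the tree is an edge list of {node a, node b, branch length} triples with leaves numbered 0..n−1, and all-pairs path lengths are obtained with Floyd–Warshall, which is fine for matrices of the sizes used here.

```java
import java.util.*;

// Sketch of the (unweighted) least squares score: the sum over all species
// pairs (i, j) of (D_ij - d_ij)^2, where d_ij is the path length between
// leaves i and j in the tree. For an additive matrix the optimal tree
// scores 0.
class LeastSquaresScore {
    static double score(double[][] D, double[][] edges, int numNodes) {
        double[][] d = new double[numNodes][numNodes];
        for (double[] row : d) Arrays.fill(row, Double.POSITIVE_INFINITY);
        for (int v = 0; v < numNodes; v++) d[v][v] = 0;
        for (double[] e : edges) {
            int a = (int) e[0], b = (int) e[1];
            d[a][b] = d[b][a] = e[2];
        }
        // Floyd-Warshall all-pairs shortest paths over the tree
        for (int k = 0; k < numNodes; k++)
            for (int i = 0; i < numNodes; i++)
                for (int j = 0; j < numNodes; j++)
                    if (d[i][k] + d[k][j] < d[i][j]) d[i][j] = d[i][k] + d[k][j];
        int n = D.length;                       // leaves are nodes 0..n-1
        double s = 0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) {
                double err = D[i][j] - d[i][j];
                s += err * err;
            }
        return s;
    }
}
```

In a tree, the path between two nodes is unique, so any all-pairs shortest path routine returns exactly the tree distances that the score compares against the measured matrix.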

[1] Also available from http://www.daimi.au.dk/~fischer/speciale/distancematrices/.
[2] http://www.daimi.au.dk/~cstorm/courses/AiBTaS_e05/project2/distance-matrices/.


Table D.2: The A4_EXTRA distance matrix with 34 species. [The 34 × 34 matrix of pairwise distances is not reproduced here.]

Table D.3: The AAL_decarboxy distance matrix with 37 species. [The 37 × 37 matrix of pairwise distances is not reproduced here.]

0.27778 0.27778 0.27778 0.27778 0.27778 0.31481 0.31481 0.33333 0.38889 0.00000 0.18519 0.20370 0.24074 0.25926 0.26829 0.31481 0.24074 0.25926 0.29630 0.25926 0.33333 0.38889 0.35185 0.37037 0.44444 0.42593 0.40741 0.44444

0.27778 0.27778 0.25926 0.25926 0.25926 0.29630 0.29630 0.31481 0.40741 0.18519 0.00000 0.12963 0.20370 0.27778 0.21951 0.24074 0.25926 0.27778 0.25926 0.29630 0.33333 0.33333 0.38889 0.33333 0.38889 0.35185 0.44444 0.46296

0.20370 0.20370 0.20370 0.20370 0.20370 0.22222 0.25926 0.27778 0.33333 0.20370 0.12963 0.00000 0.16667 0.16667 0.21951 0.18519 0.20370 0.22222 0.24074 0.27778 0.27778 0.29630 0.35185 0.27778 0.33333 0.37037 0.40741 0.46296

0.31481 0.31481 0.25926 0.25926 0.25926 0.29630 0.29630 0.31481 0.37037 0.24074 0.20370 0.16667 0.00000 0.25926 0.19512 0.24074 0.20370 0.22222 0.24074 0.20370 0.27778 0.25926 0.31481 0.31481 0.40741 0.40741 0.38889 0.44444

0.22222 0.22222 0.16667 0.16667 0.16667 0.24074 0.25926 0.27778 0.35185 0.25926 0.27778 0.16667 0.25926 0.00000 0.31707 0.25926 0.27778 0.29630 0.31481 0.24074 0.27778 0.24074 0.24074 0.31481 0.37037 0.37037 0.40741 0.44444

0.24390 0.24390 0.21951 0.21951 0.21951 0.19512 0.19512 0.21951 0.26829 0.26829 0.21951 0.21951 0.19512 0.31707 0.00000 0.14634 0.17073 0.17073 0.21951 0.19512 0.24390 0.26829 0.36585 0.36585 0.43902 0.43902 0.43902 0.46341

0.22222 0.22222 0.18519 0.18519 0.18519 0.22222 0.20370 0.22222 0.31481 0.31481 0.24074 0.18519 0.24074 0.25926 0.14634 0.00000 0.16667 0.14815 0.18519 0.25926 0.29630 0.27778 0.25926 0.24074 0.35185 0.31481 0.46296 0.50000

0.27778 0.27778 0.24074 0.24074 0.24074 0.25926 0.25926 0.27778 0.35185 0.24074 0.25926 0.20370 0.20370 0.27778 0.17073 0.16667 0.00000 0.03704 0.22222 0.16667 0.20370 0.24074 0.25926 0.27778 0.40741 0.35185 0.40741 0.44444

0.25926 0.25926 0.22222 0.22222 0.22222 0.24074 0.24074 0.25926 0.33333 0.25926 0.27778 0.22222 0.22222 0.29630 0.17073 0.14815 0.03704 0.00000 0.24074 0.20370 0.24074 0.25926 0.25926 0.29630 0.40741 0.35185 0.42593 0.46296

0.24074 0.24074 0.25926 0.25926 0.25926 0.24074 0.24074 0.25926 0.37037 0.29630 0.25926 0.24074 0.24074 0.31481 0.21951 0.18519 0.22222 0.24074 0.00000 0.24074 0.27778 0.29630 0.33333 0.27778 0.40741 0.38889 0.44444 0.44444

0.31481 0.31481 0.27778 0.27778 0.27778 0.29630 0.29630 0.31481 0.37037 0.25926 0.29630 0.27778 0.20370 0.24074 0.19512 0.25926 0.16667 0.20370 0.24074 0.00000 0.11111 0.22222 0.27778 0.31481 0.44444 0.40741 0.29630 0.35185

0.31481 0.31481 0.31481 0.31481 0.31481 0.27778 0.31481 0.33333 0.33333 0.33333 0.33333 0.27778 0.27778 0.27778 0.24390 0.29630 0.20370 0.24074 0.27778 0.11111 0.00000 0.25926 0.33333 0.35185 0.38889 0.42593 0.33333 0.37037

0.35185 0.35185 0.25926 0.25926 0.25926 0.31481 0.31481 0.33333 0.46296 0.38889 0.33333 0.29630 0.25926 0.24074 0.26829 0.27778 0.24074 0.25926 0.29630 0.22222 0.25926 0.00000 0.33333 0.29630 0.48148 0.42593 0.44444 0.50000

0.33333 0.33333 0.27778 0.27778 0.27778 0.33333 0.31481 0.31481 0.37037 0.35185 0.38889 0.35185 0.31481 0.24074 0.36585 0.25926 0.25926 0.25926 0.33333 0.27778 0.33333 0.33333 0.00000 0.37037 0.42593 0.35185 0.46296 0.53704

0.29630 0.29630 0.31481 0.31481 0.31481 0.35185 0.35185 0.37037 0.46296 0.37037 0.33333 0.27778 0.31481 0.31481 0.36585 0.24074 0.27778 0.29630 0.27778 0.31481 0.35185 0.29630 0.37037 0.00000 0.48148 0.44444 0.48148 0.51852

0.38889 0.38889 0.33333 0.33333 0.33333 0.35185 0.37037 0.38889 0.37037 0.44444 0.38889 0.33333 0.40741 0.37037 0.43902 0.35185 0.40741 0.40741 0.40741 0.44444 0.38889 0.48148 0.42593 0.48148 0.00000 0.22222 0.55556 0.55556

0.35185 0.35185 0.27778 0.27778 0.27778 0.37037 0.35185 0.37037 0.42593 0.42593 0.35185 0.37037 0.40741 0.37037 0.43902 0.31481 0.35185 0.35185 0.38889 0.40741 0.42593 0.42593 0.35185 0.44444 0.22222 0.00000 0.53704 0.53704

0.40741 0.40741 0.48148 0.48148 0.48148 0.44444 0.48148 0.50000 0.51852 0.40741 0.44444 0.40741 0.38889 0.40741 0.43902 0.46296 0.40741 0.42593 0.44444 0.29630 0.33333 0.44444 0.46296 0.48148 0.55556 0.53704 0.00000 0.24074

0.42593 0.42593 0.50000 0.50000 0.50000 0.42593 0.46296 0.48148 0.50000 0.44444 0.46296 0.46296 0.44444 0.44444 0.46341 0.50000 0.44444 0.46296 0.44444 0.35185 0.37037 0.50000 0.53704 0.51852 0.55556 0.53704 0.24074 0.00000


Q41300/176 Q8H0L9/181 Q9ZRB6/24O82575/25ASR1_LYCES Q9ZRB5/25Q40165/27ASR2_LYCES ASR3_LYCES Q41096/24Q41095/47O24314/62Q41087/40Q9FRV9/162 Q9LUX0/1-4 Q94G23/63Q93WZ6/107 O50000/114 O65176/57Q948L3/55O49149/51Q9SEW0/56Q93VD6/29Q39537/10Q8S2D0/19Q8S2C9/24Q9FRX0/41Q94DL7/89-

Table D.4: The ABA_WDS distance matrix with 28 species.

Q9RNX2/11Q9RNX1/11Q9JY63/11Q9JT47/11Q9CLU9/12Q9KTM2/25Q8DBH6/19Q87MC7/19Q9KC22/13Q8ESC8/11Q8XP93/13Q895F3/19Q8RG48/11Q8EMQ1/13Q8ELE5/9-5 Q8ENA6/3-4 Q9KEI6/19Q99RG0/14Q8NUW5/14Q93KC7/1-1 Q8RFU3/17Q891Q3/21Q9KTN8/19Q87RW4/19Q8DF75/2-4 Q8EKD5/12Q8ZH97/13Q8CKJ9/1-4 Q9AA61/39Q9A7U5/48Q8X8P0/14Q8X8N7/1-2 ABGT_ECOLI Q8FHQ4/34Q897Y6/12Q8A9R4/1-4 Q8NU46/27Q8FUB5/53Q8DF74/3-7

0.00000 0.00786 0.02750 0.02161 0.17878 0.31496 0.33268 0.33661 0.49597 0.49798 0.48406 0.47896 0.47992 0.49495 0.56478 0.55984 0.51116 0.61771 0.61972 0.51685 0.60532 0.56680 0.54819 0.54016 0.52038 0.56426 0.62777 0.62680 0.60236 0.59319 0.59761 0.60699 0.60121 0.65517 0.64471 0.60297 0.62777 0.63052 0.64935

0.00786 0.00000 0.02750 0.02161 0.18468 0.31693 0.33465 0.33661 0.50000 0.50000 0.48606 0.47896 0.48394 0.49697 0.56275 0.55984 0.50710 0.62374 0.62575 0.51685 0.60736 0.56883 0.54819 0.54217 0.52278 0.56426 0.62978 0.62887 0.60433 0.59519 0.60159 0.60699 0.60324 0.65517 0.64671 0.60510 0.62777 0.63052 0.64935

0.02750 0.02750 0.00000 0.00982 0.17485 0.31693 0.33858 0.34055 0.50000 0.49190 0.48805 0.47695 0.47791 0.49091 0.56275 0.55781 0.50304 0.61368 0.61569 0.51685 0.60532 0.57490 0.54418 0.53815 0.51799 0.56827 0.63179 0.63093 0.60630 0.59719 0.60956 0.60699 0.60729 0.65517 0.64271 0.60722 0.62777 0.62651 0.64935

0.02161 0.02161 0.00982 0.00000 0.17485 0.31299 0.33465 0.33858 0.49798 0.49393 0.48805 0.47695 0.48193 0.49091 0.56275 0.56187 0.50507 0.62173 0.62173 0.51685 0.60327 0.57085 0.54418 0.53815 0.51799 0.56426 0.62777 0.62680 0.60433 0.59519 0.60159 0.60699 0.60324 0.65517 0.64471 0.60510 0.62374 0.62651 0.64935

0.17878 0.18468 0.17485 0.17485 0.00000 0.31693 0.32874 0.32480 0.48185 0.47976 0.49402 0.48297 0.47590 0.49091 0.57287 0.56592 0.49696 0.61569 0.61569 0.51124 0.60123 0.56883 0.54819 0.54618 0.52518 0.55422 0.60563 0.60412 0.59843 0.59920 0.60159 0.58515 0.59312 0.62069 0.64870 0.60085 0.62978 0.63253 0.64935

0.31496 0.31693 0.31693 0.31299 0.31693 0.00000 0.10236 0.10827 0.48485 0.49696 0.47904 0.46988 0.47189 0.50505 0.55984 0.54970 0.51829 0.62097 0.62500 0.51685 0.61885 0.57606 0.55533 0.55533 0.54087 0.55734 0.63508 0.63430 0.58974 0.59639 0.59363 0.63158 0.61258 0.62069 0.64471 0.61996 0.60282 0.61569 0.67532

0.33268 0.33465 0.33858 0.33465 0.32874 0.10236 0.00000 0.05906 0.47677 0.48479 0.47305 0.46386 0.45984 0.50505 0.55172 0.56389 0.51220 0.61694 0.61895 0.50562 0.60656 0.56795 0.55131 0.56137 0.54808 0.55131 0.63306 0.63223 0.59369 0.59237 0.62151 0.61842 0.61460 0.62069 0.65070 0.62633 0.60081 0.61167 0.70130

0.33661 0.33661 0.34055 0.33858 0.32480 0.10827 0.05906 0.00000 0.48485 0.49087 0.46906 0.46586 0.46988 0.51111 0.54970 0.55984 0.51829 0.62500 0.62702 0.51124 0.61066 0.57606 0.55936 0.55332 0.54327 0.56137 0.62298 0.62190 0.59566 0.59237 0.62151 0.60526 0.61258 0.62069 0.64072 0.62633 0.60887 0.61972 0.70130

0.49597 0.50000 0.50000 0.49798 0.48185 0.48485 0.47677 0.48485 0.00000 0.27789 0.47059 0.47047 0.45175 0.48980 0.53593 0.53265 0.49590 0.58248 0.58452 0.47753 0.58091 0.54098 0.57085 0.58621 0.57488 0.55936 0.60976 0.61250 0.60081 0.61789 0.58964 0.62882 0.61460 0.68966 0.62805 0.63848 0.59838 0.61179 0.64000

0.49798 0.50000 0.49190 0.49393 0.47976 0.49696 0.48479 0.49087 0.27789 0.00000 0.45010 0.46856 0.45603 0.47325 0.53061 0.53484 0.48163 0.59714 0.59714 0.47191 0.59544 0.53086 0.53252 0.54379 0.52913 0.54970 0.63878 0.64017 0.58788 0.59267 0.60643 0.60262 0.61100 0.55172 0.62423 0.64316 0.63340 0.62729 0.62667

0.48406 0.48606 0.48805 0.48805 0.49402 0.47904 0.47305 0.46906 0.47059 0.45010 0.00000 0.38462 0.43002 0.47695 0.54268 0.55488 0.54158 0.60282 0.60685 0.50000 0.58932 0.53550 0.54728 0.54141 0.51319 0.54767 0.62955 0.63485 0.60277 0.60324 0.61446 0.62009 0.61382 0.65517 0.58384 0.58635 0.63911 0.64718 0.70270

0.47896 0.47896 0.47695 0.47695 0.48297 0.46988 0.46386 0.46586 0.47047 0.46856 0.38462 0.00000 0.41010 0.48262 0.53347 0.52761 0.51724 0.63083 0.63286 0.47191 0.57495 0.53061 0.54251 0.53846 0.52058 0.56680 0.61866 0.61954 0.59000 0.58586 0.57661 0.61572 0.60081 0.65517 0.58418 0.59013 0.62398 0.62880 0.67532

0.47992 0.48394 0.47791 0.48193 0.47590 0.47189 0.45984 0.46988 0.45175 0.45603 0.43002 0.41010 0.00000 0.47228 0.53689 0.53086 0.49898 0.60041 0.60041 0.46067 0.58299 0.52058 0.54898 0.55918 0.54768 0.55828 0.64213 0.64361 0.61723 0.61020 0.55242 0.62667 0.58932 0.65517 0.60243 0.60259 0.58522 0.59426 0.62338

0.49495 0.49697 0.49091 0.49091 0.49091 0.50505 0.50505 0.51111 0.48980 0.47325 0.47695 0.48262 0.47228 0.00000 0.54825 0.56531 0.52254 0.60692 0.60692 0.53371 0.59959 0.56762 0.55488 0.55918 0.55072 0.56735 0.64008 0.64509 0.60800 0.60327 0.61134 0.57018 0.58896 0.62069 0.61303 0.59915 0.63136 0.62525 0.68056

0.56478 0.56275 0.56275 0.56275 0.57287 0.55984 0.55172 0.54970 0.53593 0.53061 0.54268 0.53347 0.53689 0.54825 0.00000 0.41718 0.52546 0.58656 0.58859 0.52809 0.59381 0.56352 0.60489 0.59509 0.60097 0.61145 0.65779 0.65756 0.62626 0.63008 0.53036 0.55702 0.55123 0.51724 0.59754 0.65011 0.57927 0.58097 0.59459

0.55984 0.55984 0.55781 0.56187 0.56592 0.54970 0.56389 0.55984 0.53265 0.53484 0.55488 0.52761 0.53086 0.56531 0.41718 0.00000 0.54303 0.59592 0.60408 0.48315 0.62241 0.56879 0.61303 0.61020 0.61800 0.61837 0.66258 0.66876 0.63765 0.64969 0.56800 0.57269 0.57347 0.62069 0.60327 0.64454 0.56823 0.56504 0.56000

0.51116 0.50710 0.50304 0.50507 0.49696 0.51829 0.51220 0.51829 0.49590 0.48163 0.54158 0.51724 0.49898 0.52254 0.52546 0.54303 0.00000 0.52546 0.52546 0.43820 0.57438 0.52869 0.58130 0.58571 0.58981 0.57377 0.64622 0.64780 0.60852 0.59714 0.61382 0.60699 0.61759 0.51724 0.61934 0.60991 0.61303 0.62118 0.60811

0.61771 0.62374 0.61368 0.62173 0.61569 0.62097 0.61694 0.62500 0.58248 0.59714 0.60282 0.63083 0.60041 0.60692 0.58656 0.59592 0.52546 0.00000 0.00803 0.37640 0.58932 0.57520 0.63911 0.64372 0.63221 0.63692 0.67951 0.68399 0.68072 0.67611 0.64777 0.61233 0.64082 0.68966 0.64766 0.65525 0.64242 0.65859 0.70270

0.61972 0.62575 0.61569 0.62173 0.61569 0.62500 0.61895 0.62702 0.58452 0.59714 0.60685 0.63286 0.60041 0.60692 0.58859 0.60408 0.52546 0.00803 0.00000 0.37640 0.58932 0.57520 0.63911 0.64575 0.63221 0.63895 0.67951 0.68399 0.68072 0.67611 0.64372 0.61233 0.63673 0.68966 0.64969 0.65953 0.64444 0.65859 0.70270

0.51685 0.51685 0.51685 0.51685 0.51124 0.51685 0.50562 0.51124 0.47753 0.47191 0.50000 0.47191 0.46067 0.53371 0.52809 0.48315 0.43820 0.37640 0.37640 0.00000 0.45930 0.47191 0.57303 0.57865 0.57865 0.53371 0.67232 0.67232 0.58989 0.58427 0.80000 0.57303 0.57303 0.62069 0.60674 0.55814 0.61236 0.61236 2.00000

0.60532 0.60736 0.60532 0.60327 0.60123 0.61885 0.60656 0.61066 0.58091 0.59544 0.58932 0.57495 0.58299 0.59959 0.59381 0.62241 0.57438 0.58932 0.58932 0.45930 0.00000 0.50924 0.64271 0.63505 0.62408 0.61649 0.68802 0.69068 0.66531 0.65844 0.69106 0.64574 0.66874 0.69565 0.65010 0.66739 0.69547 0.69610 0.67568

0.56680 0.56883 0.57490 0.57085 0.56883 0.57606 0.56795 0.57606 0.54098 0.53086 0.53550 0.53061 0.52058 0.56762 0.56352 0.56879 0.52869 0.57520 0.57520 0.47191 0.50924 0.00000 0.59432 0.57434 0.57349 0.60408 0.64694 0.64854 0.64040 0.60489 0.64344 0.60262 0.63039 0.68966 0.62295 0.65733 0.64358 0.64634 0.65278

0.54819 0.54819 0.54418 0.54418 0.54819 0.55533 0.55131 0.55936 0.57085 0.53252 0.54728 0.54251 0.54898 0.55488 0.60489 0.61303 0.58130 0.63911 0.63911 0.57303 0.64271 0.59432 0.00000 0.10685 0.09569 0.41212 0.51919 0.52174 0.62124 0.61010 0.63454 0.65939 0.65650 0.72414 0.66057 0.66951 0.65657 0.65657 0.10811

0.54016 0.54217 0.53815 0.53815 0.54618 0.55533 0.56137 0.55332 0.58621 0.54379 0.54141 0.53846 0.55918 0.55918 0.59509 0.61020 0.58571 0.64372 0.64575 0.57865 0.63505 0.57434 0.10685 0.00000 0.05502 0.43232 0.51807 0.52058 0.62525 0.60643 0.62500 0.67249 0.65784 0.75862 0.65650 0.67521 0.65923 0.66329 0.10390

0.52038 0.52278 0.51799 0.51799 0.52518 0.54087 0.54808 0.54327 0.57488 0.52913 0.51319 0.52058 0.54768 0.55072 0.60097 0.61800 0.58981 0.63221 0.63221 0.57865 0.62408 0.57349 0.09569 0.05502 0.00000 0.41587 0.50600 0.50600 0.60287 0.59952 0.63529 0.65939 0.65860 0.68966 0.65937 0.66915 0.67229 0.66988 2.00000

0.56426 0.56426 0.56827 0.56426 0.55422 0.55734 0.55131 0.56137 0.55936 0.54970 0.54767 0.56680 0.55828 0.56735 0.61145 0.61837 0.57377 0.63692 0.63895 0.53371 0.61649 0.60408 0.41212 0.43232 0.41587 0.00000 0.55870 0.55809 0.62725 0.61290 0.67200 0.69432 0.67951 0.65517 0.64777 0.66737 0.61943 0.62828 0.50667

0.62777 0.62978 0.63179 0.62777 0.60563 0.63508 0.63306 0.62298 0.60976 0.63878 0.62955 0.61866 0.64213 0.64008 0.65779 0.66258 0.64622 0.67951 0.67951 0.67232 0.68802 0.64694 0.51919 0.51807 0.50600 0.55870 0.00000 0.00000 0.62651 0.65996 0.66129 0.67105 0.66939 0.72414 0.67006 0.69593 0.70161 0.70303 0.57143

0.62680 0.62887 0.63093 0.62680 0.60412 0.63430 0.63223 0.62190 0.61250 0.64017 0.63485 0.61954 0.64361 0.64509 0.65756 0.66876 0.64780 0.68399 0.68399 0.67232 0.69068 0.64854 0.52174 0.52058 0.50600 0.55809 0.00000 0.00000 0.62963 0.66392 0.66525 0.67105 0.67155 0.72414 0.66806 0.69593 0.70455 0.70600 0.60000

0.60236 0.60433 0.60630 0.60433 0.59843 0.58974 0.59369 0.59566 0.60081 0.58788 0.60277 0.59000 0.61723 0.60800 0.62626 0.63765 0.60852 0.68072 0.68072 0.58989 0.66531 0.64040 0.62124 0.62525 0.60287 0.62725 0.62651 0.62963 0.00000 0.40918 0.60956 0.66376 0.63765 0.65517 0.68924 0.69002 0.65392 0.65531 0.71429

0.59319 0.59519 0.59719 0.59519 0.59920 0.59639 0.59237 0.59237 0.61789 0.59267 0.60324 0.58586 0.61020 0.60327 0.63008 0.64969 0.59714 0.67611 0.67611 0.58427 0.65844 0.60489 0.61010 0.60643 0.59952 0.61290 0.65996 0.66392 0.40918 0.00000 0.63306 0.64912 0.63469 0.68966 0.71400 0.70664 0.67004 0.66734 0.61039

0.59761 0.60159 0.60956 0.60159 0.60159 0.59363 0.62151 0.62151 0.58964 0.60643 0.61446 0.57661 0.55242 0.61134 0.53036 0.56800 0.61382 0.64777 0.64372 0.80000 0.69106 0.64344 0.63454 0.62500 0.63529 0.67200 0.66129 0.66525 0.60956 0.63306 0.00000 0.69231 0.03586 1.00000 0.63306 0.69620 0.64516 0.65726 0.60811

0.60699 0.60699 0.60699 0.60699 0.58515 0.63158 0.61842 0.60526 0.62882 0.60262 0.62009 0.61572 0.62667 0.57018 0.55702 0.57269 0.60699 0.61233 0.61233 0.57303 0.64574 0.60262 0.65939 0.67249 0.65939 0.69432 0.67105 0.67105 0.66376 0.64912 0.69231 0.00000 0.00437 0.03448 0.62281 0.68950 0.59211 0.60526 2.00000

0.60121 0.60324 0.60729 0.60324 0.59312 0.61258 0.61460 0.61258 0.61460 0.61100 0.61382 0.60081 0.58932 0.58896 0.55123 0.57347 0.61759 0.64082 0.63673 0.57303 0.66874 0.63039 0.65650 0.65784 0.65860 0.67951 0.66939 0.67155 0.63765 0.63469 0.03586 0.00437 0.00000 0.03448 0.62781 0.68657 0.62449 0.63673 0.60811

0.65517 0.65517 0.65517 0.65517 0.62069 0.62069 0.62069 0.62069 0.68966 0.55172 0.65517 0.65517 0.65517 0.62069 0.51724 0.62069 0.51724 0.68966 0.68966 0.62069 0.69565 0.68966 0.72414 0.75862 0.68966 0.65517 0.72414 0.72414 0.65517 0.68966 1.00000 0.03448 0.03448 0.00000 0.65517 0.56522 0.65517 0.65517 2.00000

0.64471 0.64671 0.64271 0.64471 0.64870 0.64471 0.65070 0.64072 0.62805 0.62423 0.58384 0.58418 0.60243 0.61303 0.59754 0.60327 0.61934 0.64766 0.64969 0.60674 0.65010 0.62295 0.66057 0.65650 0.65937 0.64777 0.67006 0.66806 0.68924 0.71400 0.63306 0.62281 0.62781 0.65517 0.00000 0.67521 0.66735 0.67006 0.64935

0.60297 0.60510 0.60722 0.60510 0.60085 0.61996 0.62633 0.62633 0.63848 0.64316 0.58635 0.59013 0.60259 0.59915 0.65011 0.64454 0.60991 0.65525 0.65953 0.55814 0.66739 0.65733 0.66951 0.67521 0.66915 0.66737 0.69593 0.69593 0.69002 0.70664 0.69620 0.68950 0.68657 0.56522 0.67521 0.00000 0.69510 0.70513 0.72581

0.62777 0.62777 0.62777 0.62374 0.62978 0.60282 0.60081 0.60887 0.59838 0.63340 0.63911 0.62398 0.58522 0.63136 0.57927 0.56823 0.61303 0.64242 0.64444 0.61236 0.69547 0.64358 0.65657 0.65923 0.67229 0.61943 0.70161 0.70455 0.65392 0.67004 0.64516 0.59211 0.62449 0.65517 0.66735 0.69510 0.00000 0.13690 0.62162

0.63052 0.63052 0.62651 0.62651 0.63253 0.61569 0.61167 0.61972 0.61179 0.62729 0.64718 0.62880 0.59426 0.62525 0.58097 0.56504 0.62118 0.65859 0.65859 0.61236 0.69610 0.64634 0.65657 0.66329 0.66988 0.62828 0.70303 0.70600 0.65531 0.66734 0.65726 0.60526 0.63673 0.65517 0.67006 0.70513 0.13690 0.00000 0.63514

0.64935 0.64935 0.64935 0.64935 0.64935 0.67532 0.70130 0.70130 0.64000 0.62667 0.70270 0.67532 0.62338 0.68056 0.59459 0.56000 0.60811 0.70270 0.70270 2.00000 0.67568 0.65278 0.10811 0.10390 2.00000 0.50667 0.57143 0.60000 0.71429 0.61039 0.60811 2.00000 0.60811 2.00000 0.64935 0.72581 0.62162 0.63514 0.00000

Table D.5: The ABG_transport distance matrix with 39 species.
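The matrices in this appendix are stored as plain whitespace-separated rows, one row per species, with a zero diagonal and symmetric entries. As a minimal sketch (not part of our implementation), such a matrix could be parsed and sanity-checked before being passed to a distance-based method like neighbor joining or least squares:

```python
def parse_distance_matrix(text):
    """Parse a whitespace-separated square distance matrix into a list of rows."""
    rows = [[float(x) for x in line.split()]
            for line in text.strip().splitlines() if line.strip()]
    n = len(rows)
    if any(len(r) != n for r in rows):
        raise ValueError("matrix is not square")
    return rows

def is_valid_distance_matrix(d, tol=1e-9):
    """Check the two basic properties required of the input matrices:
    zero diagonal and symmetry."""
    n = len(d)
    zero_diag = all(abs(d[i][i]) < tol for i in range(n))
    symmetric = all(abs(d[i][j] - d[j][i]) < tol
                    for i in range(n) for j in range(n))
    return zero_diag and symmetric
```

For example, `is_valid_distance_matrix(parse_distance_matrix("0 0.2\n0.2 0"))` returns `True`, while a matrix with mismatched off-diagonal entries is rejected.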


Q8UY94/1-2 Q8UY93/1-2 Q8UY92/1-2 Q64836/1-1 Q64835/1-2 E1A_ADE04/ Q8B8U9/1-2 Q8B8V0/1-2 Q8JSK3/1-2 Q8JSK4/1-2 Q8JSK2/1-2 Q8JSK1/1-2 E1A_ADE07/ E1A6_ADE07 Q9YLA0/1-1 Q9YLA1/1-1 Q9YLA2/1-2 E1A_ADE02/ Q67788/1-2 E1A_ADE05/ E1A6_ADE05 E1A6_ADE02 Q64874/1-2 Q8B6X6/1-2 E1A_ADES7/ Q64875/1-1 Q8B664/66E1A_ADE12/ Q64824/1-2 Q64881/1-2 Q89505/1-1 E1A_ADE40/ E1A_ADE41/ Q994F4/1-1 Q9EEG3/1-1 Q9W8M7/29Q9W8P4/128 Q9IGU0/128 E118_ADET1 Q64842/66E1A_ADECT/

0.00000 0.00000 0.00000 0.09424 0.09292 0.08560 0.42152 0.40945 0.38739 0.37945 0.36000 0.36400 0.37549 0.36000 0.52151 0.50769 0.56327 0.56485 0.60104 0.56485 0.77778 0.77778 0.56327 0.56735 0.56735 0.57609 0.53571 0.59109 0.60664 0.58996 0.58915 0.58996 0.59130 0.59420 0.65493 0.69136 0.51163 0.51220 0.62416 0.64138 0.71958

0.00000 0.00000 0.00000 0.10625 0.09292 0.09292 0.42152 0.42152 0.38739 0.38739 0.36000 0.36986 0.38288 0.36000 0.51913 0.50394 0.56075 0.58173 0.60104 0.58173 0.77778 0.77778 0.57009 0.57009 0.57009 0.57923 0.60000 0.61574 0.60577 0.60577 0.59375 0.60577 0.60804 0.62037 0.67857 0.71756 0.38462 0.38462 0.66102 0.67544 0.73333

0.00000 0.00000 0.00000 0.08000 0.08000 0.08000 0.32000 0.32000 0.36000 0.36000 0.36000 0.36000 0.36000 0.36000 0.58333 0.36364 0.58333 0.79167 0.79167 0.79167 0.79167 0.79167 0.79167 0.79167 0.79167 0.79167 2.00000 0.73913 0.80000 0.80000 0.80000 0.80000 0.80000 0.65000 0.73913 0.90476 2.00000 2.00000 0.85000 0.81818 0.83333

0.09424 0.10625 0.08000 0.00000 0.00625 0.00524 0.48101 0.45503 0.41772 0.40212 0.36000 0.38172 0.39153 0.36000 0.55738 0.56061 0.59669 0.58286 0.64341 0.58286 0.77778 0.77778 0.58333 0.58889 0.58889 0.60504 0.53571 0.60656 0.63699 0.60920 0.60938 0.60920 0.61404 0.59542 0.65957 0.71642 0.51163 0.51220 0.62712 0.63309 0.69504

0.09292 0.09292 0.08000 0.00625 0.00000 0.00000 0.45740 0.45740 0.41892 0.41892 0.36000 0.40183 0.40991 0.36000 0.52459 0.51181 0.56075 0.59135 0.61140 0.59135 0.77778 0.77778 0.59813 0.59813 0.59813 0.60656 0.60000 0.63889 0.62500 0.62500 0.60938 0.62500 0.62814 0.63889 0.68750 0.74046 0.38462 0.38462 0.65254 0.65789 0.72121

0.08560 0.09292 0.08000 0.00524 0.00000 0.00000 0.45740 0.44094 0.41892 0.40711 0.36000 0.39200 0.39921 0.36000 0.52688 0.51538 0.56327 0.57322 0.61140 0.57322 0.77778 0.77778 0.58776 0.59184 0.59184 0.60326 0.53571 0.61134 0.62559 0.60669 0.60465 0.60669 0.60870 0.60870 0.66197 0.70988 0.51163 0.51220 0.61745 0.62759 0.70899

0.42152 0.42152 0.32000 0.48101 0.45740 0.45740 0.00000 0.00000 0.23581 0.23581 0.20000 0.22222 0.23144 0.16000 0.51648 0.49606 0.56481 0.56190 0.57949 0.56190 0.82222 0.82222 0.56422 0.55505 0.55505 0.57609 0.56000 0.58824 0.64734 0.64734 0.65354 0.64251 0.63682 0.62617 0.62500 0.72308 0.30769 0.30769 0.71552 0.67257 0.71341

0.40945 0.42152 0.32000 0.45503 0.45740 0.44094 0.00000 0.00000 0.23581 0.21923 0.20000 0.20703 0.21538 0.16000 0.51892 0.50000 0.56680 0.54357 0.57949 0.54357 0.82222 0.82222 0.54217 0.53815 0.53815 0.57297 0.51786 0.56349 0.64762 0.62605 0.64844 0.62185 0.61638 0.60584 0.61268 0.68944 0.46512 0.46341 0.66667 0.63194 0.71277

0.38739 0.38739 0.36000 0.41772 0.41892 0.41892 0.23581 0.23581 0.00000 0.00000 0.00000 0.03097 0.01739 0.04000 0.47802 0.45669 0.53704 0.55714 0.57436 0.55714 0.75556 0.75556 0.56221 0.55300 0.55300 0.57377 0.64000 0.61644 0.65049 0.65049 0.64286 0.64563 0.61500 0.63551 0.62500 0.72093 0.30769 0.30769 0.70435 0.66964 0.65244

0.37945 0.38739 0.36000 0.40212 0.41892 0.40711 0.23581 0.21923 0.00000 0.00000 0.00000 0.02724 0.01533 0.04000 0.48108 0.46154 0.54251 0.54357 0.57436 0.54357 0.75556 0.75556 0.54435 0.54032 0.54032 0.57065 0.55357 0.58400 0.65072 0.62869 0.63780 0.62447 0.59740 0.61314 0.61268 0.68125 0.44186 0.46341 0.65753 0.62937 0.65957

0.36000 0.36000 0.36000 0.36000 0.36000 0.36000 0.20000 0.20000 0.00000 0.00000 0.00000 0.04000 0.04000 0.04000 0.54167 0.36364 0.54167 0.70833 0.70833 0.70833 0.70833 0.70833 0.79167 0.79167 0.79167 0.79167 2.00000 0.69565 0.85000 0.85000 0.85000 0.85000 0.85000 0.70000 0.69565 0.95238 2.00000 2.00000 0.80000 0.81818 0.83333

0.36400 0.36986 0.36000 0.38172 0.40183 0.39200 0.22222 0.20703 0.03097 0.02724 0.04000 0.00000 0.01946 0.00000 0.47568 0.46923 0.53846 0.53750 0.56701 0.53750 0.77778 0.77778 0.54251 0.54251 0.54251 0.56831 0.55357 0.58065 0.64593 0.62447 0.62992 0.62025 0.59740 0.60448 0.61268 0.68553 0.44186 0.46341 0.65068 0.62238 0.65957

0.37549 0.38288 0.36000 0.39153 0.40991 0.39921 0.23144 0.21538 0.01739 0.01533 0.04000 0.01946 0.00000 0.00000 0.48108 0.46154 0.54251 0.54772 0.57949 0.54772 0.77778 0.77778 0.54435 0.54032 0.54032 0.57065 0.55357 0.58400 0.65072 0.62869 0.63780 0.62447 0.60173 0.61314 0.61268 0.67500 0.44186 0.46341 0.66438 0.62238 0.65957

0.36000 0.36000 0.36000 0.36000 0.36000 0.36000 0.16000 0.16000 0.04000 0.04000 0.04000 0.00000 0.00000 0.00000 0.54167 0.36364 0.54167 0.75000 0.75000 0.75000 0.75000 0.75000 0.79167 0.79167 0.79167 0.79167 2.00000 0.69565 0.85000 0.85000 0.85000 0.85000 0.85000 0.70000 0.69565 0.95238 2.00000 2.00000 0.80000 0.81818 0.83333

0.52151 0.51913 0.58333 0.55738 0.52459 0.52688 0.51648 0.51892 0.47802 0.48108 0.54167 0.47568 0.48108 0.54167 0.00000 0.00000 0.01058 0.55978 0.55249 0.55978 0.68182 0.68182 0.61413 0.61413 0.61413 0.60989 0.66667 0.60870 0.61364 0.61453 0.61719 0.62011 0.60819 0.68605 0.64835 0.76238 1.00000 1.00000 0.68269 0.68687 0.69014

0.50769 0.50394 0.36364 0.56061 0.51181 0.51538 0.49606 0.50000 0.45669 0.46154 0.36364 0.46923 0.46154 0.36364 0.00000 0.00000 0.01504 0.57692 0.56693 0.57692 0.72727 0.72727 0.56818 0.56818 0.56818 0.56154 0.66667 0.60305 0.56557 0.56800 0.55128 0.57600 0.54701 0.67391 0.62500 0.71831 1.00000 1.00000 0.73438 0.62963 0.70874

0.56327 0.56075 0.58333 0.59669 0.56075 0.56327 0.56481 0.56680 0.53704 0.54251 0.54167 0.53846 0.54251 0.54167 0.01058 0.01504 0.00000 0.56087 0.55435 0.56087 0.68182 0.68182 0.63415 0.63415 0.63415 0.60989 0.62500 0.62602 0.64286 0.62185 0.61719 0.62605 0.60776 0.68702 0.65248 0.73248 0.60465 0.60976 0.63946 0.67361 0.68984

0.56485 0.58173 0.79167 0.58286 0.59135 0.57322 0.56190 0.54357 0.55714 0.54357 0.70833 0.53750 0.54772 0.75000 0.55978 0.57692 0.56087 0.00000 0.00823 0.01038 0.29545 0.29545 0.57872 0.58298 0.58298 0.61497 0.39130 0.54008 0.65990 0.64000 0.66406 0.64000 0.62727 0.61029 0.58647 0.60625 0.41860 0.43902 0.60544 0.65957 0.67172

0.60104 0.60104 0.79167 0.64341 0.61140 0.61140 0.57949 0.57949 0.57436 0.57436 0.70833 0.56701 0.57949 0.75000 0.55249 0.56693 0.55435 0.00823 0.00000 0.02058 0.29545 0.29545 0.62434 0.62434 0.62434 0.62366 2.00000 0.57068 0.68156 0.68156 0.66929 0.68156 0.66667 0.68132 0.68478 0.67544 2.00000 2.00000 0.71287 0.72632 0.67284

0.56485 0.58173 0.79167 0.58286 0.59135 0.57322 0.56190 0.54357 0.55714 0.54357 0.70833 0.53750 0.54772 0.75000 0.55978 0.57692 0.56087 0.01038 0.02058 0.00000 0.29545 0.29545 0.57872 0.58298 0.58298 0.61497 0.39130 0.54008 0.65990 0.64000 0.66406 0.64000 0.62727 0.61029 0.58647 0.60625 0.41860 0.43902 0.60544 0.65957 0.67677

0.77778 0.77778 0.79167 0.77778 0.77778 0.77778 0.82222 0.82222 0.75556 0.75556 0.70833 0.77778 0.77778 0.75000 0.68182 0.72727 0.68182 0.29545 0.29545 0.29545 0.00000 0.00000 0.81818 0.81818 0.81818 0.81818 2.00000 0.79545 0.78049 0.78049 0.78049 0.78049 0.82927 0.81579 0.69048 0.72414 2.00000 2.00000 0.71429 0.91176 0.78049

0.77778 0.77778 0.79167 0.77778 0.77778 0.77778 0.82222 0.82222 0.75556 0.75556 0.70833 0.77778 0.77778 0.75000 0.68182 0.72727 0.68182 0.29545 0.29545 0.29545 0.00000 0.00000 0.81818 0.81818 0.81818 0.81818 2.00000 0.79545 0.78049 0.78049 0.78049 0.78049 0.82927 0.81579 0.69048 0.72414 2.00000 2.00000 0.71429 0.91176 0.78049

0.56327 0.57009 0.79167 0.58333 0.59813 0.58776 0.56422 0.54217 0.56221 0.54435 0.79167 0.54251 0.54435 0.79167 0.61413 0.56818 0.63415 0.57872 0.62434 0.57872 0.81818 0.81818 0.00000 0.03774 0.04528 0.00000 0.50000 0.53876 0.62441 0.60581 0.62205 0.60166 0.61345 0.63235 0.64789 0.68072 0.44186 0.43902 0.61333 0.61429 0.68394

0.56735 0.57009 0.79167 0.58889 0.59813 0.59184 0.55505 0.53815 0.55300 0.54032 0.79167 0.54251 0.54032 0.79167 0.61413 0.56818 0.63415 0.58298 0.62434 0.58298 0.81818 0.81818 0.03774 0.00000 0.01132 0.04167 0.50000 0.54264 0.62441 0.60166 0.61417 0.59751 0.60924 0.63971 0.65493 0.68072 0.44186 0.43902 0.61333 0.61429 0.68912

0.56735 0.57009 0.79167 0.58889 0.59813 0.59184 0.55505 0.53815 0.55300 0.54032 0.79167 0.54251 0.54032 0.79167 0.61413 0.56818 0.63415 0.58298 0.62434 0.58298 0.81818 0.81818 0.04528 0.01132 0.00000 0.04167 0.50000 0.53488 0.62911 0.60581 0.61417 0.60166 0.61345 0.63971 0.65493 0.67470 0.44186 0.43902 0.61333 0.61429 0.68394

0.57609 0.57923 0.79167 0.60504 0.60656 0.60326 0.57609 0.57297 0.57377 0.57065 0.79167 0.56831 0.57065 0.79167 0.60989 0.56154 0.60989 0.61497 0.62366 0.61497 0.81818 0.81818 0.00000 0.04167 0.04167 0.00000 0.00000 0.55789 0.60571 0.60227 0.62205 0.59659 0.62573 0.63953 0.71111 0.75962 1.00000 1.00000 0.67327 0.66667 0.68056

0.53571 0.60000 2.00000 0.53571 0.60000 0.53571 0.56000 0.51786 0.64000 0.55357 2.00000 0.55357 0.55357 2.00000 0.66667 0.66667 0.62500 0.39130 2.00000 0.39130 2.00000 2.00000 0.50000 0.50000 0.50000 0.00000 0.00000 0.56667 0.59375 0.53333 0.00000 0.53333 0.53333 0.52083 0.58000 0.48333 0.37209 0.41463 0.00000 0.45833 0.69388

0.59109 0.61574 0.73913 0.60656 0.63889 0.61134 0.58824 0.56349 0.61644 0.58400 0.69565 0.58065 0.58400 0.69565 0.60870 0.60305 0.62602 0.54008 0.57068 0.54008 0.79545 0.79545 0.53876 0.54264 0.53488 0.55789 0.56667 0.00000 0.66977 0.65844 0.65116 0.65432 0.65833 0.64493 0.63830 0.68902 0.51163 0.51220 0.67550 0.65248 0.70103

0.60664 0.60577 0.80000 0.63699 0.62500 0.62559 0.64734 0.64762 0.65049 0.65072 0.85000 0.64593 0.65072 0.85000 0.61364 0.56557 0.64286 0.65990 0.68156 0.65990 0.78049 0.78049 0.62441 0.62441 0.62911 0.60571 0.59375 0.66977 0.00000 0.00455 0.02256 0.00909 0.22857 0.65421 0.61468 0.65079 0.31250 0.31250 0.66116 0.70536 0.69512

0.58996 0.60577 0.80000 0.60920 0.62500 0.60669 0.64734 0.62605 0.65049 0.62869 0.85000 0.62447 0.62869 0.85000 0.61453 0.56800 0.62185 0.64000 0.68156 0.64000 0.78049 0.78049 0.60581 0.60166 0.60581 0.60227 0.53333 0.65844 0.00455 0.00000 0.02985 0.01210 0.20588 0.62687 0.61029 0.63636 0.46512 0.43902 0.62416 0.63571 0.69730

0.58915 0.59375 0.80000 0.60938 0.60938 0.60465 0.65354 0.64844 0.64286 0.63780 0.85000 0.62992 0.63780 0.85000 0.61719 0.55128 0.61719 0.66406 0.66929 0.66406 0.78049 0.78049 0.62205 0.61417 0.61417 0.62205 0.00000 0.65116 0.02256 0.02985 0.00000 0.00746 0.18898 0.69643 0.64706 0.64615 1.00000 1.00000 0.74684 0.75472 0.70408

0.58996 0.60577 0.80000 0.60920 0.62500 0.60669 0.64251 0.62185 0.64563 0.62447 0.85000 0.62025 0.62447 0.85000 0.62011 0.57600 0.62605 0.64000 0.68156 0.64000 0.78049 0.78049 0.60166 0.59751 0.60166 0.59659 0.53333 0.65432 0.00909 0.01210 0.00746 0.00000 0.21429 0.62687 0.61029 0.63636 0.46512 0.43902 0.62416 0.63571 0.69730

0.59130 0.60804 0.80000 0.61404 0.62814 0.60870 0.63682 0.61638 0.61500 0.59740 0.85000 0.59740 0.60173 0.85000 0.60819 0.54701 0.60776 0.62727 0.66667 0.62727 0.82927 0.82927 0.61345 0.60924 0.61345 0.62573 0.53333 0.65833 0.22857 0.20588 0.18898 0.21429 0.00000 0.61364 0.62500 0.64935 0.46512 0.43902 0.65333 0.64493 0.71429

0.59420 0.62037 0.65000 0.59542 0.63889 0.60870 0.62617 0.60584 0.63551 0.61314 0.70000 0.60448 0.61314 0.70000 0.68605 0.67391 0.68702 0.61029 0.68132 0.61029 0.81579 0.81579 0.63235 0.63971 0.63971 0.63953 0.52083 0.64493 0.65421 0.62687 0.69643 0.62687 0.61364 0.00000 0.50459 0.65657 0.41860 0.51220 0.64000 0.67290 0.70000

0.65493 0.67857 0.73913 0.65957 0.68750 0.66197 0.62500 0.61268 0.62500 0.61268 0.69565 0.61268 0.61268 0.69565 0.64835 0.62500 0.65248 0.58647 0.68478 0.58647 0.69048 0.69048 0.64789 0.65493 0.65493 0.71111 0.58000 0.63830 0.61468 0.61029 0.64706 0.61029 0.62500 0.50459 0.00000 0.67544 0.39024 0.48718 0.60825 0.67857 0.72072

0.69136 0.71756 0.90476 0.71642 0.74046 0.70988 0.72308 0.68944 0.72093 0.68125 0.95238 0.68553 0.67500 0.95238 0.76238 0.71831 0.73248 0.60625 0.67544 0.60625 0.72414 0.72414 0.68072 0.68072 0.67470 0.75962 0.48333 0.68902 0.65079 0.63636 0.64615 0.63636 0.64935 0.65657 0.67544 0.00000 0.00000 0.12195 0.66055 0.72566 0.76471

0.51163 0.38462 2.00000 0.51163 0.38462 0.51163 0.30769 0.46512 0.30769 0.44186 2.00000 0.44186 0.44186 2.00000 1.00000 1.00000 0.60465 0.41860 2.00000 0.41860 2.00000 2.00000 0.44186 0.44186 0.44186 1.00000 0.37209 0.51163 0.31250 0.46512 1.00000 0.46512 0.46512 0.41860 0.39024 0.00000 0.00000 0.12195 0.37209 0.51163 0.61765

0.51220 0.38462 2.00000 0.51220 0.38462 0.51220 0.30769 0.46341 0.30769 0.46341 2.00000 0.46341 0.46341 2.00000 1.00000 1.00000 0.60976 0.43902 2.00000 0.43902 2.00000 2.00000 0.43902 0.43902 0.43902 1.00000 0.41463 0.51220 0.31250 0.43902 1.00000 0.43902 0.43902 0.51220 0.48718 0.12195 0.12195 0.00000 0.41463 0.48780 0.64706

0.62416 0.66102 0.85000 0.62712 0.65254 0.61745 0.71552 0.66667 0.70435 0.65753 0.80000 0.65068 0.66438 0.80000 0.68269 0.73438 0.63946 0.60544 0.71287 0.60544 0.71429 0.71429 0.61333 0.61333 0.61333 0.67327 0.00000 0.67550 0.66116 0.62416 0.74684 0.62416 0.65333 0.64000 0.60825 0.66055 0.37209 0.41463 0.00000 0.64220 0.77311

0.64138 0.67544 0.81818 0.63309 0.65789 0.62759 0.67257 0.63194 0.66964 0.62937 0.81818 0.62238 0.62238 0.81818 0.68687 0.62963 0.67361 0.65957 0.72632 0.65957 0.91176 0.91176 0.61429 0.61429 0.61429 0.66667 0.45833 0.65248 0.70536 0.63571 0.75472 0.63571 0.64493 0.67290 0.67857 0.72566 0.51163 0.48780 0.64220 0.00000 0.72477

0.71958 0.73333 0.83333 0.69504 0.72121 0.70899 0.71341 0.71277 0.65244 0.65957 0.83333 0.65957 0.65957 0.83333 0.69014 0.70874 0.68984 0.67172 0.67284 0.67677 0.78049 0.78049 0.68394 0.68912 0.68394 0.68056 0.69388 0.70103 0.69512 0.69730 0.70408 0.69730 0.71429 0.70000 0.72072 0.76471 0.61765 0.64706 0.77311 0.72477 0.00000

Table D.6: The Adeno_E1A distance matrix with 41 species.

0.00000 0.61695 0.66848 0.62798 0.65206 0.68653 0.75726 0.73057 0.72987 0.74935 0.75196 0.75718 0.75718 0.68586 0.70270 0.70496 0.70496 0.69452 0.69452 0.70270 0.71802 0.73158 0.73158 0.75000 0.72585 0.72585 0.72093 0.81132

0.61695 0.00000 0.01338 0.44828 0.64189 0.69257 0.73264 0.71769 0.71769 0.71864 0.72203 0.71864 0.71864 0.68041 1.00000 0.70205 0.68493 0.68151 0.67808 1.00000 0.71233 0.68621 0.68966 2.00000 0.68493 0.68493 2.00000 0.70270

0.66848 0.01338 0.00000 0.52201 0.67480 0.72087 0.75346 0.74659 0.74590 0.74114 0.74659 0.74114 0.74114 0.70330 0.77586 0.71233 0.69863 0.70411 0.70137 0.77586 0.72055 0.71901 0.72176 0.94118 0.71233 0.71233 0.83871 0.73585

0.62798 0.44828 0.52201 0.00000 0.69345 0.68358 0.70571 0.76647 0.75676 0.76647 0.76946 0.76946 0.76946 0.67674 0.71622 0.68072 0.67470 0.68373 0.68373 0.64865 0.68675 0.69301 0.67781 0.64286 0.69670 0.69069 0.67442 1.00000

0.65206 0.64189 0.67480 0.69345 0.00000 0.67183 0.74934 0.73643 0.73057 0.74479 0.74740 0.73438 0.73438 0.72324 0.77027 0.72135 0.71875 0.70833 0.70573 0.70270 0.70833 0.73228 0.73753 0.82143 0.72396 0.72135 0.76744 0.87037

0.68653 0.69257 0.72087 0.68358 0.67183 0.00000 0.71958 0.74286 0.73698 0.74479 0.75000 0.76042 0.76042 0.69895 0.71233 0.70757 0.71540 0.72063 0.72324 0.78082 0.71279 0.72105 0.72895 0.85714 0.70235 0.69974 0.81395 0.85185

0.75726 0.73264 0.75346 0.70571 0.74934 0.71958 0.00000 0.77428 0.77105 0.79683 0.79947 0.79947 0.79947 0.73670 0.77027 0.75200 0.76000 0.76267 0.76267 0.77027 0.76000 0.75806 0.77151 0.78571 0.75733 0.76000 0.69767 0.87500

0.73057 0.71769 0.74659 0.76647 0.73643 0.74286 0.77428 0.00000 0.01005 0.79112 0.79373 0.79373 0.79373 0.77285 0.78378 0.74219 0.75260 0.76823 0.77083 0.78378 0.76042 0.76903 0.77953 0.92857 0.75781 0.75521 0.74419 0.87037

0.72987 0.71769 0.74590 0.75676 0.73057 0.73698 0.77105 0.01005 0.00000 0.78272 0.78534 0.78534 0.78534 0.76702 0.75342 0.73629 0.74674 0.75979 0.76240 0.73973 0.75718 0.76053 0.77105 0.81481 0.74935 0.74674 0.66667 0.87037

0.74935 0.71864 0.74114 0.76647 0.74479 0.74479 0.79683 0.79112 0.78272 0.00000 0.00517 0.11886 0.11886 0.74474 0.73611 0.74474 0.74211 0.74474 0.74211 0.73611 0.76316 0.74801 0.76658 0.75000 0.76579 0.76579 0.71429 0.84615

0.75196 0.72203 0.74659 0.76946 0.74740 0.75000 0.79947 0.79373 0.78534 0.00517 0.00000 0.12403 0.12403 0.74737 0.75000 0.74737 0.74474 0.74737 0.74474 0.75000 0.76579 0.75332 0.77188 0.75000 0.77105 0.77105 0.73810 0.84615

0.75718 0.71864 0.74114 0.76946 0.73438 0.76042 0.79947 0.79373 0.78534 0.11886 0.12403 0.00000 0.00000 0.73158 0.72222 0.72632 0.73158 0.74211 0.73947 0.72222 0.74737 0.74005 0.75862 0.75000 0.76053 0.76053 0.71429 0.84615

0.75718 0.71864 0.74114 0.76946 0.73438 0.76042 0.79947 0.79373 0.78534 0.11886 0.12403 0.00000 0.00000 0.73158 0.72222 0.72632 0.73158 0.74211 0.73947 0.72222 0.74737 0.74005 0.75862 0.75000 0.76053 0.76053 0.71429 0.84615

0.68586 0.68041 0.70330 0.67674 0.72324 0.69895 0.73670 0.77285 0.76702 0.74474 0.74737 0.73158 0.73158 0.00000 0.06757 0.24543 0.23238 0.38903 0.38381 0.29730 0.38903 0.48421 0.48947 0.57143 0.47258 0.46997 0.53488 0.62963

0.70270 1.00000 0.77586 0.71622 0.77027 0.71233 0.77027 0.78378 0.75342 0.73611 0.75000 0.72222 0.72222 0.06757 0.00000 0.25676 0.21622 0.33784 0.33784 0.33784 0.31081 0.51351 0.51351 0.57143 0.50000 0.50000 0.53488 2.00000

0.70496 0.70205 0.71233 0.68072 0.72135 0.70757 0.75200 0.74219 0.73629 0.74474 0.74737 0.72632 0.72632 0.24543 0.25676 0.00000 0.12240 0.41667 0.41927 0.32432 0.39323 0.49606 0.50394 0.64286 0.49219 0.48177 0.60465 0.61111

0.70496 0.68493 0.69863 0.67470 0.71875 0.71540 0.76000 0.75260 0.74674 0.74211 0.74474 0.73158 0.73158 0.23238 0.21622 0.12240 0.00000 0.41406 0.41406 0.33784 0.38281 0.48031 0.49081 0.60714 0.46094 0.45052 0.55814 0.55556

0.69452 0.68151 0.70411 0.68373 0.70833 0.72063 0.76267 0.76823 0.75979 0.74474 0.74737 0.74211 0.74211 0.38903 0.33784 0.41667 0.41406 0.00000 0.00781 0.00000 0.42708 0.48556 0.47507 0.50000 0.45573 0.44792 0.46512 0.72222

0.69452 0.67808 0.70137 0.68373 0.70573 0.72324 0.76267 0.77083 0.76240 0.74211 0.74474 0.73947 0.73947 0.38381 0.33784 0.41927 0.41406 0.00781 0.00000 0.00000 0.42448 0.48031 0.46982 0.50000 0.45052 0.44271 0.46512 0.68519

0.70270 1.00000 0.77586 0.64865 0.70270 0.78082 0.77027 0.78378 0.73973 0.73611 0.75000 0.72222 0.72222 0.29730 0.33784 0.32432 0.33784 0.00000 0.00000 0.00000 0.33784 0.45946 0.44595 0.50000 0.41892 0.41892 0.46512 2.00000

0.71802 0.71233 0.72055 0.68675 0.70833 0.71279 0.76000 0.76042 0.75718 0.76316 0.76579 0.74737 0.74737 0.38903 0.31081 0.39323 0.38281 0.42708 0.42448 0.33784 0.00000 0.50394 0.51181 0.50000 0.49479 0.49219 0.51163 0.59259

0.73158 0.68621 0.71901 0.69301 0.73228 0.72105 0.75806 0.76903 0.76053 0.74801 0.75332 0.74005 0.74005 0.48421 0.51351 0.49606 0.48031 0.48556 0.48031 0.45946 0.50394 0.00000 0.09974 0.03571 0.37795 0.38058 0.32558 0.51852

0.73158 0.68966 0.72176 0.67781 0.73753 0.72895 0.77151 0.77953 0.77105 0.76658 0.77188 0.75862 0.75862 0.48947 0.51351 0.50394 0.49081 0.47507 0.46982 0.44595 0.51181 0.09974 0.00000 0.00000 0.39895 0.40157 0.34884 0.50000

0.75000 2.00000 0.94118 0.64286 0.82143 0.85714 0.78571 0.92857 0.81481 0.75000 0.75000 0.75000 0.75000 0.57143 0.57143 0.64286 0.60714 0.50000 0.50000 0.50000 0.50000 0.03571 0.00000 0.00000 0.25000 0.25000 0.25000 2.00000

0.72585 0.68493 0.71233 0.69670 0.72396 0.70235 0.75733 0.75781 0.74935 0.76579 0.77105 0.76053 0.76053 0.47258 0.50000 0.49219 0.46094 0.45573 0.45052 0.41892 0.49479 0.37795 0.39895 0.25000 0.00000 0.01039 0.04651 0.42593

0.72585 0.68493 0.71233 0.69069 0.72135 0.69974 0.76000 0.75521 0.74674 0.76579 0.77105 0.76053 0.76053 0.46997 0.50000 0.48177 0.45052 0.44792 0.44271 0.41892 0.49219 0.38058 0.40157 0.25000 0.01039 0.00000 0.04651 0.42593

0.72093 2.00000 0.83871 0.67442 0.76744 0.81395 0.69767 0.74419 0.66667 0.71429 0.73810 0.71429 0.71429 0.53488 0.53488 0.60465 0.55814 0.46512 0.46512 0.46512 0.51163 0.32558 0.34884 0.25000 0.04651 0.04651 0.00000 2.00000

0.81132 0.70270 0.73585 1.00000 0.87037 0.85185 0.87500 0.87037 0.87037 0.84615 0.84615 0.84615 0.84615 0.62963 2.00000 0.61111 0.55556 0.72222 0.68519 2.00000 0.59259 0.51852 0.50000 2.00000 0.42593 0.42593 2.00000 0.00000

Table D.7: The Adeno_E1B_55K distance matrix with 28 species.

Distance Matrices

Q9W859/66Q9PXF7/1-2 Q9EEG1/13Q994F1/1-3 E1BL_ADET1 Q64844/34Q83459/27Q64892/36E1BL_ADEM1 E1BL_ADECR E1BL_ADECC P90264/55E1BL_ADECT Q8UY89/112 Q8UY90/93Q8B8U5/108 E1BL_ADE07 E1BL_ADE05 E1BL_ADE02 Q67790/79Q9YL98/109 E1BL_ADE41 E1BL_ADE40 Q64856/74E1BL_ADE12 Q64837/95Q89799/65Q64876/102

0.00000 0.21212 0.36893 0.36893 0.37864 0.39806 0.38835 0.38835 0.38835 0.38835 0.38835 0.38835 0.38835 0.38835 0.38835 0.38835 0.38835 0.38835 0.39806 0.39806 0.57009 0.56075 0.56075 0.58879 0.63107 0.61616 0.61616 0.61616 0.63107 0.66279 0.69565

0.21212 0.00000 0.38384 0.38384 0.37374 0.41414 0.38384 0.37374 0.37374 0.37374 0.37374 0.37374 0.37374 0.37374 0.37374 0.37374 0.37374 0.37374 0.39394 0.39394 0.52041 0.51020 0.52041 0.55102 0.63265 0.59375 0.59375 0.59375 0.64286 0.66667 0.67778

0.36893 0.38384 0.00000 0.00971 0.01942 0.07767 0.30097 0.30097 0.30097 0.30097 0.30097 0.30097 0.30097 0.30097 0.30097 0.30097 0.30097 0.30097 0.31068 0.31068 0.55000 0.55000 0.57000 0.59000 0.62745 0.57143 0.57143 0.57143 0.61765 0.67442 0.70652

0.36893 0.38384 0.00971 0.00000 0.00971 0.06796 0.31068 0.31068 0.31068 0.31068 0.31068 0.31068 0.31068 0.31068 0.31068 0.31068 0.31068 0.31068 0.32039 0.32039 0.55000 0.55000 0.57000 0.59000 0.62745 0.57143 0.57143 0.57143 0.61765 0.67442 0.70652

0.37864 0.37374 0.01942 0.00971 0.00000 0.07767 0.32039 0.32039 0.32039 0.32039 0.32039 0.32039 0.32039 0.32039 0.32039 0.32039 0.32039 0.32039 0.33010 0.33010 0.54000 0.54000 0.56000 0.58000 0.63725 0.58163 0.58163 0.58163 0.61765 0.67442 0.70652

0.39806 0.41414 0.07767 0.06796 0.07767 0.00000 0.31068 0.31068 0.31068 0.31068 0.31068 0.31068 0.31068 0.31068 0.31068 0.31068 0.31068 0.32039 0.32039 0.32039 0.53000 0.53000 0.55000 0.57000 0.64706 0.59184 0.59184 0.59184 0.60784 0.69767 0.70652

0.38835 0.38384 0.30097 0.31068 0.32039 0.31068 0.00000 0.01942 0.01942 0.01942 0.01942 0.01942 0.01942 0.01942 0.02913 0.02913 0.02913 0.02913 0.03883 0.03883 0.45000 0.45000 0.47000 0.48000 0.64706 0.60204 0.60204 0.60204 0.66667 0.66279 0.65217

0.38835 0.37374 0.30097 0.31068 0.32039 0.31068 0.01942 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00971 0.00971 0.00971 0.00971 0.01942 0.02913 0.45000 0.45000 0.47000 0.48000 0.65686 0.61224 0.61224 0.61224 0.65686 0.66279 0.66304

0.38835 0.37374 0.30097 0.31068 0.32039 0.31068 0.01942 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00971 0.00971 0.00971 0.00971 0.01942 0.02913 0.45000 0.45000 0.47000 0.48000 0.65686 0.61224 0.61224 0.61224 0.65686 0.66279 0.66304

0.38835 0.37374 0.30097 0.31068 0.32039 0.31068 0.01942 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00971 0.00971 0.00971 0.00971 0.01942 0.02913 0.45000 0.45000 0.47000 0.48000 0.65686 0.61224 0.61224 0.61224 0.65686 0.66279 0.66304

0.38835 0.37374 0.30097 0.31068 0.32039 0.31068 0.01942 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00971 0.00971 0.00971 0.00971 0.01942 0.02913 0.45000 0.45000 0.47000 0.48000 0.65686 0.61224 0.61224 0.61224 0.65686 0.66279 0.66304

0.38835 0.37374 0.30097 0.31068 0.32039 0.31068 0.01942 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00971 0.00971 0.00971 0.00971 0.01942 0.02913 0.45000 0.45000 0.47000 0.48000 0.65686 0.61224 0.61224 0.61224 0.65686 0.66279 0.66304

0.38835 0.37374 0.30097 0.31068 0.32039 0.31068 0.01942 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00971 0.00971 0.00971 0.00971 0.01942 0.02913 0.45000 0.45000 0.47000 0.48000 0.65686 0.61224 0.61224 0.61224 0.65686 0.66279 0.66304

0.38835 0.37374 0.30097 0.31068 0.32039 0.31068 0.01942 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00971 0.00971 0.00971 0.00971 0.01942 0.02913 0.45000 0.45000 0.47000 0.48000 0.65686 0.61224 0.61224 0.61224 0.65686 0.66279 0.66304

0.38835 0.37374 0.30097 0.31068 0.32039 0.31068 0.02913 0.00971 0.00971 0.00971 0.00971 0.00971 0.00971 0.00971 0.00000 0.00000 0.00000 0.01942 0.02913 0.03883 0.45000 0.45000 0.47000 0.48000 0.65686 0.61224 0.61224 0.61224 0.64706 0.66279 0.66304

0.38835 0.37374 0.30097 0.31068 0.32039 0.31068 0.02913 0.00971 0.00971 0.00971 0.00971 0.00971 0.00971 0.00971 0.00000 0.00000 0.00000 0.01942 0.02913 0.03883 0.45000 0.45000 0.47000 0.48000 0.65686 0.61224 0.61224 0.61224 0.64706 0.66279 0.66304

0.38835 0.37374 0.30097 0.31068 0.32039 0.31068 0.02913 0.00971 0.00971 0.00971 0.00971 0.00971 0.00971 0.00971 0.00000 0.00000 0.00000 0.01942 0.02913 0.03883 0.45000 0.45000 0.47000 0.48000 0.65686 0.61224 0.61224 0.61224 0.64706 0.66279 0.66304

0.38835 0.37374 0.30097 0.31068 0.32039 0.32039 0.02913 0.00971 0.00971 0.00971 0.00971 0.00971 0.00971 0.00971 0.01942 0.01942 0.01942 0.00000 0.02913 0.03883 0.44000 0.44000 0.46000 0.47000 0.65686 0.61224 0.61224 0.61224 0.65686 0.65116 0.66304

0.39806 0.39394 0.31068 0.32039 0.33010 0.32039 0.03883 0.01942 0.01942 0.01942 0.01942 0.01942 0.01942 0.01942 0.02913 0.02913 0.02913 0.02913 0.00000 0.04854 0.47000 0.47000 0.49000 0.50000 0.66667 0.62245 0.62245 0.62245 0.66667 0.68605 0.66304

0.39806 0.39394 0.31068 0.32039 0.33010 0.32039 0.03883 0.02913 0.02913 0.02913 0.02913 0.02913 0.02913 0.02913 0.03883 0.03883 0.03883 0.03883 0.04854 0.00000 0.46000 0.46000 0.48000 0.49000 0.66667 0.62245 0.62245 0.62245 0.66667 0.67442 0.66304

0.57009 0.52041 0.55000 0.55000 0.54000 0.53000 0.45000 0.45000 0.45000 0.45000 0.45000 0.45000 0.45000 0.45000 0.45000 0.45000 0.45000 0.44000 0.47000 0.46000 0.00000 0.00926 0.06481 0.13889 0.64356 0.60606 0.60606 0.60606 0.64356 0.66667 0.70000

0.56075 0.51020 0.55000 0.55000 0.54000 0.53000 0.45000 0.45000 0.45000 0.45000 0.45000 0.45000 0.45000 0.45000 0.45000 0.45000 0.45000 0.44000 0.47000 0.46000 0.00926 0.00000 0.05556 0.12963 0.64356 0.60606 0.60606 0.60606 0.64356 0.66667 0.70000

0.56075 0.52041 0.57000 0.57000 0.56000 0.55000 0.47000 0.47000 0.47000 0.47000 0.47000 0.47000 0.47000 0.47000 0.47000 0.47000 0.47000 0.46000 0.49000 0.48000 0.06481 0.05556 0.00000 0.08333 0.66337 0.61616 0.61616 0.61616 0.66337 0.67857 0.70000

0.58879 0.55102 0.59000 0.59000 0.58000 0.57000 0.48000 0.48000 0.48000 0.48000 0.48000 0.48000 0.48000 0.48000 0.48000 0.48000 0.48000 0.47000 0.50000 0.49000 0.13889 0.12963 0.08333 0.00000 0.66337 0.61616 0.61616 0.61616 0.67327 0.70238 0.71111

0.63107 0.63265 0.62745 0.62745 0.63725 0.64706 0.64706 0.65686 0.65686 0.65686 0.65686 0.65686 0.65686 0.65686 0.65686 0.65686 0.65686 0.65686 0.66667 0.66667 0.64356 0.64356 0.66337 0.66337 0.00000 0.13000 0.13000 0.13000 0.16346 0.59302 0.66304

0.61616 0.59375 0.57143 0.57143 0.58163 0.59184 0.60204 0.61224 0.61224 0.61224 0.61224 0.61224 0.61224 0.61224 0.61224 0.61224 0.61224 0.61224 0.62245 0.62245 0.60606 0.60606 0.61616 0.61616 0.13000 0.00000 0.00000 0.02000 0.19000 0.60976 0.61798

0.61616 0.59375 0.57143 0.57143 0.58163 0.59184 0.60204 0.61224 0.61224 0.61224 0.61224 0.61224 0.61224 0.61224 0.61224 0.61224 0.61224 0.61224 0.62245 0.62245 0.60606 0.60606 0.61616 0.61616 0.13000 0.00000 0.00000 0.02000 0.19000 0.60976 0.61798

0.61616 0.59375 0.57143 0.57143 0.58163 0.59184 0.60204 0.61224 0.61224 0.61224 0.61224 0.61224 0.61224 0.61224 0.61224 0.61224 0.61224 0.61224 0.62245 0.62245 0.60606 0.60606 0.61616 0.61616 0.13000 0.02000 0.02000 0.00000 0.19000 0.60976 0.61798

0.63107 0.64286 0.61765 0.61765 0.61765 0.60784 0.66667 0.65686 0.65686 0.65686 0.65686 0.65686 0.65686 0.65686 0.64706 0.64706 0.64706 0.65686 0.66667 0.66667 0.64356 0.64356 0.66337 0.67327 0.16346 0.19000 0.19000 0.19000 0.00000 0.63953 0.66304

0.66279 0.66667 0.67442 0.67442 0.67442 0.69767 0.66279 0.66279 0.66279 0.66279 0.66279 0.66279 0.66279 0.66279 0.66279 0.66279 0.66279 0.65116 0.68605 0.67442 0.66667 0.66667 0.67857 0.70238 0.59302 0.60976 0.60976 0.60976 0.63953 0.00000 0.59524

0.69565 0.67778 0.70652 0.70652 0.70652 0.70652 0.65217 0.66304 0.66304 0.66304 0.66304 0.66304 0.66304 0.66304 0.66304 0.66304 0.66304 0.66304 0.66304 0.66304 0.70000 0.70000 0.70000 0.71111 0.66304 0.61798 0.61798 0.61798 0.66304 0.59524 0.00000

0.50000 0.41667 0.58333 0.58333 0.66667 0.58333 0.50000 0.25000 0.50000 0.58333 0.50000 0.58333 0.41667 0.50000 0.50000 0.50000 0.50000 0.58333 0.50000 0.50000 0.41667 0.75000 0.66667 0.41667 0.75000 0.58333 0.66667 0.58333 0.00000 0.50000

0.83333 0.41667 0.41667 0.25000 1.00000 0.91667 0.83333 0.41667 0.83333 0.25000 0.83333 0.41667 0.75000 0.50000 0.50000 0.83333 0.83333 0.91667 0.33333 0.83333 0.75000 1.08333 1.00000 0.75000 1.08333 0.91667 1.00000 0.91667 0.50000 0.00000


Q8UY70/32Q8BEL0/34Q8JW22/32Q8B8S3/32Q67731/32Q91PK7/32Q83120/32E315_ADE07 Q99I45/14Q99I47/14Q99I59/14Q99I57/14Q99I55/14Q99I53/14Q9QL90/32Q99I49/14Q99I51/14E315_ADE03 Q91CK9/43Q91CJ9/43Q9ICE4/23Q9JFL8/22Q9JFM4/22Q9JFM1/22E315_ADE05 E315_ADE02 Q910Z2/31E315_ADE06 O12401/30Q89461/17E315_ADE12

Table D.8: The Adeno_E3_14_5 distance matrix with 31 species.

species00 species01 species02 species03 species04 species05 species06 species07 species08 species09 species10 species11 species12 species13 species14 species15 species16 species17 species18 species19 species20 species21 species22 species23 species24 species25 species26 species27 species28 species29

0.00000 0.75000 0.91667 0.91667 0.66667 0.75000 0.66667 0.58333 0.16667 0.91667 0.50000 0.91667 0.58333 0.83333 0.83333 0.33333 0.50000 0.58333 0.83333 0.33333 0.58333 0.75000 0.66667 0.58333 0.75000 0.58333 0.66667 0.75000 0.50000 0.83333

0.75000 0.00000 0.50000 0.50000 0.91667 0.83333 0.75000 0.33333 0.75000 0.50000 0.75000 0.50000 0.66667 0.25000 0.25000 0.75000 0.75000 0.83333 0.41667 0.75000 0.66667 1.00000 0.91667 0.66667 1.00000 0.83333 0.91667 0.83333 0.41667 0.41667

0.91667 0.50000 0.00000 0.50000 1.08333 1.00000 0.91667 0.50000 0.91667 0.50000 0.91667 0.16667 0.83333 0.58333 0.58333 0.91667 0.91667 1.00000 0.25000 0.91667 0.83333 1.16667 1.08333 0.83333 1.16667 1.00000 1.08333 1.00000 0.58333 0.41667

0.91667 0.50000 0.50000 0.00000 1.08333 1.00000 0.91667 0.50000 0.91667 0.16667 0.91667 0.50000 0.83333 0.58333 0.58333 0.91667 0.91667 1.00000 0.41667 0.91667 0.83333 1.16667 1.08333 0.83333 1.16667 1.00000 1.08333 1.00000 0.58333 0.25000

0.66667 0.91667 1.08333 1.08333 0.00000 0.91667 0.83333 0.75000 0.66667 1.08333 0.50000 1.08333 0.75000 1.00000 1.00000 0.66667 0.33333 0.58333 1.00000 0.66667 0.75000 0.75000 0.66667 0.75000 0.75000 0.25000 0.16667 0.91667 0.66667 1.00000

0.75000 0.83333 1.00000 1.00000 0.91667 0.00000 0.25000 0.66667 0.75000 1.00000 0.75000 1.00000 0.50000 0.91667 0.91667 0.75000 0.75000 0.83333 0.91667 0.75000 0.33333 1.00000 0.91667 0.50000 1.00000 0.83333 0.91667 0.16667 0.58333 0.91667

0.66667 0.75000 0.91667 0.91667 0.83333 0.25000 0.00000 0.58333 0.66667 0.91667 0.66667 0.91667 0.41667 0.83333 0.83333 0.66667 0.66667 0.75000 0.83333 0.66667 0.25000 0.91667 0.83333 0.41667 0.91667 0.75000 0.83333 0.25000 0.50000 0.83333

0.58333 0.33333 0.50000 0.50000 0.75000 0.66667 0.58333 0.00000 0.58333 0.50000 0.58333 0.50000 0.50000 0.41667 0.41667 0.58333 0.58333 0.66667 0.41667 0.58333 0.50000 0.83333 0.75000 0.50000 0.83333 0.66667 0.75000 0.66667 0.25000 0.41667

0.16667 0.75000 0.91667 0.91667 0.66667 0.75000 0.66667 0.58333 0.00000 0.91667 0.50000 0.91667 0.58333 0.83333 0.83333 0.33333 0.50000 0.58333 0.83333 0.33333 0.58333 0.75000 0.66667 0.58333 0.75000 0.58333 0.66667 0.75000 0.50000 0.83333

0.91667 0.50000 0.50000 0.16667 1.08333 1.00000 0.91667 0.50000 0.91667 0.00000 0.91667 0.50000 0.83333 0.58333 0.58333 0.91667 0.91667 1.00000 0.41667 0.91667 0.83333 1.16667 1.08333 0.83333 1.16667 1.00000 1.08333 1.00000 0.58333 0.25000

0.50000 0.75000 0.91667 0.91667 0.50000 0.75000 0.66667 0.58333 0.50000 0.91667 0.00000 0.91667 0.58333 0.83333 0.83333 0.50000 0.33333 0.25000 0.83333 0.50000 0.58333 0.41667 0.33333 0.58333 0.41667 0.41667 0.50000 0.75000 0.50000 0.83333

0.91667 0.50000 0.16667 0.50000 1.08333 1.00000 0.91667 0.50000 0.91667 0.50000 0.91667 0.00000 0.83333 0.58333 0.58333 0.91667 0.91667 1.00000 0.25000 0.91667 0.83333 1.16667 1.08333 0.83333 1.16667 1.00000 1.08333 1.00000 0.58333 0.41667

0.58333 0.66667 0.83333 0.83333 0.75000 0.50000 0.41667 0.50000 0.58333 0.83333 0.58333 0.83333 0.00000 0.75000 0.75000 0.58333 0.58333 0.66667 0.75000 0.58333 0.33333 0.83333 0.75000 0.16667 0.83333 0.66667 0.75000 0.50000 0.41667 0.75000

0.83333 0.25000 0.58333 0.58333 1.00000 0.91667 0.83333 0.41667 0.83333 0.58333 0.83333 0.58333 0.75000 0.00000 0.16667 0.83333 0.83333 0.91667 0.50000 0.83333 0.75000 1.08333 1.00000 0.75000 1.08333 0.91667 1.00000 0.91667 0.50000 0.50000

0.83333 0.25000 0.58333 0.58333 1.00000 0.91667 0.83333 0.41667 0.83333 0.58333 0.83333 0.58333 0.75000 0.16667 0.00000 0.83333 0.83333 0.91667 0.50000 0.83333 0.75000 1.08333 1.00000 0.75000 1.08333 0.91667 1.00000 0.91667 0.50000 0.50000

0.33333 0.75000 0.91667 0.91667 0.66667 0.75000 0.66667 0.58333 0.33333 0.91667 0.50000 0.91667 0.58333 0.83333 0.83333 0.00000 0.50000 0.58333 0.83333 0.16667 0.58333 0.75000 0.66667 0.58333 0.75000 0.58333 0.66667 0.75000 0.50000 0.83333

0.50000 0.75000 0.91667 0.91667 0.33333 0.75000 0.66667 0.58333 0.50000 0.91667 0.33333 0.91667 0.58333 0.83333 0.83333 0.50000 0.00000 0.41667 0.83333 0.50000 0.58333 0.58333 0.50000 0.58333 0.58333 0.25000 0.33333 0.75000 0.50000 0.83333

0.58333 0.83333 1.00000 1.00000 0.58333 0.83333 0.75000 0.66667 0.58333 1.00000 0.25000 1.00000 0.66667 0.91667 0.91667 0.58333 0.41667 0.00000 0.91667 0.58333 0.66667 0.33333 0.25000 0.66667 0.33333 0.50000 0.58333 0.83333 0.58333 0.91667

0.83333 0.41667 0.25000 0.41667 1.00000 0.91667 0.83333 0.41667 0.83333 0.41667 0.83333 0.25000 0.75000 0.50000 0.50000 0.83333 0.83333 0.91667 0.00000 0.83333 0.75000 1.08333 1.00000 0.75000 1.08333 0.91667 1.00000 0.91667 0.50000 0.33333

0.33333 0.75000 0.91667 0.91667 0.66667 0.75000 0.66667 0.58333 0.33333 0.91667 0.50000 0.91667 0.58333 0.83333 0.83333 0.16667 0.50000 0.58333 0.83333 0.00000 0.58333 0.75000 0.66667 0.58333 0.75000 0.58333 0.66667 0.75000 0.50000 0.83333

0.58333 0.66667 0.83333 0.83333 0.75000 0.33333 0.25000 0.50000 0.58333 0.83333 0.58333 0.83333 0.33333 0.75000 0.75000 0.58333 0.58333 0.66667 0.75000 0.58333 0.00000 0.83333 0.75000 0.33333 0.83333 0.66667 0.75000 0.33333 0.41667 0.75000

0.75000 1.00000 1.16667 1.16667 0.75000 1.00000 0.91667 0.83333 0.75000 1.16667 0.41667 1.16667 0.83333 1.08333 1.08333 0.75000 0.58333 0.33333 1.08333 0.75000 0.83333 0.00000 0.25000 0.83333 0.16667 0.66667 0.75000 1.00000 0.75000 1.08333

0.66667 0.91667 1.08333 1.08333 0.66667 0.91667 0.83333 0.75000 0.66667 1.08333 0.33333 1.08333 0.75000 1.00000 1.00000 0.66667 0.50000 0.25000 1.00000 0.66667 0.75000 0.25000 0.00000 0.75000 0.25000 0.58333 0.66667 0.91667 0.66667 1.00000

0.58333 0.66667 0.83333 0.83333 0.75000 0.50000 0.41667 0.50000 0.58333 0.83333 0.58333 0.83333 0.16667 0.75000 0.75000 0.58333 0.58333 0.66667 0.75000 0.58333 0.33333 0.83333 0.75000 0.00000 0.83333 0.66667 0.75000 0.50000 0.41667 0.75000

0.75000 1.00000 1.16667 1.16667 0.75000 1.00000 0.91667 0.83333 0.75000 1.16667 0.41667 1.16667 0.83333 1.08333 1.08333 0.75000 0.58333 0.33333 1.08333 0.75000 0.83333 0.16667 0.25000 0.83333 0.00000 0.66667 0.75000 1.00000 0.75000 1.08333

0.58333 0.83333 1.00000 1.00000 0.25000 0.83333 0.75000 0.66667 0.58333 1.00000 0.41667 1.00000 0.66667 0.91667 0.91667 0.58333 0.25000 0.50000 0.91667 0.58333 0.66667 0.66667 0.58333 0.66667 0.66667 0.00000 0.25000 0.83333 0.58333 0.91667

0.66667 0.91667 1.08333 1.08333 0.16667 0.91667 0.83333 0.75000 0.66667 1.08333 0.50000 1.08333 0.75000 1.00000 1.00000 0.66667 0.33333 0.58333 1.00000 0.66667 0.75000 0.75000 0.66667 0.75000 0.75000 0.25000 0.00000 0.91667 0.66667 1.00000

0.75000 0.83333 1.00000 1.00000 0.91667 0.16667 0.25000 0.66667 0.75000 1.00000 0.75000 1.00000 0.50000 0.91667 0.91667 0.75000 0.75000 0.83333 0.91667 0.75000 0.33333 1.00000 0.91667 0.50000 1.00000 0.83333 0.91667 0.00000 0.58333 0.91667

Table D.9: The auto generated distance matrix with 30 species.
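The matrices in this appendix are plain whitespace-separated values. The following is a minimal sketch, not the thesis implementation, for loading such a matrix and checking the properties a distance matrix should satisfy (square, symmetric, zero diagonal); the one-row-per-line text format and the function name are illustrative assumptions.

```python
def load_distance_matrix(text):
    """Parse a whitespace-separated distance matrix and validate it."""
    rows = [[float(x) for x in line.split()]
            for line in text.strip().splitlines() if line.strip()]
    n = len(rows)
    if any(len(row) != n for row in rows):
        raise ValueError("matrix is not square")
    for i in range(n):
        if rows[i][i] != 0.0:
            raise ValueError("diagonal entry %d is not zero" % i)
        for j in range(i + 1, n):
            if abs(rows[i][j] - rows[j][i]) > 1e-9:
                raise ValueError("matrix not symmetric at (%d, %d)" % (i, j))
    return rows

# Tiny three-species example in the same layout as the tables above.
example = """
0.00000 0.50000 0.80000
0.50000 0.00000 0.30000
0.80000 0.30000 0.00000
"""
matrix = load_distance_matrix(example)
print(len(matrix))  # number of species -> 3
```

Such a check is worth running before any least-squares fit, since an asymmetric or non-zero-diagonal input usually indicates a parsing error rather than real data.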


Figure D.1: The phylogenetic tree of the generated distance matrix in table D.9. The figure was created using Phy-fi [8].

Bibliography

[1] Thomas Bäck. Self-adaptation in genetic algorithms. In F. Varela and P. Bourgine, editors, The 1st European Conference on Artificial Life, pages 263–271, Cambridge, MA, 1992. MIT Press.

[2] M. J. Brauer, M. T. Holder, L. A. Dries, D. J. Zwickl, P. O. Lewis, and D. M. Hillis. Genetic algorithms and parallel processing in maximum-likelihood phylogeny inference. Molecular Biology and Evolution, 19(10):1717–1726, 2002.

[3] David Bryant and Peter Waddell. Rapid evaluation of least-squares and minimum-evolution criteria on phylogenetic trees. Molecular Biology and Evolution, 15(10):1346–1359, 1998.

[4] C. Cotta and P. Moscato. Inferring phylogenetic trees using evolutionary algorithms. In J. J. Merelo, P. Adamidis, H.-G. Beyer, J.-L. Fernández-Villacañas, and H.-P. Schwefel, editors, Parallel Problem Solving From Nature VII, volume 2439 of Lecture Notes in Computer Science, pages 720–729. Springer-Verlag, Berlin, 2002.

[5] Ágoston Endre Eiben, Robert Hinterding, and Zbigniew Michalewicz. Parameter control in evolutionary algorithms. IEEE Transactions on Evolutionary Computation, 3(2):124–141, July 1999.

[6] M. Farach, S. Kannan, and T. Warnow. A robust model for finding optimal evolutionary trees. Algorithmica, 13:155–179, 1995.

[7] Joseph Felsenstein. Inferring Phylogenies. Sinauer Associates, Inc., 2004. ISBN 0-87893-177-5.

[8] Jakob Fredslund. Phy-fi: fast and easy online creation and manipulation of phylogeny color figures. BMC Bioinformatics, 7:315, 2006.

[9] Jens Gottlieb, Bryant A. Julstrom, Günther R. Raidl, and Franz Rothlauf. Prüfer numbers: A poor representation of spanning trees for evolutionary search. In GECCO 2001: Proceedings of the Genetic and Evolutionary Computation Conference, pages 343–350, San Francisco, California, USA, July 7–11 2001. Morgan Kaufmann.

[10] Dan Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, 1999. ISBN 0-521-58519-8.

[11] Randy L. Haupt. Optimum population size and mutation rate for a simple real genetic algorithm that optimizes array factors. In Antennas and Propagation Society International Symposium, volume 2, pages 1034–1037. IEEE, 2000.

[12] P. O. Lewis. A genetic algorithm for maximum-likelihood phylogeny using nucleotide sequence data. Molecular Biology and Evolution, 15(3):277–283, 1998.

[13] H. Matsuda. Protein phylogenetic inference using maximum likelihood with a genetic algorithm. Pacific Symposium on Biocomputing '96, pages 512–523, 1996.

[14] Zbigniew Michalewicz and David B. Fogel. How to Solve It: Modern Heuristics. Springer-Verlag, 2nd edition, 2004. ISBN 3-540-22494-7.

[15] Masatoshi Nei and Sudhir Kumar. Molecular Evolution and Phylogenetics. Oxford University Press, Inc., 2000. ISBN 0-19-513585-7.

[16] Leon Poladian and Lars S. Jermiin. What might evolutionary algorithms (EA) and multi-objective optimisation (MOO) contribute to phylogenetics and the total evidence debate? In GECCO 2004: Proceedings of the Genetic and Evolutionary Computation Conference, volume 3102 of Lecture Notes in Computer Science, Seattle, USA, June 27–30 2004. Springer.

[17] T. H. Reijmers, R. Wehrens, and L. M. C. Buydens. Quality criteria of genetic algorithms for construction of phylogenetic trees. Journal of Computational Chemistry, 20(8):867–876, 1999.

[18] Andrey Rzhetsky and Masatoshi Nei. A simple method for estimating and testing minimum-evolution trees. Molecular Biology and Evolution, 9(5):945–967, 1992.

[19] Naruya Saitou and Masatoshi Nei. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4):406–425, 1987.

[20] Alexei Skourikhine. Phylogenetic tree reconstruction using self-adaptive genetic algorithm. In BIBE '00: Proceedings of the 1st IEEE International Symposium on Bioinformatics and Biomedical Engineering, pages 129–134, Washington, DC, USA, 2000. IEEE Computer Society. ISBN 0-7695-0862-6.

[21] Alexandros Stamatakis. An efficient program for phylogenetic inference using simulated annealing. In 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05), page 198.2, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2312-9.

[22] Rasmus K. Ursem. Models for Evolutionary Algorithms and Their Applications in System Identification and Control Optimization. PhD thesis, Department of Computer Science, University of Aarhus, April 2003.

[23] Ron Wehrens, Ernö Pretsch, and Lutgarde M. C. Buydens. Quality criteria of genetic algorithms for structure optimization. Journal of Chemical Information and Computer Sciences, 38(2):151–157, 1998.

[24] Bang Ye Wu, Kun-Mao Chao, and Chuan Yi Tang. Approximation and exact algorithms for constructing minimum ultrametric trees from distance matrices. Journal of Combinatorial Optimization, 3(2):199–211, 1999.

List of Tables

5.1 Example of using roulette wheel selection
5.2 The selection probabilities of the example in figure 5.6
5.3 The various GA operators
7.1 Time used for tuning
7.2 Running times of impatient and one-fifth rule
7.3 Running time for CR1, CR2 and SA
7.4 Unique individuals in SA, CR1 and CR2
D.1 Results and time usage for the neighbor joining method
D.2 A4_EXTRA distance matrix
D.3 AAL_decarboxy distance matrix
D.4 ABA_WDS distance matrix
D.5 ABG_transport distance matrix
D.6 Adeno_E1A distance matrix
D.7 Adeno_E1B_55K distance matrix
D.8 Adeno_E3_14_5 distance matrix
D.9 Auto generated distance matrix

List of Figures

1.1 An introductory phylogenetic tree
2.1 Representing multifurcating rooted trees
2.2 Counting phylogenetic trees
2.3 The Rastrigin benchmark function
2.4 The difference between exploration and exploitation
2.5 A few relevant search heuristics
3.1 Topology represented by distance matrix and Prüfer sequence
3.2 An example tree topology
4.1 Illustration of the neighbor joining method
5.1 Overview of a genetic algorithm
5.2 Abstract illustration of the tree search space
5.3 Knowledge-based approach to create the initial population
5.4 Example of 3-way tournament selection
5.5 Example of the roulette-wheel selection
5.6 Example of ranking selection
5.7 Illustration of PDG recombination
5.8 The mutation operator Swap
5.9 The NNI method
5.10 The mutation operator SPR
6.1 Parameter control of a genetic algorithm
6.2 The individuals used in self-adaptation
7.1 Tuning tournament selection on ABA_WDS
7.2 Tuning tournament selection on A4_EXTRA
7.3 Tuning ranking selection on A4_EXTRA
7.4 Tuning ranking selection on Adeno_E1A
7.5 Tuning NJ-based population on Adeno_E1A
7.6 Tuning NJ-based population on A4_EXTRA
7.7 Tuning ranking selection on A4_EXTRA again
7.8 Results of one-fifth and impatient rule on A4_EXTRA
7.9 Results of one-fifth and impatient rule on Adeno_E1A
7.10 Self-adaptation on Adeno_E1A with random initialization
7.11 Self-adaptation on Adeno_E1A with NJ-based initialization
7.12 Running self-adaptation, CR1 and CR2 on Adeno_E1B_55K
7.13 Running self-adaptation, CR1 and CR2 on Adeno_E3_14_5
7.14 Running self-adaptation, CR1 and CR2 on AAL_decarboxy
7.15 Running SA, CR1 and CR2 on ABG_transport
8.1 Representing topologies in a three-dimensional coordinate system
8.2 Mutation strength
8.3 Parallel multi-population genetic algorithm
C.1 Parameter setting in genetic algorithms
D.1 The phylogenetic tree of the generated matrix

List of Algorithms

2.1 Pseudo code describing the genetic algorithm
A.1 Prune-Delete-Graft recombination
A.2 NeighborJoining, the neighbor joining algorithm
A.3 Join, helper function for NeighborJoining
A.4 Encoding a Prüfer sequence
A.5 Decoding a Prüfer sequence