The origin, evolution and structure of the protein world

Author: Reynold Berry
Biochem. J. (2009) 417, 621–637 (Printed in Great Britain)


doi:10.1042/BJ20082063

REVIEW ARTICLE

The origin, evolution and structure of the protein world

Gustavo CAETANO-ANOLLÉS*1, Minglei WANG*, Derek CAETANO-ANOLLÉS*† and Jay E. MITTENTHAL†

*Department of Crop Sciences, University of Illinois at Urbana-Champaign, 1101 W. Peabody Drive, Urbana, IL 61801, U.S.A., and †Department of Cell and Developmental Biology, University of Illinois at Urbana-Champaign, 601 S. Goodwin Avenue, Urbana, IL 61801, U.S.A.

Contemporary protein architectures can be regarded as molecular fossils, historical imprints that mark important milestones in the history of life. Whereas sequences change at a considerable pace, higher-order structures are constrained by the energetic landscape of protein folding, the exploration of sequence and structure space, and complex interactions mediated by the proteostasis and proteolytic machineries of the cell. The survey of architectures in the living world that was fuelled by recent structural genomic initiatives has been summarized in protein classification schemes, and the overall structure of fold space explored with novel bioinformatic approaches. However, metrics of general structural comparison have not yet unified architectural complexity using the ‘shared and derived’ tenet of evolutionary analysis. In contrast, a shift of focus from molecules to proteomes and a census of protein structure in fully sequenced genomes were able to uncover global evolutionary patterns in the structure of proteins. Timelines of discovery of architectures and functions unfolded episodes of specialization, reductive evolutionary tendencies of architectural repertoires in proteomes and the rise of modularity in the protein world. They revealed a biologically complex ancestral proteome and the early origin of the archaeal lineage. Studies also identified an origin of the protein world in enzymes of nucleotide metabolism harbouring the P-loop-containing triphosphate hydrolase fold and the explosive discovery of metabolic functions that recapitulated well-defined prebiotic shells and involved the recruitment of structures and functions. These observations have important implications for origins of modern biochemistry and diversification of life.

INTRODUCTION

Protein molecules are vital components of life. Together with functional RNA, they are primarily responsible for the many biological activities of the cell. Proteins define the enzymatic chemistries and transport processes characteristic of metabolic pathways, regulate gene expression and many other molecular functions, are involved in signal transduction, and make up the actual molecular and cellular machinery that fuels life. They are highly diverse and embed hierarchically many layers of molecular organization. Their evolution is complex and constrained by aspects of molecular structure, thermodynamics and function. In the present review, we examine the structure of the modern protein world and discuss how evolutionary genomics and structural bioinformatics have helped to dissect the origin and history of modern proteins. We also discuss how the discovery of structure and function in the contemporary protein world has affected the distribution of molecules in proteomes.

PROTEIN STRUCTURE

Polypeptide chains fold into highly ordered architectures that embed protein function. These folds minimize the energy conformations of individual amino acid residues in the chain, maximize hydrogen-bonding of polar groups and form compact and well-packed 3D (three-dimensional) atomic structures that bury hydrophobic residues away from the aqueous environment. Physically, they represent spatial arrangements of more or less wound helices that locally distort the bond geometry of the polypeptide backbone (3₁₀-helix, α-helix, π-helix and polyproline II-helix) and extended chain segments called β-strands that can establish long-range interactions and form β-sheets in parallel and antiparallel arrangements (often curved into open and closed barrel structures). These conformational elements (‘helical’ and ‘sheet’), originally proposed by Pauling and Corey [1] as building blocks of proteins, are defined fundamentally by hydrogen-bonding interactions of closely or distantly related regions of the polypeptide chain and make up approx. two-thirds of protein structure. They are separated by loop regions (turns and coils), which are more or less rigid stretches of the backbone that delimit their direction in space. An example is the β-hairpin, a reverse turn that links two adjacent strands and forms an antiparallel β-sheet. Linderstrøm-Lang and Schellman in the 1950s [2] realized that protein structure was hierarchical and proposed four levels of structural organization (complexity): (i) primary structure, the sequence of amino acids linked by peptide bonds; (ii) secondary structure, the hydrogen-bonding patterns that give rise to helix and sheet elements in the fold; (iii) tertiary structure, the actual fold of the molecule stabilized mainly by side-chain interactions of elements of secondary structure; and (iv) quaternary structure, the aggregation of separate polypeptide chains into a supramolecular biological unit. Function fully manifests when these four levels of complexity are achieved. The recognition that aspects in the structure of proteins are redundant and modular (see below) led to the addition of new levels: (i) supersecondary structures, recurrent motifs of secondary structure, such as α-α-hairpins, β-β-hairpins

Key words: evolution, fold superfamily, organismal diversification, protein domain, proteome, tripartite world.

Abbreviations used: CATH, Class, Architecture, Topology, and Homologous superfamily; F, fold; FF, fold family; FSF, fold superfamily; HMM, hidden Markov model; Hsp, heat-shock protein; HSR, heat-shock response; nd, node distance; PDUG, protein domain universe graph; SCOP, Structural Classification of Proteins; 3D, three-dimensional.

1 To whom correspondence should be addressed (email [email protected]).

© The Authors Journal compilation © 2009 Biochemical Society

Figure 1  Levels of molecular organization in the protein world

The hierarchy and complexity of proteins is illustrated with the ATP synthase complex, a highly abundant protein ensemble responsible for ATP synthesis in the cell. The 600 kDa complex can be separated into two subunits, F1 and Fo, which can be studied individually. Transmembrane proton gradients drive rotation of the C-subunit ring of Fo, which then propels rotation of the central stalk and the F1 head of the complex. This rotation causes conformational changes in F1 active sites that result in ATP synthesis/hydrolysis. The α-subunit of the complex has three domains, the central of which has a P-loop hydrolase fold that is highlighted in the Figure. Different levels of structure occur at different levels of resolution (scale in Å, where 1 Å = 0.1 nm) and at different rates in evolution. For example, the discovery of ∼2 × 10²⁴ total sequences (considering homogenization of sequence diversity at population level; see the text) suggests that new genotypes are produced on Earth at a level of fractions of a microsecond. A similar argument can be used with secondary structure. The average length of a helical segment is 10 ± 2 residues and of a sheet element 5 ± 1 residues, and their average number is 6 ± 2 and 7 ± 3 per protein chain respectively (see Supplementary Table S1 at http://www.BiochemJ.org/bj/417/bj4170621add.htm). Then the maximum number of possible permutations of elements of different length (∼7 residues) in groups of five to ten is 2.8 × 10⁸ (i.e. 7¹⁰). If all these permutations are accessible, the rate of maximum discovery approximates 0.1 structural arrangement per year. Rates of discovery at higher structural levels use frequentist arguments that relate the number of known structural arrangements to time. For example, the discovery of the ∼4 × 10⁴ distinct domains indexed in SCOP [82,83] spread uniformly over a ∼4 billion year timeline renders a frequency of discovery of one domain every 10⁵ years. PDB codes used: 1BNF and 1QO1.
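The frequentist estimates in the legend of Figure 1 can be checked with a few lines of arithmetic. The sketch below is only an illustration under the legend's own assumptions (events spread uniformly over a ∼4-billion-year timeline); the function and variable names are hypothetical and the Julian year length is our choice, not the authors'.

```python
# Back-of-the-envelope check of the discovery rates quoted in the Figure 1 legend.
# Assumption (from the legend): events are spread uniformly over ~4 billion years.

SECONDS_PER_YEAR = 365.25 * 24 * 3600   # Julian year, an assumption of this sketch
HISTORY_YEARS = 4e9                      # ~4-billion-year history of life


def mean_interval_years(n_events, span_years=HISTORY_YEARS):
    """Average waiting time (in years) between events spread uniformly over a time span."""
    return span_years / n_events


# Sequence level: ~2 x 10^24 total sequences.
sequence_interval_seconds = mean_interval_years(2e24) * SECONDS_PER_YEAR

# Secondary-structure level: ~7 element lengths arranged in groups of up to ten.
max_permutations = 7 ** 10
secondary_rate_per_year = max_permutations / HISTORY_YEARS

# Domain level: ~4 x 10^4 distinct SCOP domains.
domain_interval_years = mean_interval_years(4e4)

print(f"one new sequence every {sequence_interval_seconds:.1e} s")
print(f"{max_permutations:.1e} permutations, {secondary_rate_per_year:.2f} per year")
print(f"one new domain every {domain_interval_years:.0e} years")
```

Under these assumptions the sequence interval comes out well below a microsecond, the secondary-structure rate is slightly below the legend's rounded 0.1 per year, and the domain interval is 10⁵ years.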

and β-α-β-structures, sometimes repeated in tandem (e.g. in leucine-rich repeat proteins) or producing structures with or without internal symmetry (e.g. β-α-β-structures in TIM barrels) [3]; (ii) protein domains, compact units within the fold that act as structural modules and appear singly or in combination with other domains in multidomain proteins [4]; and (iii) multiprotein complexes, heteromultimeric assemblies of functionally related proteins that act as high-order functional units (e.g. molecular machines such as the ribosome, the proteasome and the dynein complex) [5]. Figure 1 describes how hierarchical complexity correlates with degree of molecular detail and time needed to develop each level of organization in evolution. Note, however, that levels of organizational complexity can be blurred by how structural elements are defined and new levels may arise with increased knowledge of protein structure and drivers of protein evolution. Similarly, the rate of change associated with structure is notoriously difficult to estimate. However, and within orders of magnitude, more complex structures arise through accumulation of many changes at lower levels and therefore take considerably more time to arise in evolution. Linderstrøm-Lang’s laboratory studies of protein denaturation and proteolysis and the accessibility of protein bonds that are buried also helped shape the idea that protein structure was highly dynamic [6]. This ultimately materialized in Anfinsen’s thermodynamic hypothesis of folding that postulates the native structure of a protein results from spontaneous refolding of denatured states [7] into the thermodynamically stable structure [8], linking primary to tertiary structure in proteins. It is now apparent that proteins indeed achieve native structure quickly and efficiently through a myriad of conformational changes that are influenced by the solvent [9–11]. 
This involves a complex interplay of simple pairwise and co-operative interactions that tend to stabilize protein structure towards its native state, a state in which frustration (conflicting interactions) is minimal and a ground energetic state and funnel topography dominates the local folding landscape (energy surface; Figure 2A). In reality, folding appears to materialize through a progressive organization of an ensemble of partially folded structures that resembles a
rugged funnel, with trajectories defined by a series of steps of local optimization that minimize the free energy accessible to the polypeptide chain and conflicting energy contributions from the relative position of individual residues and solvent [9–11]. This complex interplay sometimes occurs in the presence of kinetic traps that complicate the landscape, characteristic of rugged funnels (Figure 2A). Folding should be regarded as a transition of disorder to order in a global optimization process. The ‘zipping and assembly’ hypothesis captures for example the essence of this process, describing microscopic routes of folding that start from a polypeptide sequence and materialize in a time series of smaller and smaller conformation ensembles [12]. These routes involve local metastable structures in the protein chain that progressively assemble into more global ones. The energy landscape has inspired physics-based modelling with semi-empirical atomic force fields that fold molecules in a computer and provide ab initio understanding of the forces and dynamics that govern the folding process. Great progress has been made, for example with helical and β-hairpin peptides and small proteins, using modern force fields, Boltzmann sampling and/or other considerations [12–17]. Pathways of folding and unfolding derived from Molecular Dynamics simulations are now supported by experimentation with analysis of transition states, determination of intermediates with NMR spectroscopy and denatured states. One example is the folding and unfolding of the three-helix bundle protein from the Engrailed homeodomain of Drosophila melanogaster at atomic resolution [18,19]. 
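The picture of a rugged funnel with kinetic traps can be made concrete with a deliberately simple toy model. The sketch below is not a force field and is not the method of any study cited above; it assumes a one-dimensional 'conformation' coordinate, an invented energy function (a smooth funnel plus ripples) and standard Metropolis sampling. All names and parameter values are hypothetical.

```python
import math
import random


def funnel_energy(x, ruggedness=1.0, ripples=3.0):
    """Toy 1D energy surface: a smooth funnel (x**2) plus ripples that act as kinetic traps.

    Both terms are non-negative and vanish at x = 0, which plays the role of the native state.
    """
    return x * x + ruggedness * (1.0 - math.cos(2.0 * math.pi * ripples * x))


def metropolis_fold(start, steps, temperature, step_size=0.2, seed=0):
    """Metropolis walk on the toy surface; returns the final coordinate and the energy trajectory."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = random.Random(seed)            # seeded so that runs are reproducible
    x = start
    energy = funnel_energy(x)
    trajectory = [energy]
    for _ in range(steps):
        candidate = x + rng.uniform(-step_size, step_size)
        candidate_energy = funnel_energy(candidate)
        delta = candidate_energy - energy
        # Downhill moves are always accepted; uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            x, energy = candidate, candidate_energy
        trajectory.append(energy)
    return x, trajectory


final_x, energies = metropolis_fold(start=3.0, steps=500, temperature=0.5)
print(f"start energy {energies[0]:.2f}, lowest energy visited {min(energies):.2f}")
```

In this toy setting, raising `ruggedness` or lowering `temperature` makes uphill escapes from the ripples rarer, which is the qualitative behaviour attributed to kinetic traps in the text.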
This multidimensional energy landscape provides a statistical view of the energetics of protein conformations, but also manifests at evolutionary levels, as single-molecule structure variants (and associated conformational ensembles) generated by mutation are culled by natural selection and other evolutionary constraints (e.g. self-organization [20]). This landscape becomes evolutionary when fitness values are assigned to phenotypes (Figure 2B). Here, fitness embodies the advantageous contribution of individual molecules to the reproductive success of organismal lineages. When biopolymers mutate, they embark on an exploration of the space of possible sequences (defined by a Hamming metric

Figure 2  Current paradigms on folding and evolution of proteins

(A) Folding funnel-shaped representation of the energy landscape that describes the transition of protein conformations from disorder to order. The free energy of a protein is displayed as a function of the number of conformations at each energy level (density of states) that are derived from the partition function and describe the topological arrangements of the polypeptide chain in space that are possible. Proteins fold co-operatively by channelling protein folding intermediates (non-native states) downhill into the funnel and achieving the native state at its base, after avoiding the kinetic traps of the rugged landscape. Note, however, that proteins do not remain folded. Native proteins are slightly more stable than their denatured states so that they fold and unfold every few minutes, setting the pace for change in the funnel. (B) Mapping of sequence (genotype) space into structure (phenotype) space and then into fitness. The first map is many-to-one and unfolds by single mutational steps in sequence space. The second map assigns real numbers to structures given some function that distils the phenotype. (C) Evolutionary dynamic representation of protein evolution. The mapping of mutating protein sequences (genotypes) into structures (phenotypes) defines neutral sets in sequence space, ensembles of sequences that fold into a given native structure and neutral networks, subset with sequences that are tightly linked by series of single point mutations. Neutral nets corresponding to four different protein folds are coloured differently in a planar representation of the multidimensional space of sequences. Mutation causes sequences to drift along the neutral nets. However, the search for thermodynamic and kinetic folding optimality (described by a third dimension) tailors evolutionary trajectories keeping them within space attractors for individual folds (illustrated as funnels). 
When mutational trajectories (paths in the graph) reach new neutral networks, new attractors and new folds are discovered. An animated version of this Figure can be seen at http://www.BiochemJ.org/bj/417/0621/bj4170621add.htm.

of elemental mutational moves) and associated structures and functions. This exploration takes the form of adaptive walks in sequence space that optimize thermodynamic, kinetic and mutational features in molecules. The mapping of sequence (genotype) into structure (phenotype) has been shown to be tractable in RNA [21] and also in proteins [22] and has three fundamental properties: (i) there are many more sequences than structures (i.e. the sequence-to-structure map is highly degenerate); (ii) few common, but many rare, structures materialize in structure space; and (iii) extensive neutral networks that percolate sequence space define common structures and structural neighbourhoods [23,24]. Within these neutral networks (anticipated by Maynard Smith [25] and in response to Salisbury [26]), structure is impervious to mutational change at the sequence level and, because the distribution of sequences that fold into the same structure (shape) is approximately random, the mapping has ‘shape space covering’ properties. This means that all structures can materialize (are accessible) within relatively few mutational changes in sequence space. This property is especially true for RNA and has been confirmed experimentally using functional molecular switches that have been engineered by in vitro evolution [27]. The existence of neutral networks and shape space covering has also been predicted for polypeptides [28], paraphrasing conclusions from lattice models with simplified alphabets [29– 31]. In these studies, independent adaptive walks in sequence space can produce a given structure despite lacking significant sequence similarity, matching the recurrent observation that sometimes seemingly unrelated sequences can harbour a given fold [32]. At the same time, and because of shape space covering, sequences that fold into completely different structures may differ by a few critical amino acid residues. 
Consequently, extensive neutral networks enable the efficient exploration of sequence space, whereas shape space covering ensures a constant rate of structural discovery. These properties match, for example, some recent in vitro evolution experiments [33] that show extensive regions in natural proteins exhibit functions refractory to
mutational change [34] and demonstrate that discovery of function in random peptide libraries is facile (e.g. [35,36]). However, the sequence-to-structure mapping of proteins is much more complex and its landscape is ‘holey’ when compared with RNA, with proteins folding into native states missing in vast segments of sequence space. Although the neutrality of protein sequence space is much higher than that of RNA (> 90 % of single amino acid substitutions are neutral [37]), protein structures appear to concentrate in dense clusters [38,39], whereas RNA structures spread through sparsely connected networks [40,41]. Under a ‘superfunnel’ paradigm supported by experimentation [42,43], protein sequences drift along neutral networks and are sometimes trapped into funnels (Figure 2C), defined by sequences that are mutationally more stable (they tolerate the largest number of mutations) and, at the same time, are thermodynamically more stable [37,39]. These ‘attractors’ in neutral space are sometimes replaced by more fit ones through smooth transitions mediated by excited states that tend to occur between similar structural phenotypes and genotypes [44] (Figure 2C). Smooth adaptive walks of this kind [25] explain enzyme promiscuity [45] and reconcile recent experiments that show that proteins optimized for novel function arise before the original function is lost [46,47]. They can also explain gene duplication and divergence and the effect of epistatic thresholds of stability that buffer the effects of deleterious mutations on fitness [48]. The superfunnel paradigm therefore links the energetic landscape of folding with the evolutionary dynamics of molecules in percolating neutral networks. Selection for compact and stable fold architectures is also maintained at higher levels of organization by more complex cellular infrastructure, which adds further evolutionary constraints on protein architecture. 
For example, the HSR (heat-shock response) is a fundamental cytoplasmic mechanism common to the three domains of life (Archaea, Bacteria and Eukarya) [49]. When subjected to temperature increases, five groups of Hsps (heat-shock proteins) are induced (Hsp100s, Hsp90s, Hsp70s,
Hsp60s and small Hsps). These include chaperones, proteases, ATPases and DNA-repair proteins that mend damage and mediate non-covalent folding, unfolding, assembly and disaggregation of proteins. Hsp synthesis is of crucial importance for thermotolerance of organisms such as hyperthermophilic archaea, which appears to exhibit minimal, but highly tailored, protein-folding systems [50]. Within these groups of molecules, chaperonins are megadalton ring assemblies that mediate ATP-dependent protein folding to the native state (e.g. the bacterial chaperonin GroEL and its co-chaperonin GroES) through complex allosteric mechanisms [51]. Prefoldins deliver nascent unfolded proteins to these cytosolic chaperones as they exit the ribosome, establishing specific interactions with actin and tubulin in eukaryotes [52]. Since proteins exhibit a generic tendency to aggregate in the high macromolecular concentrations of intracellular compartments (molecular crowding) [53], proteins that unfold or remain unfolded are tagged and degraded by the ubiquitin–proteasome proteolytic pathway [54]. However, in eukaryotes, more complex systems guarantee the correct folding of a protein. These proteostasis control systems regulate protein concentration, the conformation of folds and complexes, and cell localization [55–57]. They involve interactions between the folding polypeptide and cellular components such as chaperones, co-chaperones, folding enzymes and components of small-molecule metabolism that stabilize the folded state and stress sensors, including the HSR and the UPR (unfolded protein response) [57]. This adds an additional layer of complexity to the already difficult folding problem and additional evolutionary constraints to the discovery of protein architecture.
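The sequence-to-structure mapping discussed in this section (a many-to-one map, a Hamming metric of single point mutations, neutral sets and neutral networks) can be illustrated with a toy model. The sketch below assumes a two-letter alphabet and an invented 'structure' function that simply collapses runs of identical letters; it is not a folding model and does not reproduce any of the lattice studies cited. All names are hypothetical.

```python
from collections import deque
from itertools import groupby, product

ALPHABET = "HP"  # assumed two-letter (hydrophobic/polar) alphabet


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def toy_structure(seq):
    """Invented phenotype: collapse runs of identical letters (e.g. HHPPH -> HPH)."""
    return "".join(letter for letter, _ in groupby(seq))


def single_mutants(seq):
    """All sequences at Hamming distance 1 from seq."""
    for i, old in enumerate(seq):
        for new in ALPHABET:
            if new != old:
                yield seq[:i] + new + seq[i + 1:]


def neutral_networks(length):
    """Map each toy structure to the connected components of its neutral set."""
    neutral_sets = {}
    for letters in product(ALPHABET, repeat=length):
        seq = "".join(letters)
        neutral_sets.setdefault(toy_structure(seq), set()).add(seq)
    networks = {}
    for structure, members in neutral_sets.items():
        unseen = set(members)
        components = []
        while unseen:
            root = unseen.pop()
            component, queue = {root}, deque([root])
            while queue:  # breadth-first search over neutral single mutants
                for mutant in single_mutants(queue.popleft()):
                    if mutant in unseen:
                        unseen.remove(mutant)
                        component.add(mutant)
                        queue.append(mutant)
            components.append(component)
        networks[structure] = components
    return networks


networks = neutral_networks(4)
print(len(networks), "structures for", 2 ** 4, "sequences")
print(networks["HP"])
```

Even in this small example there are more sequences than structures, and the sequences sharing a structure are linked by single mutational steps, which is the defining property of a neutral network.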

PROTEIN DIVERSITY AND THE HIERARCHICAL STRUCTURE OF THE PROTEIN WORLD

Proteins are covalently bonded linear heteropolymers made up of 20+ amino acid monomers with a specific primary sequence of side chains spaced at regular intervals. From this perspective, the roughly 10³–10⁵ protein sequences per genome that are encoded in the genomes of the ∼10⁷–10⁸ species that exist on Earth [58], most of which are microbial [59], cover necessarily only a minute fraction (∼10¹⁰–10¹³ variants) of the enormous permutational space defined by amino acid sequence (∼10³²¹–10⁴⁶⁹ possible arrangements in sequence space), given recent estimates of average protein length in genomes [60,61]. In these calculations we assume there is no intraspecies variation, even though it is unlikely that members of a given reproducing population will be identical. In fact, if we consider that the (4–6) × 10³⁰ prokaryotic microbial cells in our planet (which account for ∼70 % of life in certain habitats) have turnover rates of ∼8 × 10²⁹ cells per year [59] and that mutations in proteins occur in clock-like fashion at rates of ∼4 × 10⁻⁷ per microbial cell and per generation [62], then we would expect an upper boundary of ∼2 × 10³² total mutational amino acid changes in microbial proteins in the ∼4-billion-year-long history of life, which is still a minute fraction of sequence space (even if these concentrate in sequences that fold successfully). This limited molecular exploration of sequence space has nevertheless encountered considerable diversity at higher levels of structural organization, mostly because of the neutral net and shape space covering properties we discussed above. Whereas the rate of discovery of new sequence genotypes on Earth appears to occur at incredible pace and generate considerable sequence diversity, rates at higher levels of protein organization decrease progressively and in a substantial manner (Figure 1). Sequence variants develop within fractions of microseconds. However, secondary structure variants take considerably longer to be discovered,
whereas complex 3D structures arise once in hundreds of millions of years. This is an expected outcome. Higher structural levels are linked directly to function and are therefore the subject of natural selection and strong evolutionary constraint [63,64]. Sequence genotypes have a limited alphabet and change constantly by mutation, making them poor repositories of molecular history. In fact, the repeated accumulation of substitutions in nucleotide sites (site saturation) can erase evolutionary history at intermediate and deep evolutionary timescales [65–67]. In contrast, structural phenotypes have more complex alphabets that define function directly or through interactions of substructural, molecular and supramolecular components that are collectively responsible for function (e.g. in molecular ensembles), all of which are often carefully culled by natural selection. The effects of selection are consequently stronger at this level than at the genotype level and structural phenotypes are generally left unchanged over short, intermediate and long timescales. However, proteins evolve at vastly different rates, and recent studies suggest that this is due to differences in expression levels, functional roles and intra- and inter-molecular interactions [43,68–73]. For example, increases in the density of contacts (fraction of buried sites) in domains or entire proteins tend to increase evolutionary rates [73]. In contrast, increases in the number of binding interfaces of multiinteracting proteins tend to decrease rates [71]. Interestingly, positively selected amino acid sites were found preferentially located on the exposed surface of proteins [72]. Within individual proteins, different regions of the molecules are differentially constrained and were found to be quantitatively stable over billions of years of divergence [74]. Most notably, active sites and residues important for structural maintenance tend to evolve slowly and were refractory to mutation. 
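The sequence-space estimates given at the start of this section can be reproduced in log₁₀ units, which avoids numbers too large for ordinary floating-point arithmetic. The sketch below is an illustration only: it assumes exactly 20 amino acids, and the protein lengths it reports are back-calculated from the quoted exponents rather than taken from the cited genome surveys. Function names are hypothetical.

```python
import math

AMINO_ACIDS = 20  # assumption: the 20 standard amino acids only


def log10_sequence_space(length):
    """log10 of the number of possible sequences (20**length) for a chain of this length."""
    return length * math.log10(AMINO_ACIDS)


def implied_length(log10_space):
    """Chain length whose sequence space has the given log10 size."""
    return log10_space / math.log10(AMINO_ACIDS)


def log10_extant_sequences(proteins_per_genome, species):
    """log10 of proteins-per-genome times species, ignoring intraspecies variation."""
    return math.log10(proteins_per_genome) + math.log10(species)


low = log10_extant_sequences(1e3, 1e7)    # lower bound on extant variants
high = log10_extant_sequences(1e5, 1e8)   # upper bound on extant variants
print(f"extant variants: 10^{low:.0f} to 10^{high:.0f}")
print(f"implied lengths: {implied_length(321):.1f} to {implied_length(469):.1f} residues")
print(f"largest fraction sampled: 10^{high - 321:.0f}")
```

With these inputs the extant repertoire spans 10¹⁰ to 10¹³ variants, and the quoted space of 10³²¹ to 10⁴⁶⁹ arrangements corresponds to chains of roughly 250 to 360 residues.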
However, the relationship between protein conservation and function is complex, especially when molecular redundancy, strength of natural selection and genome structure are taken into consideration [75]. Advances in comparative and structural genomics offer unprecedented opportunities to understand proteomic complexity and provide insights into the diversity and structure of the protein world [76]. The number of protein sequences and structures has expanded significantly in the last few years and its organization is clearly hierarchical (Figure 3). There are currently (as of November 2008) 875 completely sequenced genomes contributing to ∼ 6 million sequences. However, only a fraction of sequences are well annotated, and the number of unique entries at lower levels of structural organization continues to increase exponentially; the protein world remains uncharted at these levels [77]. In contrast, the number of new folds that are encountered every year is decreasing considerably, supporting the idea that the repertoire of architectural designs is finite (perhaps ∼ 1500 folds). A recent attempt to recreate all possible protein folds by ab initio folding from short homopolymeric sequences revealed all constructs matched folds in solved structures, and vice versa; all natural single-domain structures had analogues in the model set [78]. This suggests that our knowledge of single-domain folds is probably complete. In order to make sense of increasing information, a number of bioinformatic strategies of sequence and structural comparison have led to the creation of a wide range of protein classification schemes, all of which aim to group evolutionarily related proteins [79]. These catalogues organize sequences and proteins of known structure (currently described by ∼ 54 000 Protein Data Bank entries) into taxonomies in an attempt to provide a global evolutionary view of the protein world. 
The first taxonomies described were originally based on the concept of the protein domain [80] and most modern classifications are still organized around this structural level [79]. This is predicated on the premise that domains are compact and

Figure 3  Progress in the experimental discovery of sequences and structures

(A) Protein architectures can be defined at different levels of protein hierarchy, using, for example, taxonomial classifications such as SCOP [82,83], with categories described with alphanumeric labels and identifiers. Currently, sets of approx. 1000 Fs, 1800 FSFs and 3500 FFs describe the world of proteins. (B) The continuous increase of the available numbers of sequences from the highly curated UniProtKB, protein structures from the PDB, and F, FSF and FF architectures in SCOP. The numbers of completely sequenced genomes that have been published (indexed in the Genomes Online Database [198]) have increased continuously from 1997 to 2007. Only the latest data were used if some databases had more than one release available in one year. Note that UniProtKB entries represent only a fraction of the ∼ 6 million sequences in UniProtKB/TrEMBL. (C) Proteins have physical structures that were designed and constructed by Nature (architecture) defined by the folding of the sequence at F, FSF or other levels of the structural hierarchy (domain structure) and by how domains combine with others (domain organization).

more-or-less independent globular folding elements, establish more abundant intradomain than interdomain residue contacts, and recur in different structural contexts (i.e. they act as modules, appearing singly or in combination with other domains). The recurrence concept is supported by a comparative framework on the basis of homology relationships and is fundamental. It defines the domain as an evolutionary unit of classification. Approx. 30 popular domain classifications based on sequence and/or structure are currently available; they use patterns and/or profiles in sequence to build libraries of domain families or establish distant relationships using structure comparisons. The Pfam database of multiple sequence alignments and HMMs (hidden Markov models), for example, is a comprehensive resource for the identification of domain families, repeats and motifs [81]. It provides two levels of curation, one based on automated domain sequence alignments (Pfam-B) and the other extended by HMM-based profile searches and literature analyses (Pfam-A), which serve as seeds for the iterative construction of HMMs. In contrast, SCOP (Structural Classification of Proteins) is a high-quality taxonomical resource that assigns domain boundaries manually at the structural level and applies the recurrence concept rigorously [82,83]. SCOP domains that are closely related at the sequence level (generally expressing > 30 % pairwise amino acid residue identities) are pooled into fold families (FFs), FFs sharing functional and structural features suggestive of a common evolutionary origin are unified further into fold superfamilies (FSFs), and FSFs that share similarly arranged and topologically connected secondary structures are grouped further into protein folds (Fs). Fs are then grouped into protein classes according to organization of secondary structure in the fold, defining the major α/β, α + β, all-α, all-β, small and multidomain groups. 
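The nesting of domains into FFs, FSFs, Fs and classes can be represented as a simple data structure. The sketch below assumes identifiers in the dotted 'class.fold.superfamily.family' form of SCOP concise classification strings (e.g. c.37.1.1); the class and function names are hypothetical, and the sample identifiers are illustrative inputs rather than a curated data set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopLabel:
    """One domain's position in the SCOP hierarchy (hypothetical container)."""
    protein_class: str   # e.g. "c"
    fold: str            # F,   e.g. "c.37"
    superfamily: str     # FSF, e.g. "c.37.1"
    family: str          # FF,  e.g. "c.37.1.1"


def parse_sccs(sccs):
    """Split a dotted identifier into its four nested hierarchical levels."""
    parts = sccs.split(".")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected class.fold.superfamily.family, got {sccs!r}")
    return ScopLabel(
        protein_class=parts[0],
        fold=".".join(parts[:2]),
        superfamily=".".join(parts[:3]),
        family=sccs,
    )


def census(identifiers):
    """Count the distinct F, FSF and FF categories present in a list of identifiers."""
    labels = [parse_sccs(s) for s in identifiers]
    return {
        "F": len({label.fold for label in labels}),
        "FSF": len({label.superfamily for label in labels}),
        "FF": len({label.family for label in labels}),
    }


sample = ["c.37.1.1", "c.37.1.8", "c.1.8.1", "a.4.5.1"]  # illustrative inputs
print(parse_sccs(sample[0]))
print(census(sample))
```

Because each level is a prefix of the level below it, a census of this kind can never report more Fs than FSFs, or more FSFs than FFs.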
This architectural hierarchy (Figure 3A) somehow mimics the relative numbers of sequences and structures that have been discovered (Figure 3B). Unlike SCOP, the CATH (Class, Architecture, Topology, and Homologous superfamily) classification of proteins uses expert systems that automate most
steps and classify domains that may or may not have been observed in other structural contexts [84,85]. CATH adds an additional hierarchical level (‘architecture’) over the fold classification (‘topology’) that describes the 3D arrangement of secondary structure but not its connectivity. A final example of structural classification is the DALI Dictionary, a fully automated non-hierarchical structural alignment system that uses domain recurrence to identify domains and provide lists of structural neighbours [86]. Interestingly, a comparative analysis of the SCOP, CATH and DALI taxonomies revealed remarkable agreement of protein assignments at fold (75 %) and superfamily (80 %) levels, with discrepancies attributed to different thresholds or manual curation [87]. In recent years, the different domain classifications have been consolidated by cross-listing and integration. For example, the InterPro consortium integrates protein classifications (including Pfam, SCOP and CATH) and maps protein families, domains, repeats and identifiable features of known proteins on to sequences in TrEMBL and Swiss-Prot [88]. Although taxonomies provide the framework needed to understand protein diversity, the definition of structure and associated functions remains challenging [89]. Protein architecture, the ‘fundamental build’ [the ἀρχι- (archi-) τέκτων (tekton)] of a protein, is modular (Figure 3C). Domains with different 3D structures (domain structures) combine with others in complex arrangements (defined here as domain organization). Domain structures associate with functions that are sometimes carried into the multidomain arrangement to increase enzyme specificity, provide links between other domains or regulate functional activity [90]. However, domain boundaries are difficult to establish and common topological elements that make up the folding core sometimes account for less than half of domain sequence [91].
Moreover, some CATH folds exhibit 3-fold variation in the number of secondary structures, and certain superfamilies show that secondary-structure embellishments often associate with change of function [91]. Peripheral regions of secondary structure can differ in size and conformation,
‘decorating’ the central folds of domains distinctively. Similarly, accretion of substructures around the core can result in functional diversity, as illustrated with the biochemistries that are linked to enzymes harbouring the thioredoxin-like fold [92]. To complicate matters, a measurement of how often fold substructures are shared by fold architectures (e.g. ‘gregariousness’) suggests some fold categories should be regarded as ‘neighbourhoods’ defined by how much structural overlap exists between them [93,94]. Some regions of the protein fold space therefore represent a continuum for certain architectural arrangements (sometimes linked by supersecondary motifs), whereas, in other regions, clearly distinct non-overlapping (discrete) topologies are observed. These regions can be best represented as a continuous and multidimensional environment [95]. Interestingly, detecting similarities between ligand-binding sites with a new structure–function comparison method tested the notion of a continuous fold space and revealed new evolutionary relationships across an existing discrete representation [96]. Finally, proteins can adopt multiple structures and functions, exhibiting conformational diversity and functional promiscuity. They can display ligand-independent conformational diversity, use structures to ‘moonlight’ different functions without involving their active sites or become promiscuous [45]. Chameleon sequences can adopt a distinct folded conformation under native conditions, and large-scale fold variations can alter topology in proteins [97]. This is complicated by the fluid nature of genome structure, which facilitates the rearrangement of domains [4,98]. These rearrangements are responsible for domains being both functional and structural subgenic modules that are highly plastic. Remarkably, protein structures are unevenly distributed in the world of proteins [99]. 
Genome surveys have shown that families and folds in genomes follow power-law distributions and exhibit scale-free properties [100–102]. This behaviour results in a few folds that are highly popular (‘superfolds’ with many families; e.g. TIM barrel folds are widely distributed in metabolism) and many that appear infrequently (‘mesofolds’ and ‘unifolds’) [103]. It also implies a preference for duplication of genes encoding folds that are already common, as summarized in models that account for duplication, acquisition and loss of genes [102] or describe birth–death–innovation processes [104–106]. Interestingly, fold frequency plots for the microbial superkingdoms Archaea and Bacteria have steeper decay slopes than those for Eukarya, showing that there is a higher level of architectural redundancy in the proteomes of complex organisms [99,107]. However, folds shared by all superkingdoms and folds shared by Eukarya and Bacteria (generally the most ancient; see below) fitted Gaussian-like distributions characteristic of random graphs, suggesting the spread of folds across superkingdoms is complex [107]. In order to explain the uneven proteomic distribution of structures, a number of phenomenological and physics-based models have been proposed that focus on functional constraints, convergence of sequences into structures (‘designability’), or evolutionary dynamic considerations, some of which invoke evolutionary processes of convergence or divergence and the paradigms described in Figure 2. They have been reviewed recently [108] and will not be discussed here. In particular, statistical-mechanical approaches to the evolution of simple lattice model proteins provide interesting insights into the workings of real proteins. Most notable is a recent microscopic ab initio model that considers not only the fate of genes, but also the survival of organisms [109].
The model is based on the central assumption that the death rate of an organism is determined by the stability of the least stable of its lattice model proteins. Simulations reveal exponential population growth once favourable sequence–structure combinations are discovered and collapse of these precursors into selected fold architectures, which
remain stable and abundant at timescales greater than organismal lifetime. The rise of protein families and superfamilies and the power-law distributions that match those of real proteins are emergent properties of the physical model, which suggests that new folds result from dominant folds by satisfying energetically favoured native conformations. This is provocative and clearly in line with emerging views of protein folding and evolution.
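The power-law behaviour described above can be checked on a fold census with a simple log–log regression. The sketch below is only illustrative: the function name, the input format (one family count per fold) and the toy numbers are assumptions, and a least-squares slope is a crude estimator that the studies cited above do not necessarily use.

```python
import math

def fit_power_law_exponent(family_counts):
    """Estimate the slope of log(frequency) against log(family count).

    family_counts: one integer per fold, giving the number of families
    that fold contains (hypothetical input format).
    """
    # Tally how many folds share each family count.
    freq = {}
    for n in family_counts:
        freq[n] = freq.get(n, 0) + 1
    if len(freq) < 2:
        raise ValueError("need at least two distinct family counts")

    xs = [math.log(n) for n in sorted(freq)]
    ys = [math.log(freq[n]) for n in sorted(freq)]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Ordinary least-squares slope on the log-log scale.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Toy census (invented numbers): many folds with one family, few with many.
toy_census = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
slope = fit_power_law_exponent(toy_census)
```

A steeper (more negative) slope corresponds to a faster decay of the fold frequency plot, which is the quantity compared between superkingdoms in the text.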

PROTEIN EVOLUTION IN FOLD SPACE

Almost 150 years ago, in his seminal book, Charles Darwin established evolution by common descent as the dominant scientific explanation of biological diversity and change [110]. The divergence of species was illustrated with branching histories of inheritance (phylogenies) that allowed inference of ancestral links and tested evolutionary hypotheses. Phylogenetic thinking remains fundamental in evolutionary bioinformatics today and diversity and change are still illustrated with phylogenetic trees, graphical and mathematical representations (with branches and reticulations) that portray how contemporary entities are related by common ancestry. These trees have been particularly useful in the comparative analysis of nucleic acid and protein sequences and have had an impact on each and every discipline of biology, including genome science and informatics. They seed a holistic future [111]. The evolutionary classification of protein domains has been based on sequence and structural homologies that make use of phylogenetic tools and advanced bioinformatic methods [79]. Protein families group together sequences that share a common ancestry, but generally do so with a low hierarchical granularity; the reliability of comparative methods breaks down when reaching the so-called ‘twilight zone’ of < 30 % sequence identity. However, change in protein structure is linked directly to change in biological function. This has been recognized by structural genomic initiatives that seek to characterize exhaustively the major building blocks of proteins, and both structure and function have aided phylogenetic analyses when sequences fail to unite distant family relatives. Evolutionary relationships have been inferred directly from the structure of protein molecules, generally using formal methods of phylogenetic reconstruction [112–117]. These methods have been limited to analysis of closely related architectures with backbones and secondary structures that can be more or less superimposed.
However, global views of the protein world that establish evolutionary relationships at superfamily or fold level require more involved and systematic approaches of classification. One strategy is to compare all proteins with each other and plot relationships on existing protein fold space, with structural similarities visualized at low dimensional level [118]. For example, Gauss integrals that describe protein backbones as space curves were used to construct a 30-dimensional vector that was then projected on a plane, producing 2D (two-dimensional) maps with fold distributions matching CATH classification [119]. These maps divide structures belonging to α, β and αβ classes of CATH into distinct groups. Similarly, matrices of DALI alignment scores in pairwise backbone comparisons of SCOP folds produced 3D representations that clustered folds belonging to the α/β, α + β, all-α and all-β protein classes [120] and allowed construction of a structural map [121]. Note, however, that a simple plot of overall length of helical segments against strand segments was able to dissect these classes without resorting to complicated algorithms (Figure 4A and see Supplementary Figure S1 and Supplementary Table S1 at http://www.BiochemJ.org/bj/417/bj4170621add.htm and an animated version of Figure 4(A) at http://www.BiochemJ.org/bj/417/0621/bj4170621add.htm). Typically, α/β folds have interspersed helix and strand secondary

Figure 4

Phylogenomic analysis and evolution of major structural classes of globular proteins

(A) Grouping of proteins in the all-α, all-β, α/β and α + β classes according to features of secondary structure. The average total length of segments of secondary structure in a peptide chain was calculated using DSSP [199] secondary structure assignments in proteins (61175 peptide chains) from all PDB entries in SCOP version 1.69. These features were calculated from chains belonging to the same SCOP fold for all folds. Plots compared each feature of secondary structure with each other. The Figure shows only comparison of average total length of α-helical and β-strand segments. Averages are described in Supplementary Table S1 at http://www.BiochemJ.org/bj/417/bj4170621add.htm. An animated version of this Figure can be seen at http://www.BiochemJ.org/bj/417/0621/bj4170621add.htm. (B) Universal phylogenomic tree of architectures reconstructed from a genomic census of protein domain structure and organization. A tree of architectures describing the evolution of domains and domain combinations at F level was reconstructed from a protein census in 266 genomes [200]. The census involved identifying domains using advanced HMMs of structural recognition and SCOP as reference. The three evolutionary epochs of the protein world are overlapped to the tree and are labelled with different shades (architectural diversification, light green; superkingdom specification, salmon; organismal diversification, yellow) and follow previous definitions [149]. Terminal leaves are not labelled since they would not be legible. Branches in red delimit the birth of architectures after the appearance of the first architecture unique to a superkingdom (broken line). The Venn diagrams show occurrence of architectures in the three superkingdoms of life. Pie charts show superkingdom distribution of architectures belonging to the four major categories of domain organization. The onset of the big bang of domain combinations is indicated in the tree. 
(C) Cumulative frequency distribution plots describing the appearance of all-α, all-β, α/β and α + β protein classes with only one domain as well as all the domain combinations with two domains or more than two domains along the branches of the tree described in (B). The cumulative number was given as a function of distance in nodes from the hypothetical ancestor (nd). The inset shows details of the accumulation of ancient domains and domain combinations. Information on trees of proteomes and architectures, data matrices and tracing exercises can be found in the MANET (Molecular Ancestry Networks) database [193] (http://manet.uiuc.edu).

structures, α + β folds segregate these elements within the molecule, and all-α and all-β proteins are mostly composed of helical or strand elements respectively. These simple plots reveal that helical segments were generally longer in α/β folds than in their α + β counterparts and shorter in all-β proteins, with strand segments being shorter in all-α proteins, the implication of which will be discussed below. Unfortunately, global views place structures in a continuum space and obscure fundamental architectural differences and heterogeneities that discrete views can capture. Other strategies that lack these shortcomings are therefore useful, including the generation of fold family trees based on rules of structural transformation [122,123], taxonomies based on similarity of secondary structural arrangements [124]

and a PDUG (protein domain universe graph) representation of domains based on scores of structural similarity [125,126]. Some of these have captured salient natural features. For example, trees of secondary structures are in agreement with aspects of protein classification and suggest a simple mechanism of evolution that is in accord with a theory of folding based on the energetics of backbone hydrogen bonds [127]. The PDUG network representation of fold space is a graph that connects nodes (domains) with edges (structural similarities) in threshold-delimited clusters, and, similarly, captures the scale-free network topology that is typical of the protein world. However, problems associated with the systematic classification of structure at a topological level make it difficult, if not impossible, to find
a general metric of pairwise comparison that could be used for global analysis and would portray all complexities of structural organization [128]. One solution to this drawback is a ‘periodic table’-like construct that merges the use of rules with the comparative framework [129]. In this approach, proteins are compared and assigned to idealized fold representations, which describe molecules as layered systems of helical and sheet structures (with curl and stagger). The approach shifts the problem to finding appropriate definitions for the idealized constructs and understanding their evolutionary meaning through models of structural transformation.
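The graph construction behind a PDUG-like representation can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names, the domain labels and the similarity scores are invented, and the threshold is arbitrary; the published PDUG uses its own scoring and cut-offs [125,126].

```python
def build_similarity_graph(scores, threshold):
    """Connect two domains when their similarity score reaches the threshold.

    scores: dict mapping (domain_a, domain_b) -> similarity score
    (hypothetical input format). Returns an adjacency dict.
    """
    graph = {}
    for (a, b), score in scores.items():
        # Every domain becomes a node, even if it ends up isolated.
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph

def clusters(graph):
    """Return connected components (threshold-delimited clusters)."""
    seen = set()
    result = []
    for start in sorted(graph):
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        result.append(sorted(component))
    return result

def degree_distribution(graph):
    """Count how many nodes have each number of neighbours."""
    counts = {}
    for neighbours in graph.values():
        counts[len(neighbours)] = counts.get(len(neighbours), 0) + 1
    return counts

# Invented scores for five hypothetical domains d1..d5.
toy_scores = {("d1", "d2"): 12.0, ("d2", "d3"): 9.5,
              ("d3", "d4"): 3.0, ("d4", "d5"): 11.0}
toy_graph = build_similarity_graph(toy_scores, threshold=9.0)
```

On real data, the degree distribution returned by `degree_distribution` is the quantity one would inspect for the scale-free topology mentioned above; raising or lowering the threshold merges or splits clusters.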

PHYLOGENOMICS AND THE WORLD OF PROTEOMES

One fundamental limitation of most global approaches that try to unify fold space is that they do not embrace the ‘shared and derived’ tenet of evolutionary analysis. They are not truly phylogenetic. At present, there are no reliable procedures that can generate phylogenetic relationships at higher hierarchical levels of protein classification directly from the structure of proteins. Methods cannot yet reconstruct history because knowledge of how the ‘origami’ of protein folding evolves is still lacking. One solution to the conundrum of structure is to shift the focus of study from molecules to proteomes, the repertoire of all proteins of an organism. After all, proteins are encoded in the genomes of the many organisms that populate Earth. The rationale is simple. Proteins with structures that are fit will thrive in evolution. They will propagate in lineages through vertical descent, recruitment and convergent evolutionary processes, and their architectural designs will be used repeatedly in different contexts. Their history should be left imprinted in the actual fold constitution of the proteome, and a simple structural census of this historical repository should unlock the ‘tempo and mode’ of structural discovery. Here, we summarize the exciting findings that this novel approach has revealed. Structures corresponding to validated crystallographic 3D models, catalogued, for example, in SCOP and CATH, have been assigned effectively to sequences present in proteomes using knowledge from domain classification and sequence and structure comparison tools such as profile-based sequence PSI-BLAST alignments, linear HMMs of structural recognition, and threading techniques [79]. Fold architectures were initially surveyed in a number of genomes [130–134] and this genomic demographic census was then indexed in several popular databases (e.g. PEDANT [135], SUPERFAMILY [136,137], and Gene3D [138,139]).
The census is restricted to proteins for which a known structure can be inferred (currently, ∼ 60 % of the proteome), but it is powerful. It allows, for example, identification of SCOP FSF architectures corresponding to individual domains in enzymes of metabolism [140,141] and exploration of arrangements of domains in biological units [142,143]. Studies reveal patterns in both domain structure and domain organization and suggest, for example, pervasive recruitment of structures and functions in biological networks and an extended combinatorial interplay of domain modules in proteomic repertoires. The census also provides indications of how organisms in different superkingdoms make use of architectures, revealing that fold abundance and distribution of folds among genomes are unlinked [144]. Curiously, abundant protein domains occurred in proportion to proteome size in a survey of five eukaryotic genomes, suggesting functional constraints between interacting domains kept domains at specific ratios in evolution [145]. Since protein structure is highly conserved (Figure 1), every instance of discovery or adoption of an F or FSF architecture
by a proteome represents a rare event in the history of the organismal lineage, and globally a rare event in the history of the protein world. These ‘molecular fossils’ are therefore excellent features (characters) for phylogenetic analysis. Gerstein [132] recognized this a decade ago and used fold occurrence in genomes and distance-based methods to build trees of proteomes (see Supplementary Figure S2 at http://www.BiochemJ.org/bj/417/bj4170621add.htm). Since then, a number of trees of life of this kind have been reconstructed from the occurrence and abundance of domain structures in proteomes [107,132,134,146–149] and from surveys of domain organization [150,151], matching patterns obtained from other sources of genomic information [152]. In all cases, the three superkingdoms appeared as distinct groups, confirming the tripartite nature of cellular life heralded by the school of Carl Woese [153]. Phylogenomic trees showed patterns that were in agreement with traditional classification, and also tested contentious hypotheses. For example, they backed the controversial grouping of chordates with arthropods (the Coelomata hypothesis), an observation supported by whole-genome trees (e.g. [154]) and recently confirmed by an analysis of the complete collection of phylogenies of gene sequences in the human genome (phylome) [155]. Moreover, some of these phylogenetic methods identified a root for the universal tree [107,149,150] (see Supplementary Figure S2) and suggested that diversified life originated in a proto-eukaryotic organism, a proposal for which there is now an emerging consensus [156] and which is also supported by phylogenetic analysis of the structure of rRNA [157]. It is noteworthy that parsimony considerations based simply on the survey of protein repertoires suggest the ancestor to the three superkingdoms was endowed with a virtual proteome akin to Eukarya [61].
A simple Venn diagram shows that two or three superkingdoms share the majority of F or FSF architectures and supports this view (see Supplementary Figure S3 at http://www.BiochemJ.org/bj/417/bj4170621add.htm). Most importantly, the fact that phylogenomic trees were able to reconstruct the evolution of life satisfactorily supported the existence of strong phylogenetic signal in the occurrence, abundance and organization of domains in proteomes. Trees of proteomes, however, could not reveal patterns of diversification of protein architecture directly, unless characters were traced along branches of the trees. For example, when domain sequences and architectures from 62 genomes were traced along a universal consensus phylogeny derived from whole-genome trees, convergent evolutionary processes that could not be explained by architectural loss were found to be rare events (∼ 2 %) [158]. A recent study of Pfam domains in 96 genomes confirmed this important observation, although the number of convergent events in protein structural evolution was found to be larger (∼ 12 %) [159]. This suggests that protein structures at high levels of organization diversify mostly by vertical descent, empowering the phylogenetic reconstruction exercise. Tracing domain occurrence patterns in trees of proteomes derived from fold occurrence and abundance [160] or universal trees reconstructed from the small subunit of rRNA [161] also allowed the estimation of the relative age of individual folds and the antiquity of protein classes. The latter study assumes, however, that the history of a single (albeit central) RNA molecule and of proteomes is concordant, that there is a single origin of organismal superkingdoms, and that the bacterial outgroup chosen to root the universal tree is appropriate. As we will see below, all of these assumptions can be contentious.
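The kind of bookkeeping behind a Venn diagram of architectures, or a distance-based comparison of proteomes from fold occurrence, can be sketched as follows. The function names and the toy occurrence sets are assumptions made for illustration: the labels `x.1`, `x.2` and `x.3` are invented placeholders, the assignments are not real census data, and Jaccard distance is just one possible distance, not necessarily the one used in the studies cited above.

```python
def venn_partition(occurrence):
    """Group architectures by the exact set of superkingdoms that hold them.

    occurrence: dict mapping superkingdom name -> set of architecture
    labels (hypothetical input format).
    """
    regions = {}
    all_architectures = set().union(*occurrence.values())
    for arch in all_architectures:
        holders = frozenset(name for name, archs in occurrence.items()
                            if arch in archs)
        regions.setdefault(holders, set()).add(arch)
    return regions

def jaccard_distance(a, b):
    """1 - |intersection| / |union| for two sets of architectures."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

# Toy occurrence table; not real census data.
toy = {
    "Archaea": {"c.37", "c.1", "x.1"},
    "Bacteria": {"c.37", "c.1", "x.2"},
    "Eukarya": {"c.37", "c.1", "x.2", "x.3"},
}
regions = venn_partition(toy)
shared_by_all = regions[frozenset(toy)]
d_ab = jaccard_distance(toy["Archaea"], toy["Bacteria"])
d_be = jaccard_distance(toy["Bacteria"], toy["Eukarya"])
```

A matrix of such pairwise distances over many proteomes is the sort of input a distance-based tree-building method would take.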
In search of a direct approach and using a strategy that polarizes characters and builds rooted phylogenetic trees [157], we introduced a new phylogenetic method that generates timelines of architectural discovery and a global phylogenetic view of the protein world [107]. Data matrices that were used to build
trees of proteomes were transposed, normalized and used to reconstruct trees of architectures that were intrinsically rooted [107,149,150,162,163]. Evolution’s arrow was established directly by the evolutionary model, the rationale and assumptions of which have been reviewed recently within a framework of evolution of repertoires of components in living systems [164] and will not be revisited here. Supplementary Figure S3 shows the first tree of F architectures that was reported and examples of trees of F and FSF architectures reconstructed more recently using updated releases of SCOP and information in numerous proteomes. The leaves of the trees (taxa) are, in this case, domains (see Supplementary Figure S3) or domains and domain combinations (Figure 4B) visualized at F or FSF levels of classification. The rooted trees establish by definition evolutionary timelines of architectural discovery, with time measured by a relative distance in nodes from a hypothetical ancestor at their bases (node distance, nd). A timeline showing the rise of protein classes in evolution is described in Figure 4(C). These timelines reveal remarkable historical patterns in the structure of proteins and proteomes, and, as we describe below, define an origin for the modern protein world and illustrate how biological functions were discovered in time. We caution, however, that statements relate only to modern biochemistry, as modern molecules were used to reconstruct the past. Any claims of origin and evolution relate necessarily to the design and complexities of extant molecules, and not to those of predecessors that were perhaps lost in the evolutionary process.
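To make the notion of node distance concrete, the sketch below computes a relative distance in nodes from the root for every leaf of a small rooted tree. The tree encoding, the function name and the leaf labels are hypothetical, and the counting convention (internal nodes crossed after the root, rescaled so that the deepest leaf has nd = 1) is an assumption; the cited studies should be consulted for the exact definition.

```python
def node_distances(children, root):
    """Return a relative node distance (0 to 1) for each leaf.

    children: dict mapping each internal node -> list of its children;
    leaves do not appear as keys (hypothetical encoding).
    """
    crossed = {}
    stack = [(root, 0)]
    while stack:
        node, n_internal = stack.pop()
        kids = children.get(node, [])
        if not kids:
            # A leaf: record how many internal nodes separate it from the root.
            crossed[node] = n_internal
            continue
        # Passing through an internal node other than the root adds one step.
        step = 0 if node == root else 1
        for kid in kids:
            stack.append((kid, n_internal + step))
    deepest = max(crossed.values())
    return {leaf: (n / deepest if deepest else 0.0)
            for leaf, n in crossed.items()}

# Toy unbalanced (comb-like) tree with invented leaf labels.
toy_tree = {
    "root": ["fold_A", "n1"],
    "n1": ["fold_B", "n2"],
    "n2": ["fold_C", "fold_D"],
}
nd = node_distances(toy_tree, "root")
```

In this toy tree the leaf branching off first receives the smallest nd and the leaves in the most derived clade receive nd = 1, which is how a rooted tree translates into a relative timeline.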

THE RISE OF DIVERSIFIED PROTEOMES, MODULARITY AND CELLULAR LIFE

The most notable feature of every tree of architectures that has been generated so far is that F or FSF domains widely distributed in Nature appear at their base and are consequently ancient. They are only found to be lacking in parasitic organisms with highly reduced genomes (e.g. Nanoarchaeum, Mycoplasma and Encephalitozoon), organisms known to have discarded enzymatic and cellular machinery in exchange for resources from their hosts [149]. The first nine F architectures to emerge in evolution are nevertheless common to every genome analysed and include architectures widespread in metabolism [165]; the evolution and structures of the five most basal are illustrated in Figure 5. One likely interpretation of early evolution of ancient architectures using the neutral net paradigm described above (Figure 2C) is given in Figure 5(B). As protein sequences harbouring the primordial fold drift by mutation in sequence space, new neutral nets are discovered that fold sequences into new fold structures, while variants within ancient and new folds continue to be discovered in the original neutral nets in an ongoing exploration of more stable and fit variants. The comparison of trees of F and FSF architectures supports this view, revealing a collection of proteins undergoing divergent, but concomitant, evolutionary processes that translate into patterns of recent (close relationship) or ancient origin (distant relationship) [163]. This is a consequence of the hierarchical nature of protein structure and the limited exploration of sequence and structure space. We expect, as a corollary, a correlation between abundance and age of individual architectures and time-lapsed discovery of fold variants.
Indeed, the distribution of branch lengths (longer towards the base) and the unbalanced shape of phylogenomic trees (Figure 4B and see Supplementary Figure S3) strongly suggest that architectural discovery involved semipunctuated evolutionary processes similar to those recently suggested for substitutional change in nucleic acids [166].

Figure 5

Evolution of the five most ancient folds

(A) Phylogenetic relationships at the base of a phylogenomic tree of domain structures at the F level of structural organization [165] together with examples of the different domain architectures. All ancient architectures share a common design of α-helices and β-strands that form barrel or highly symmetrical structures. The structural models illustrate the 3D arrangement of helices (cyan) and strands (mauve) separated by turns and coils (brown). Structures included are: c.37, P-loop NTP hydrolase fold of adenylate kinase from Methanococcus thermolithotrophicus (PDB code 1KI9), depicting a putative enzymatic origin of metabolism; a.4, DNA/RNA-binding three-helical bundle from the Trp repressor mutant V58I protein (PDB code 1JHG); c.1, TIM β/α barrel fold of inosine 5′-monophosphate dehydrogenase from Borrelia burgdorferi (PDB code 1EEP); c.2, NAD(P)-binding Rossmann fold of glyceraldehyde-3-phosphate dehydrogenase from Escherichia coli (PDB code 1GAD); d.58, ferredoxin-like fold of 7-Fe ferredoxin from Azotobacter vinelandii (PDB code 7FD1). (B) One of many possible interpretations of early evolution of the five most ancient architectures using the neutral net paradigm (Figure 2C). Circles of different colours illustrate neutral nets corresponding to each fold and embedding mutational walks in sequence space responsible for extant structural diversity at F level of hierarchical organization. F neutral nets should map FSF neutral nets and two nets corresponding to two lower levels of hierarchy of protein structure.

When the representation of architectures in organisms in Archaea, Bacteria and Eukarya was traced along the evolutionary timeline, patterns of origin and evolution of our contemporary tripartite world were clearly revealed ([149] and M. Wang, unpublished work). Ancient architectures were multifunctional and were shared by many organisms (free-living or parasitic) in the three superkingdoms [107,149,162,163]. These common

Figure 6

The architectural and functional complement of the communal ancestor

The complement defines 78 FSFs that appeared before the first architecture that was completely lost in a superkingdom (Archaea) in the tree of FSF architectures (see Supplementary Figure S3B at http://www.BiochemJ.org/bj/417/bj4170621add.htm). FSFs were grouped according to coarse-grained functional SUPERFAMILY [201] categories and subcategories (peripheral pie) and according to major classes of globular proteins in SCOP (central pie).

architectures defined the so-called ‘architectural diversification’ epoch in protein evolution in which members of an ancestral community of organisms diversified their protein repertoires through differential loss (light-green-shaded area overlapping the tree of domain structure and organization described in Figure 4B). Remarkably, architectural loss occurred preferentially in organisms belonging to the lineages of Archaea, establishing the first organismal divide [149]. This reductive evolutionary strategy was protracted and perhaps induced by adaptation to the extreme physical conditions of early Earth. The early rise of Archaea matches recent evolutionary studies of the structure of tRNA [167] and universal trees of proteomes reconstructed from architectures identified to be ancient in the tree of architectures [149]. These trees of proteomes showed a rooting that was internal (paraphyletic) to the Archaea and was located between the Crenarchaeota and the Euryarchaeota, close to methanogenic archaeal species. This paraphyletic rooting is consistent with a mutational comparative analysis of tRNA paralogues that identified molecular species in the Archaea as slow-evolving and ancient [168] and the existence of ancestral genome characters such as split genes and operon organization [169]. It also has an impact on the interpretation of protein evolutionary tracing exercises that consider superkingdoms as evolutionarily unified groups, as these will identify not a single origin for proteins, but many [161]. A proposed multiple convergent (polyphyletic) origin of genes occurring after lineage diversification involving the modular reorganization of sequence [170] would, in fact, have the same effect. Nevertheless, these and many other lines of evidence suggest that Archaea is the most ancient lineage of the modern living world, an emerging view that is gaining consensus [156]. 
It is important to note that reductive tendencies in Archaea started at a time when superkingdom-specific architectures and present-day organismal lineages had not developed and life
was probably communal [149]. In fact, a substantial portion of the protein world developed during this time and resulted in complex proteomes that were rich in functions and architectures (Figure 6). These results are, for example, in line with profiles from phylogenetic tracing of enzymes linked to bioenergetic processes [171], an architectural census [172] and recent ancestral state reconstruction of the gene content of the universal ancestor [173] that revealed a bioenergetically and functionally complex genome with a gene complement similar in number to that of extant free-living microbes (reviewed in [156]). Following the architectural diversification epoch, superkingdom-specific and lineage-specific architectures appeared in evolution as the world of organisms expanded [149]. In this new ‘superkingdom specification’ epoch, new reductive tendencies expressed in Bacteria and the superkingdoms were specified in what we believe was a protracted process (salmon-shaded areas in Figure 4B). Moreover, architectural representation decreased considerably with time until it approached zero, a point at which a large number of new architectures were clustered, each specific to a small number of organisms. Later on, an opposite trend took place, in which architectures that were more specialized and were specific to relatively small sets of organisms increased their representation in proteomes explosively. This architectural ‘big bang’ (paraphrasing that of the universe) involved the multiple combination and rearrangement of domains (Figures 4B and 4C) and the distribution of resulting multidomain proteins among emergent organismal lineages. We will not discuss the evolutionary patterns and processes that underlie these processes since they have been discussed recently [98]. They involve, however, preferential additions and deletions of terminal domains and fusion and fission processes that engage (with bias) different domain modules in a combinatorial interplay. 
The rise of modularity in the protein world defines the ongoing ‘organismal diversification’ epoch (light yellow areas). During
this last period, architectural novelties linked to multicellularity appeared massively and quite late both immediately after microbe diversification events (mostly folds common to organismal domains) and during eukaryotic diversification (mostly Eukarya-specific) [149,162]. This included multidomain architectures known to be associated with programmed cell death, adhesion and recognition of cells [162]. Proteome distribution patterns along the timeline have had an impact on the constitution of present-day genomes, with the architectural repertoire being the largest and most diverse in Eukarya, and the smallest and most homogeneous in Archaea, with Bacteria taking an intermediate position (see the pie charts of Figure 4B). Remarkably, the diverse repertoire of the Bacteria superkingdom was by necessity compartmentalized into the small proteomes of individual organisms (L. S. Yafremava and G. Caetano-Anollés, unpublished work).

EVOLUTIONARY TIMELINES AND THE DISCOVERY OF ARCHITECTURES AND FUNCTIONS

There were many remarkable patterns linked to structure in the trees and resulting evolutionary timelines. The most ancestral folds harboured barrels [e.g. the TIM β/α-barrel fold (c.1)] or interleaved β-sheets and α-helical architectures that packed helices to one face [e.g. the ferredoxin-like fold (d.58)] or two faces [e.g. the P-loop-containing NTP hydrolases (c.37) and the NAD(P)-binding Rossmann fold (c.2)] of the central β-sheet arrangement [107,163]. These and other ancient architectures were multifunctional and interacted with organic cofactors [174], especially nucleotide-containing ligands such as ATP, ADP, GDP, NAD and FAD, all of which appear to have originated early in evolution according to a power-law distribution of ligand–protein mapping [175]. Architectures appearing later in the timelines were functionally more specialized and simpler, with structures that were increasingly small and compact (e.g. increases in the tilt of strands or the frequency of open barrel structures in the popular β-barrels [107]). At the same time, structures became more refined, as illustrated by barrel structures harbouring increasingly complex strand topologies. Many important structural designs were derived in the tree (including polyhedral folds in the all-α class and β-sandwiches, β-propellers and β-prisms in the all-β class), and protein transformation pathways describing likely scenarios of structural evolution [176,177] and other patterns could be traced in the trees [107]. Interestingly, all classes of globular protein architecture appeared very early in evolution and in a defined order, the α/β class being the first, followed by the α + β, the all-α and the all-β classes, and by small and multidomain proteins [107,163]. Patterns of origin and accumulation of protein classes were consistently revealed in all trees analysed, including those derived from a tree of domains and domain combinations (Figure 4).
A similar conclusion was reached when tracing fold occurrence along branches of proteome [160] and rRNA trees [161], and when studying the evolution of aminoacyl-tRNA synthetases [178]. We proposed that architectural designs with interspersed α-helical and β-sheet elements were segregated in the course of evolution, first within their structure (α + β class) and then confined to separate molecules (all-α and all-β classes) [107]. This is in accord with the random origin hypothesis of proteins [179]. Several interesting features distinguish the ancient α/β protein class from the rest. For example, topological accessibility measurements describing how easy it is to fold a structure from any point in the polypeptide chain showed a marked asymmetry toward the N-terminus in α/β folds, a property that was mostly confined to selected protein families [180]. Measurements of closeness to the molecular centroid and residue contact distribution also revealed the bias, which was more notable in ancient than in more recent α/β folds [181]. These observations were interpreted as evidence of ancient α/β folds predating chaperone-assisted folding and preserving the bias as a relic [180] or of unmasked co-translational folding in extant proteins [181]. Co-translational folding is the ability of proteins to fold as they exit the ribosome, but the process remains contentious. Interestingly, our evolutionary timelines revealed that FSF architectures linked to chaperone and proteostasis systems in the cell appeared early, with the ATPase domain of the Hsp90 chaperone (d.122.1), and then throughout the timeline (ndFSF = 0.06–0.86). However, the dominant families that contributed the most to the N-terminal bias appeared earlier (e.g. c.37 and c.2), supporting the idea that the origin of this asymmetry lies in the past. We found that α-helical segments were generally longer in α/β folds (Figure 4A), a trend that was especially notable in the early folds (see the animated version of Figure 4A). This could indicate that these helical segments were overrepresented in the ancient interspersed α/β structures. It is well known that single extended β-sheets are quite effective at burying non-polar surfaces when compared with α-helices [182]. Moreover, a surprising in vitro model study of co-translational protein folding suggests an initial tendency to form misfolded sheets in an all-α protein (apomyoglobin), a tendency that decreases with protein length and underscores the importance of co-translationally active chaperones [183]. Perhaps the length-dependent misfolding tendencies in non-native proteins left behind relics in the ancient α/β proteins that had to fold unassisted, in which longer helices helped balance the secondary structure repertoire.
Interestingly, a survey of hundreds of genomes reveals that domains are longer in very ancient proteins (M. Wang, unpublished work) and not shorter as claimed in a recent phylogenetic tracing study [161]. We therefore propose that longer helical segments provided an advantage in early protein evolution and were then slowly replaced by strands and a reduction in protein length once chaperone systems were in place. This scenario would explain the α-to-β tendencies uncovered in the tree of architectures [107]. Tracing biological function along the timeline revealed patterns of origin of fundamental cellular processes (Figure 7), confirming the very early and explosive onset of metabolism [149] and small-molecule-binding chemistries [175]. It appears that proteins were first associated with organic cofactors, but later involved transition metals as ligands, perhaps mediated by the increasing energy demands of the ancient world. Timelines revealed a relatively early rise of metallomes (with the zinc-metallome appearing first) (C. L. Dupont, G. Caetano-Anollés and P. E. Bourne, unpublished work), and the late appearance of oxygenic photosynthesis, which was preceded and followed by the discovery of functions typical of Eukarya (cell adhesion, receptors and chromatin structure, and functions linked to multicellularity). Some of these results are consistent with a proteomic analysis suggesting that shifts in trace metal geochemistry related to the redox state of ancient oceans are imprinted in protein architecture and that prokaryotes evolved in anoxic marine environments, whereas eukaryotes did so in oxic counterparts [184]. The late evolutionary appearance of oxygenic photosynthesis confirms results from a phylogenomic analysis of metabolic networks [149] and is consistent with molecular and geological records that suggest that oxygen entered our atmosphere after major microbial divergences in the tree of life [185].
All functional categories and most subcategories appeared for the first time during the architectural diversification epoch, lending credence to the complex nature of the ‘communal ancestor’ to diversified life [149]. In fact, the functional and structural diversity of its architectural complement (Figure 6) suggests that biological functions were geared fundamentally to metabolic activities, proteostasis and protein degradation, and, as expected, were embodied mostly in α/β protein architectures. Major subcategories pooled transferases, nucleotide metabolism and small-molecule-binding enzymes, matching recent metabolic network investigations [165]. Coenzyme, carbohydrate and energy metabolisms also featured prominently. These cells also had architectures involved in an incipient translation apparatus. Nucleic acid processing (DNA replication/repair) was embodied in Nudix (d.113.1) and DNA breaking–rejoining enzyme (d.163.1) FSFs linked to pyrophosphorylase/pyrophosphatase and RNA-decapping activities, and to integrases and topoisomerases, respectively. Only five functional subcategories originated later on, in the superkingdom specification epoch, and were clearly related to the cellular make-up of organisms; they involved lipid/membrane binding and structural proteins, proteins associated with cell envelope biogenesis and the outer membrane, viral proteins and proteins related to oxygenic photosynthesis. Only one subcategory had its origin in the organismal diversification epoch (blood clotting). These proteins are therefore important markers in the architectural chronology. Similarly, α-solenoids, β-propellers, coiled coils and other architectures linked to the nuclear pore complex [186], a marker for the nuclear envelope in Eukarya and some bacterial lineages (e.g. Planctomycetes and Verrucomicrobia [187]), appeared (together with karyopherins that interact transiently with the complex) very late in evolution (ndFSF = 0.82–1.00). Nuclear pores therefore represent very modern protein complexes that were horizontally transferred or evolved convergently in Eukarya and Bacteria. Of all the main categories, extracellular processes appeared the latest, close to the boundary of the superkingdom specification epoch.
These categories involve immune responses, toxin and defence enzymes, and cell adhesion, functions related to definition of self and intercellular interactions (competition and multicellularity). It is logical that these functions would appear at the end of a communal world of organisms. The appearance of information-related processes and cellular motility has important consequences for origins of modern biochemistry and diversified life. Translation originated quite early and preceded the DNA repair/replication, transcription, RNA processing and chromatin structure subcategories, which developed in the timelines in that order (Figure 7). The early origin of translation was confirmed by tracing architectures of aminoacyl-tRNA synthetases, elongation factors and ribosomal proteins derived from crystallographic models and HMM searches in the trees (D. Caetano-Anollés, unpublished work). Models of amino acid evolution also supported the antiquity of aminoacyl-tRNA synthetases [178]. The observation that modern protein synthesis developed only after metabolic proteinaceous enzymes were in place suggests strongly that the translation apparatus underwent a fundamental revision during the evolution of modern proteins. The inception of cell motility also has important consequences. The microbes of the communal world were probably auxotrophic or heterotrophic organisms seeking to improve their survival in the changing environments of early Earth. Cellular motility provided better tools to seek and ingest food and, in some lineages, to prey on other members of the community. The development of phagotrophy (a hallmark of Eukarya) and mechanisms of cell motility could have ignited the rise of the tripartite world [188]. Indeed, fundamental FSF architectures associated with a number of important molecules linked to cell movement (e.g. tubulin, moesin, profilin and actin) originated at the end of the architectural diversification epoch (e.g.
tubulin nucleotide-binding domain-like and C-terminal domain-like, ndFSF = 0.31) and continued to accumulate, but together with toxins and defence architectures, which could have brought other means of warfare (Figure 7). Whereas many important proteins related to motility developed later in the timeline [e.g. phase 1 flagellin domain (ndFSF = 0.58), profilin (ndFSF = 0.73), moesin tail domain (ndFSF = 0.76), actin-cross-linking and depolymerizing domains (ndFSF = 0.85)], others that were multifunctional and ancient were probably recruited for the task (e.g. actin-like ATPase domain architecture, ndFSF = 0.04). It is therefore quite likely that the world of organisms underwent a transition from communal to competitive during superkingdom specification and that this triggered diversification of life.
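
The ndFSF values quoted above place each architecture on a relative 0–1 scale; the legend of Figure 7 describes this as a relative distance in nodes from a hypothetical ancestral architecture at the base of the tree of architectures. A minimal sketch of how such a value could be computed from a rooted tree follows. The toy tree, the leaf names `A` to `E`, the function name `node_distance` and the exact counting convention (internal nodes crossed below the root, divided by the maximum over all leaves) are illustrative assumptions, not the authors' published procedure.

```python
def node_distance(children, root):
    """Return a relative node distance (0 to 1) for every leaf of a rooted tree.

    children: dict mapping each internal node to a list of its child nodes.
    root:     name of the node at the base of the tree.

    Assumed convention (hypothetical): count the internal nodes crossed
    below the root on the path to each leaf, then divide by the largest
    count found among all leaves.
    """
    counts = {}
    stack = [(root, 0)]  # (node, number of edges from the root)
    while stack:
        node, depth = stack.pop()
        kids = children.get(node, [])
        if not kids:
            # A leaf: internal nodes crossed below the root = depth - 1.
            counts[node] = max(depth - 1, 0)
        else:
            for kid in kids:
                stack.append((kid, depth + 1))

    deepest = max(counts.values())
    if deepest == 0:
        return {leaf: 0.0 for leaf in counts}
    return {leaf: count / deepest for leaf, count in counts.items()}


# Hypothetical comb-like tree: A branches off first, D and E last.
toy_tree = {
    "root": ["A", "n1"],
    "n1": ["B", "n2"],
    "n2": ["C", "n3"],
    "n3": ["D", "E"],
}

nd = node_distance(toy_tree, "root")
for leaf in sorted(nd, key=nd.get):
    print(f"{leaf}: nd = {nd[leaf]:.2f}")
```

Under this convention the leaf that branches off at the base of the tree receives a value of 0 and the most derived leaves receive 1, so sorting leaves by the value yields a relative order of appearance rather than an absolute age.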

THE ORIGIN OF THE PROTEIN WORLD AND THE RISE OF MODERN METABOLISM

It is generally assumed that life originated as an emergent dissipative system with a series of autocatalytic processes that produced primordial metabolites [189–192]. Among these chemicals are the nucleotides and amino acids that are prerequisites for an ancient RNA world and an emergent protein world. As the latter developed, the first reactions available for RNA and protein molecules must have been metabolic reactions. Timelines already suggest that modern metabolism appeared very early on in evolution (Figure 7). Moreover, a detailed phylogenomic tracing analysis of protein architecture in metabolic networks [193] revealed that the nine most ancient architectures were responsible for the explosive appearance of most modern enzymatic functions [165]. In fact, a careful dissection of recruitment patterns indicated that modern metabolism originated in enzymes of nucleotide metabolism harbouring the P-loop-containing NTP hydrolase fold, probably in pathways linked to the purine metabolic subnetwork. This study was complemented recently with a battery of other evolutionary bioinformatic approaches, which revealed a succession of recruitment gateways, each mediated by the discovery of a new primordial fold [194]. These gateways produced a layered system reminiscent of Morowitz’s prebiotic shells [195] describing early evolutionary progressions and take-overs of ancient prebiotic chemistries. The first gateway originated in nucleotide metabolism, involved mostly transferases and was then extended to metabolism of cofactors. It was immediately followed by an ‘energy amphiphile’ lipid–carbohydrate core that provided enzymes for energy and hydrocarbon precursors established primordially in the self-replicating prebiotic entity. The TIM β/α barrel-mediated gateway later introduced amination reactions that converted keto acids into amino acids, mediating the incorporation of nitrogen into a multitude of metabolic processes.
These opened new recruitment possibilities and generated explosively the chemical diversity we currently encounter in modern metabolism. Phylogenomics therefore provides for the first time a link between the prebiotic and modern worlds, showing metabolism as a palimpsest that recapitulates prebiotic and perhaps ribozymic chemistries. We note that many of the very ancient architectures were involved in functions associated with ancient genes that were recently identified by physical clustering in bacterial genomes [196]. In this study, the evolutionarily conserved gene core divided into three layers: a first, highly connected layer centred on informational processes (fundamentally the ribosome and translation), a second featuring tRNA synthetases and other processes (e.g. proteolysis), and an outer, loosely connected layer (assumed to be more ancestral) linked to metabolism and highlighting metabolism of nucleotides, coenzymes and fatty acid molecules. The overall picture of these studies points clearly to an origin of modern proteins in the synthesis of nucleotides for a world in which RNA was the only encoded catalyst, but also to the coexistence of RNA, proteins and prebiotic chemistries, a concept that is in line with recent prebiotic experiments [192]. The centrality of RNA in the primordial make-up of the early protein-encoding organisms is thus revealed.

Figure 7 Evolution of biological function in the protein world

The evolutionary timeline shows the discovery of protein FSF architectures associated with different coarse-grained functional SUPERFAMILY categories in each superkingdom, with time measured by a relative distance in nodes from a hypothetical ancestral architecture at the base of the tree of architectures (see Supplementary Figure S3B at http://www.BiochemJ.org/bj/417/bj4170621add.htm). Pie charts below bins describe the distribution of architectures that are unique or shared between superkingdoms, and their areas are proportional to the total number of architectures in that bin. Arrowheads indicate the first appearance of architectures associated with functional subcategories that are listed. Details of their individual accumulation can be found in Supplementary Figure S4 at http://www.BiochemJ.org/bj/417/bj4170621add.htm. The three evolutionary epochs and corresponding phases of the protein world are labelled with different shades and follow previous definitions [149].

PROSPECTS

Ever since the first crystallographic structure was reported 50 years ago for sperm whale myoglobin (PDB code 1MBN) [197], advances in comparative and structural genomics have continued to provide an increasing number of sequences and crystal structures that are available for the study of the modern protein world. Recent advances in our understanding of protein structure and folding and the construction of powerful classification schemes provide a more thorough description of the hierarchical structure of this world. The linking of molecular evolution and structural biology now provides evolutionary views that are unprecedented. They prompt us to answer important questions. How discrete or continuous is protein space? What are the fundamental processes that drive the evolution of protein structure? What is the tempo and mode of architectural discovery? At what structural resolution do proteomes differ and how does it affect our definition of species? What are the principles that drive the evolutionary mechanics of domain combination in Nature? When and how did individual biological functions originate and evolve? We have reviewed the remarkable patterns related to the origin, evolution and structure of the protein world and the diversification of life inferred from comparative and phylogenomic analysis of protein structure. History reconstruction exercises unfold timelines of the discovery of architectures and functions and an emergent picture of primordial biochemistries. They uncover episodes of specialization, exemplified by the explosive rise of functionally specialized multidomain proteins. They also reveal patterns of simplification, such as reductive tendencies of protein repertoires in the proteomes of microbial organisms. More importantly, the results test long-standing and controversial hypotheses of how life originated and evolved.
The gates to the mysteries of how the living world emerged have been opened, and we are expecting a flood of new exciting discoveries.

ACKNOWLEDGMENTS

We thank Professor Steven Huber and Professor Alex Toker for the invitation to write this review, and members of the GCA research group for constructive discussions.

FUNDING

Supported by the National Science Foundation [grant numbers MCB-0343126 and MCB-0749836], the C-FAR Sentinel Program, the United States Department of Agriculture and the Critical Research Initiative of the University of Illinois.

REFERENCES

1 Pauling, L. and Corey, R. B. (1951) The polypeptide-chain configuration in hemoglobin and other globular proteins. Proc. Natl. Acad. Sci. U.S.A. 37, 282–285
2 Linderstrøm-Lang, K. and Schellman, J. A. (1959) Protein structure and enzymatic activity. In The Enzymes, 2nd edn (Lardy, H. and Myrback, K., eds.), pp. 443–510, Academic Press, New York
3 Söding, J. and Lupas, A. N. (2003) More than the sum of their parts: on the evolution of proteins from peptides. BioEssays 25, 837–846
4 Vogel, C., Bashton, M., Kerrison, N. D., Chothia, C. and Teichmann, S. A. (2004) Structure, function and evolution of multidomain proteins. Curr. Opin. Struct. Biol. 14, 208–216
5 Pereira-Leal, J. B., Levy, E. D., Kamp, C. and Teichmann, S. A. (2007) Evolution of protein complexes by duplication of homomeric interactions. Genome Biol. 8, R51
6 Schellman, J. A. and Schellman, C. G. (1997) Kaj Ulrik Linderstrøm-Lang (1896–1959). Protein Sci. 6, 1092–1100
7 Epstein, C. J., Goldberger, R. F. and Anfinsen, C. B. (1963) The genetic control of tertiary protein structure: model systems. Cold Spring Harbor Symp. Quant. Biol. 28, 439–449
8 Anfinsen, C. B. (1973) Principles that govern the folding of protein chains. Science 181, 223–230
9 Onuchic, J. N. and Wolynes, P. G. (2004) Theory of protein folding. Curr. Opin. Struct. Biol. 14, 70–75
10 Dill, K. A., Ozkan, S. B., Shell, M. S. and Weikl, T. R. (2008) The protein folding problem. Annu. Rev. Biophys. 37, 289–316
11 Englander, S. W., Mayne, L. and Krishna, M. M. G. (2007) Protein folding and misfolding: mechanism and principles. Q. Rev. Biophys. 40, 287–326
12 Ozkan, S. B., Wu, G. H. A., Chodera, J. D. and Dill, K. A. (2007) Protein folding by zipping and assembly. Proc. Natl. Acad. Sci. U.S.A. 104, 11987–11992
13 Duan, Y. and Kollman, P. A. (1998) Pathways to a protein folding intermediate observed in a 1-microsecond simulation in aqueous solution. Science 282, 740–744
14 Zagrovic, B., Snow, C. D., Shirts, M. R. and Pande, V. S. (2002) Simulation of folding of a small α-helical protein in atomistic detail using world-wide distributed computing. J. Mol. Biol. 323, 927–937
15 Felts, A. K., Harano, Y., Gallicchio, E. and Levy, R. M. (2004) Free-energy surfaces of β-hairpin and α-helical peptides generated by replica exchange molecular dynamics with the AGBNP implicit solvent models. Proteins 56, 310–321
16 Ołdziej, S., Czaplewski, C., Liwo, A., Chinchio, M., Nanias, M., Vila, J. A., Khalili, M., Arnautova, Y. A., Jagielska, A., Makowski, M. et al. (2005) Physics-based protein-structure prediction using a hierarchical protocol based on the UNRES force field: assessment in two blind tests. Proc. Natl. Acad. Sci. U.S.A. 102, 7547–7552
17 Lei, H. and Duan, Y. (2007) Ab initio folding of albumin binding domain from all-atom molecular dynamics simulation. J. Phys. Chem. B 111, 5458–5463
18 Mayor, U., Guydosh, N. R., Johnson, C. M., Grossmann, J. G., Sato, S., Jas, G. S., Freund, S. M. V., Alonso, D. O. V., Daggett, V. and Fersht, A. R. (2003) The complete folding pathway of a protein from nanoseconds to microseconds. Nature 421, 863–867
19 Religa, T. L., Markson, J. S., Mayor, U., Freund, S. M. and Fersht, A. R. (2005) Solution structure of a protein denatured state and folding intermediate. Nature 437, 1053–1056
20 Hoelzer, G. A., Smith, E. and Pepper, J. W. (2006) On the logical relationship between natural selection and self-organization. J. Evol. Biol. 19, 1785–1794
21 Schuster, P., Fontana, W., Stadler, P. and Hofacker, I. (1994) From sequences to shapes and back: a case study in RNA secondary structures. Proc. R. Soc. London Ser. B 255, 279–284
22 Babajide, A., Hofacker, I. L., Sippl, M. J. and Stadler, P. F. (1997) Neutral networks in protein space: a computational study based on knowledge-based potential of mean force. Folding Des. 2, 261–269
23 Fontana, W. (2002) Modelling ‘evo-devo’ with RNA. BioEssays 24, 1164–1177
24 Schuster, P. and Stadler, P. F. (2003) Networks in molecular evolution. Complexity 8, 34–42
25 Maynard Smith, J. (1970) Natural selection and the concept of protein space. Nature 225, 563–564
26 Salisbury, F. B. (1969) Natural selection and the complexity of the gene. Nature 224, 342–343
27 Schultes, E. A. and Bartel, D. P. (2000) One sequence, two ribozymes: implications for the emergence of new ribozyme folds. Science 289, 448–452
28 Babajide, A., Farber, R., Hofacker, I. L., Inman, J., Lapedes, A. S. and Stadler, P. F. (2001) Exploring protein sequence space using knowledge based potentials. J. Theor. Biol. 212, 35–46
29 Bornberg-Bauer, E. (1997) How are model protein structures distributed in sequence space? Biophys. J. 73, 2393–2403
30 Bastolla, U., Roman, H. E. and Vendruscolo, M. (1999) Neutral evolution of model proteins: diffusion in sequence space and overdispersion. J. Theor. Biol. 200, 49–64
31 Govindarajan, S. and Goldstein, R. A. (1997) The foldability landscape of model proteins. Biopolymers 42, 427–438
32 Orengo, C. A., Jones, D. T. and Thornton, J. M. (1994) Protein superfamilies and domain superfolds. Nature 372, 631–634
33 Bershtein, S. and Tawfik, D. S. (2008) Advances in laboratory evolution of proteins. Curr. Opin. Chem. Biol. 12, 151–158
34 Martinez, M. A., Pezo, V., Marlière, P. and Wain-Hobson, S. (1997) Exploring the functional robustness of an enzyme by in vitro evolution. EMBO J. 15, 1203–1210
35 Keefe, A. D. and Szostak, J. W. (2001) Functional proteins from a random-sequence library. Nature 410, 715–718
36 Seelig, B. and Szostak, J. W. (2007) Selection and evolution of enzymes from a partially randomized non-catalytic scaffold. Nature 448, 828–831
37 Bornberg-Bauer, E. and Chan, H. S. (1999) Modeling evolutionary landscapes: mutational stability, topology, and superfunnels in sequence space. Proc. Natl. Acad. Sci. U.S.A. 96, 10689–10694
38 Taverna, D. M. and Goldstein, R. A. (2002) Why are proteins so robust to site mutations? J. Mol. Biol. 315, 479–484
39 Wroe, R., Bornberg-Bauer, E. and Chan, H. S. (2005) Comparing folding codes in simple heteropolymer models of protein evolutionary landscapes: robustness of the superfunnel paradigm. Biophys. J. 88, 118–131
40 Huynen, M. A., Stadler, P. F. and Fontana, W. (1996) Smoothness within ruggedness: the role of neutrality in adaptation. Proc. Natl. Acad. Sci. U.S.A. 93, 397–401
41 van Nimwegen, E., Crutchfield, J. and Huynen, M. (1999) Neutral evolution of mutational robustness. Proc. Natl. Acad. Sci. U.S.A. 96, 9716–9720
42 Cordes, M. H. J., Burton, R. E., Walsh, N. P., McKnight, C. J. and Sauer, R. T. (2000) An evolutionary bridge to a new protein fold: interconversion of two native structures in a single mutant protein. Nat. Struct. Biol. 7, 1129–1132
43 Bloom, J. D., Silberg, J. J., Wilke, C. O., Drummond, D. A., Adami, C. and Arnold, F. H. (2005) Thermodynamic prediction of protein neutrality. Proc. Natl. Acad. Sci. U.S.A. 102, 606–611
44 Wroe, R., Chan, H. S. and Bornberg-Bauer, E. (2007) A structural model of latent evolutionary potentials underlying neutral networks in proteins. HFSP J. 1, 79–87
45 James, L. C. and Tawfik, D. S. (2003) Conformational diversity and protein evolution: a 60-year old hypothesis revisited. Trends Biochem. Sci. 28, 361–368
46 Aharoni, A., Gaidukov, L., Khersonsky, O., Gould, S. M., Roodveldt, C. and Tawfik, D. S. (2005) The ‘evolvability’ of promiscuous protein functions. Nat. Genet. 37, 73–76
47 Amitai, G., Devi-Gupta, R. and Tawfik, D. S. (2007) Latent evolutionary potentials under the neutral mutational drift of an enzyme. HFSP J. 1, 67–68
48 Bershtein, S., Segal, M., Bekerman, R., Tokuriki, N. and Tawfik, D. S. (2006) Robustness–epistasis link shapes the fitness landscape of a randomly drifting protein. Nature 444, 929–932
49 Trent, J. D., Gabrielsen, M., Jensen, B., Neuhard, J. and Olsen, J. (1994) Acquired thermotolerance and heat shock proteins in thermophiles from the three phylogenetic domains. J. Bacteriol. 176, 6148–6152
50 Laksanalamai, P., Whitehead, T. A. and Robb, F. T. (2004) Minimal protein-folding systems in hyperthermophilic Archaea. Nat. Rev. Microbiol. 2, 315–324
51 Saibil, H. R. (2008) Chaperone machines in action. Curr. Opin. Struct. Biol. 18, 35–42
52 Vainberg, I. E., Ampe, C., Cowan, N. J., Kleine, H. L., Lewis, H., Rommelaere, J. and Vandekerckhove, J. (1998) Prefoldin, a chaperone that delivers unfolded proteins to cytosolic chaperonin. Cell 93, 863–873
53 Ellis, R. J. and Minton, A. P. (2006) Protein aggregation in crowded environments. Biol. Chem. 387, 485–497
54 Glickman, M. H. and Ciechanover, A. (2002) The ubiquitin–proteasome proteolytic pathway: destruction for the sake of construction. Physiol. Rev. 82, 373–428
55 Balch, W. E., Morimoto, R. I., Dillin, A. and Kelly, J. W. (2008) Adapting proteostasis for disease intervention. Science 319, 916–919
56 Ron, D. and Walter, P. (2007) Signal integration in the endoplasmic reticulum unfolded protein response. Nat. Rev. Mol. Cell Biol. 8, 519–529
57 Wiseman, R. L., Powers, E. T., Buxbaum, J. N., Kelly, J. W. and Balch, W. E. (2007) An adaptable standard for protein export from the endoplasmic reticulum. Cell 131, 809–821
58 Bull, A. T., Goodfellow, M. and Slater, J. H. (1992) Biodiversity as a source of innovation in biotechnology. Annu. Rev. Microbiol. 46, 219–252
59 Whitman, W. B., Coleman, D. C. and Wiebe, W. J. (1998) Prokaryotes: the unseen majority. Proc. Natl. Acad. Sci. U.S.A. 95, 6578–6583
60 Brocchieri, L. and Karlin, S. (2005) Protein length in eukaryotic and prokaryotic proteomes. Nucleic Acids Res. 33, 3390–3400
61 Kurland, C. G., Canbäck, B. and Berg, O. G. (2007) The origins of modern proteomes. Biochimie 89, 1454–1563
62 Drake, J. W., Charlesworth, B., Charlesworth, D. and Crow, J. F. (1998) Rates of spontaneous mutation. Genetics 148, 1667–1686
63 Bajaj, M. and Blundell, T. (1984) Evolution and the tertiary structure of proteins. Annu. Rev. Biophys. Bioeng. 13, 453–492
64 Vukmirovic, O. G. and Tilghman, S. M. (2000) Exploring genome space. Nature 405, 820–822
65 Sober, E. and Steel, M. (2002) Testing the hypothesis of common ancestry. J. Theor. Biol. 218, 395–408
66 Penny, D., Hendy, M. D. and Poole, A. M. (2003) Testing fundamental evolutionary hypotheses. J. Theor. Biol. 223, 377–385
67 Mossel, E. (2003) On the impossibility of reconstructing ancestral data and phylogenies. J. Comp. Biol. 10, 669–678
68 Pal, C., Papp, B. and Hurst, L. D. (2001) Highly expressed genes in yeast evolve slowly. Genetics 158, 927–931
69 Wall, D. P., Hirsch, A. E., Fraser, H. B., Kum, J., Giaever, G., Eisen, M. B. and Feldman, M. W. (2005) Functional genomic analysis of the rates of protein evolution. Proc. Natl. Acad. Sci. U.S.A. 102, 5483–5488
70 Drummond, D. A., Raval, A. and Wilke, C. O. (2006) A single determinant dominates the rate of yeast protein evolution. Mol. Biol. Evol. 23, 327–337
71 Kim, P. M., Lu, L. J., Xia, Y. and Gerstein, M. B. (2006) Relating three-dimensional structures to protein networks provides evolutionary insights. Science 314, 1938–1941
72 Kim, P. M., Korbel, J. A. and Gerstein, M. B. (2007) Positive selection at the protein network periphery: evaluation in terms of structural constraints and cellular context. Proc. Natl. Acad. Sci. U.S.A. 104, 20274–20279
73 Zhou, T., Drummond, D. A. and Wilke, C. O. (2008) Contact density affects protein evolutionary rate from bacteria to animals. J. Mol. Evol. 66, 395–404
74 Simon, A. L., Stone, E. A. and Sidow, A. (2002) Inference of functional regions in proteins by quantification of evolutionary constraints. Proc. Natl. Acad. Sci. U.S.A. 99, 2912–2917
75 Cooper, G. M. and Brown, C. D. (2008) Qualifying the relationship between sequence conservation and molecular function. Genome Res. 18, 201–205
76 Grant, A., Lee, D. and Orengo, C. (2004) Progress towards mapping the universe of protein folds. Genome Biol. 5, 107
77 Kunin, V., Cases, I., Enright, A. J., de Lorenzo, V. and Ouzounis, C. A. (2003) Myriads of protein families, and still counting. Genome Biol. 4, 401
78 Zhang, Y., Hubner, I. A., Arakaki, A. K., Shakhnovich, E. and Skolnick, J. (2006) On the origin and highly likely completeness of single-domain protein structures. Proc. Natl. Acad. Sci. U.S.A. 103, 2605–2610
79 Marsden, R. L. and Orengo, C. A. (2008) The classification of protein domains. In Bioinformatics, Volume II: Structure, Function and Applications, vol. 453 (Keith, J. M., ed.), pp. 123–146, Humana Press, Totowa
80 Richardson, J. S. (1981) The anatomy and taxonomy of protein structure. Adv. Protein Chem. 34, 167–339
81 Finn, R. D., Mistry, J., Schuster-Böckler, B., Griffiths-Jones, S., Hollich, V., Lassmann, T., Moxon, S., Marshall, M., Khanna, A., Durbin, R. et al. (2006) Pfam: clans, web tools and services. Nucleic Acids Res. 34, D247–D251
82 Murzin, A., Brenner, S. E., Hubbard, T. and Chothia, C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540
83 Andreeva, A., Howorth, D., Chandonia, J. M., Brenner, S. E., Hubbard, T. J., Chothia, C. and Murzin, A. G. (2008) Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 36, D414–D425
84 Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. and Thornton, J. M. (1997) CATH: a hierarchic classification of protein structure. Structure 5, 1093–1098
85 Greene, L. H., Lewis, T. E., Addou, S., Cuff, A., Dallman, T., Dibley, M., Redfern, O., Pearl, F., Nambudiry, R., Reid, A. et al. (2007) The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution. Nucleic Acids Res. 35, D291–D297
86 Holm, L. and Sander, C. (1998) Dictionary of recurrent domains in protein structures. Proteins 33, 88–89
87 Hadley, C. and Jones, D. T. (1999) A systematic comparison of protein structure classifications: SCOP, CATH and FSSP. Structure 7, 1099–1112
88 Mulder, N. J., Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Buillard, V., Cerutti, L., Copley, R. et al. (2007) New developments in the InterPro database. Nucleic Acids Res. 35, D224–D228
89 Redfern, O. C., Desailly, B. and Orengo, C. A. (2008) Exploring the structure and function paradigm. Curr. Opin. Struct. Biol. 18, 394–402
90 Bashton, M. and Chothia, C. (2007) The generation of new functions by the combination of domains. Structure 15, 85–99
91 Reeves, G. A., Dallman, T. J., Redfern, O. C., Akpor, A. and Orengo, C. A. (2006) Structural diversity of domain superfamilies in the CATH database. J. Mol. Biol. 360, 725–741
92 Pan, J. L. and Bardwell, J. C. A. (2006) The origami of thioredoxin-like folds. Protein Sci. 15, 2217–2227
93 Shindyalov, I. N. and Bourne, P. E. (2000) An alternative view of protein fold space. Proteins 38, 247–260
94 Harrison, A., Pearl, F., Mott, R., Thornton, J. and Orengo, C. (2002) Quantifying the similarities within fold space. J. Mol. Biol. 323, 909–926
95 Kolodny, R., Petrey, D. and Honig, B. (2006) Protein structure comparison: implications for the nature of ‘fold space’, and structure and function prediction. Curr. Opin. Struct. Biol. 16, 393–398
96 Xie, L. and Bourne, P. E. (2008) Detecting evolutionary relationships across existing fold space, using sequence order-independent profile-profile alignments. Proc. Natl. Acad. Sci. U.S.A. 105, 5441–5446
97 Andreeva, A. and Murzin, A. G. (2006) Evolution of protein fold in the presence of functional constraints. Curr. Opin. Struct. Biol. 16, 399–408
98 Moore, A. D., Björklund, Å. K., Ekman, D., Bornberg-Bauer, E. and Elofsson, A. (2008) Arrangements in the modular evolution of proteins. Trends Biochem. Sci. 33, 444–451
99 Koonin, E. V., Wolf, Y. I. and Karev, G. P. (2002) The structure of the protein universe and genome evolution. Nature 420, 218–223
100 Huynen, M. A. and van Nimwegen, E. (1998) The frequency distribution of family sizes in complete genomes. Mol. Biol. Evol. 15, 583–589
101 Rzhetsky, A. and Gomez, S. M. (2001) Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics 17, 988–996
102 Qian, J., Luscombe, N. M. and Gerstein, M. (2001) Protein family and fold occurrence in genomes: power-law behavior and evolutionary model. J. Mol. Biol. 313, 673–681
103 Coulson, A. F. and Moult, J. A. (2002) A unifold, mesofold and superfold model of protein fold use. Proteins 46, 61–71
104 Karev, G. P., Wolf, Y. I., Rzhetsky, A. Y., Berezovskaya, F. S. and Koonin, E. V. (2002) Birth and death of protein domains: a simple model of evolution explains power law behavior. BMC Evol. Biol. 2, 18
105 Karev, G. P., Wolf, Y. I. and Koonin, E. V. (2003) Simple stochastic birth and death models of genome evolution: was there enough time for us to evolve? Bioinformatics 19, 1889–1900
106 Karev, G. P., Wolf, Y. I., Berezovskaya, F. S. and Koonin, E. V. (2004) Gene family evolution: an in-depth theoretical and simulation analysis of non-linear birth–death–innovation models. BMC Evol. Biol. 4, 32
107 Caetano-Anollés, G. and Caetano-Anollés, D. (2003) An evolutionarily structured universe of protein architecture. Genome Res. 13, 1563–1571
108 Goldstein, R. A. (2008) The structure of protein evolution and the evolution of protein structure. Curr. Opin. Struct. Biol. 18, 170–177
109 Zeldovich, K.
B., Chen, P., Shakhnovich, B. E. and Shakhnovich, E. I. (2007) A first-principles model of early evolution: emergence of gene families, species and preferred protein folds. PLoS Comput. Biol. 3, 1224–1238 110 Darwin, C. R. (1859) On the Origin of Species by Means of Natural Selection, Murray, London 111 Woese, C. R. (2004) A new biology for a new century. Microbiol. Mol. Biol. Rev. 68, 173–186 112 Eventhoff, W. and Rossmann, M. G. (1975) The evolution of dehydrogenases and kinases. CRC Crit. Rev. Biochem. 3, 111–140 113 Johnson, K. S., Sutcliff, M. J. and Blundell, T. L. (1990) Molecular anatomy: phyletic relationships derived from three-dimensional structures of proteins. J. Mol. Evol. 30, 43–59 114 Bujnicki, J. M. (2000) Phylogeny of restriction endonuclease-like superfamily inferred from comparison of protein sequences. J. Mol. Evol. 50, 39–44 115 Breitling, R., Laubner, D. and Adamski, J. (2001) Structure-based phylogenetic analysis of short-chain alcohol dehydrogenases and reclassification of the 17β-hydroxysteroid dehydrogenase family. Mol. Biol. Evol. 18, 2154–2161 116 O’Donoghue, P. and Luthey-Schulten, Z. (2003) On the evolution of structure in aminoacyl-tRNA synthetases. Microbiol. Mol. Biol. Rev. 67, 550–573 117 Scheef, E. D. and Bourne, P. E. (2005) Structural evolution of the protein kinase-like superfamily. PLoS Comp. Biol. 1, e49 118 Holm, L. and Sander, C. (1993) Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 223, 123–138 119 Røgen, P. and Fain, B. (2003) Automatic classification of protein structure by using Gauss integrals. Proc. Natl. Acad. Sci. U.S.A. 100, 119–124 120 Hou, J., Sims, G. E., Zhang, C. and Kim, S.-H. (2003) A global representation of the protein fold space. Proc. Natl. Acad. Sci. U.S.A. 100, 2386–2390 121 Hou, J., Jun, S.-H., Zhang, C. and Kim, S.-H. (2005) Global mapping of the protein structure space and application in structure-based inference of protein function. Proc. Natl. Acad. Sci. U.S.A. 
102, 3651–3656 122 Efimov, A. V. (1997) Structural trees for protein superfamilies. Proteins 28, 241–260 123 Zhang, C. and Kim, S.-H. (2000) A comprehensive analysis of the Greek key motifs in protein β-barrels and β-sandwiches. Proteins 40, 409–419 124 Przytycka, T., Aurora, R. and Rose, G. D. (1999) A protein taxonomy based on secondary structure. Nat. Struct. Biol. 6, 672–682 125 Dokholyan, N. V., Shakhnovich, B. and Shakhnovich, E. I. (2002) Expanding protein universe and its origin from the biological Big Bang. Proc. Natl. Acad. Sci. U.S.A. 99, 14132–14136 126 Shakhnovich, B. E. (2005) Improving the precision of the structure–function relationship by considering phylogenetic context. PLoS Comput. Biol. 1, e9 127 Rose, G. D., Fleming, P. J., Banavar, J. R. and Maritan, A. (2006) A backbone-based theory of protein folding. Proc. Natl. Acad. Sci. U.S.A. 103, 16623–16663 128 Taylor, W. R. (2007) Evolutionary transitions in protein fold space. Curr. Opin. Struct. Biol. 17, 354–361 129 Taylor, W. R. (2002) A ‘periodic table’ for protein structures. Nature 416, 657–660  c The Authors Journal compilation  c 2009 Biochemical Society

130 Gerstein, M. and Levitt, M. (1997) A structural census of the current population of protein sequences. Proc. Natl. Acad. Sci. U.S.A. 94, 11911–11916 131 Gerstein, M. (1997) A structural census of genomes: comparing bacterial, eukaryotic and archaeal genomes in terms of protein structure. J. Mol. Biol. 274, 562–576 132 Gerstein, M. (1998) Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census. Proteins 33, 518–534 133 Frishman, D. and Mewes, H.-W. (1997) Protein structural classes in five complete genomes. Nat. Struct. Biol. 4, 626–628 134 Wolf, Y. I., Brenner, S. E., Bash, P. A. and Koonin, E. V. (1999) Distribution of protein folds in the three superkingdoms of life. Genome Res. 9, 17–26 135 Frishman, D. and Mewes, H.-W. (1997) PEDANTic genome analysis. Trends Genet. 13, 415–416 136 Gough, J., Karplus, K., Hughey, R. and Cothia, C. (2001) Assignment of homology to genome sequences using a library of Hidden Markov Models that represent all proteins of known structure. J. Mol. Biol. 313, 903–991 137 Wilson, D., Madera, M., Vogel, C., Chothia, C. and Gough, J. (2007) The SUPERFAMILY database in 2007: families and functions. Nucleic Acids Res. 35, D308–D313 138 Buchan, D., Pearl, F., Lee, D., Shepherd, A., Rison, S., Thornton, J. M. and Orengo, C. (2002) Gene3-D: structural assignments for whole genes and genomes using the CATH domain structure database. Genome Res. 12, 503–514 139 Yeats, C., Lees, J., Reid, A., Kelam, P., Martin, N., Liu, X. and Orengo, C. A. (2008) Gene3D: comprehensive structural and functional annotation of genomes. Nucleic Acids Res. 36, D414–D418 140 Teichmann, S. A., Rison, S. C. G., Thornton, J. M., Riley, M., Gough, J. and Chothia, C. (2001) Small-molecule metabolism: an enzyme mosaic. Trends Biotechnol. 19, 482–486 141 Teichmann, S. A., Rison, S. C. G., Thornton, J. M., Riley, M., Gough, J. and Chothia, C. 
(2001) The evolution and structural anatomy of the small molecule metabolic pathways in Escherichia coli . J. Mol. Biol. 311, 693–708 142 Apic, G., Gough, J. and Teichmann, S. A. (2001) An insight into domain combinations. Bioinformatics 17 (Suppl. 3), S83–S89 143 Apic, G., Gough, J. and Teichmann, S. A. (2001) Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J. Mol. Biol. 310, 311–325 144 Abeln, S. and Deane, C. M. (2005) Fold usage on genomes and protein fold evolution. Proteins 60, 690–700 145 Malek, J. A. (2001) Abundant protein domains occur in proportion to proteome size. Genome Biol. 2, research0039 146 Lin, J. and Gerstein, M. (2000) Whole-genome trees based on the occurrence of folds and orthologs: implications for comparing genomes on different levels. Genome Res. 10, 808–818 147 Deeds, E. J., Hennessey, H. and Shakhnovich, E. I. (2005) Prokaryotic phylogenies inferred from protein structural domains. Genome Res. 15, 393–402 148 Yang, S., Doolittle, R. F. and Bourne, P. E. (2005) Phylogeny determined by protein domain content. Proc. Natl. Acad. Sci. U.S.A. 102, 373–378 149 Wang, M., Yafremava, L. S., Caetano-Anoll´es, D., Mittenthal, J. E. and Caetano-Anoll´es, G. (2007) Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world. Genome Res. 17, 1572–1585 150 Wang, M. and Caetano-Anoll´es, G. (2006) Global phylogeny determined by the combination of protein domains in proteomes. Mol. Biol. Evol. 23, 2444–2454 151 Fukami-Kobayashi, K., Minezaki, Y., Tateno, Y. and Nishikawa, K. (2007) A tree of life based on protein domain organizations. Mol. Biol. Evol. 24, 1181–1189 152 Doolittle, R. F. (2005) Evolutionary aspects of whole-genome biology. Curr. Opin. Struct. Biol. 15, 248–253 153 Woese, C. R., Kandler, O. and Wheelis, M. L. (1990) Towards a natural system of organisms: proposal for the domains Archaea, Bacteria and Eucarya. Proc. Natl. Acad. Sci. U.S.A. 87, 4576–4579 154 Wolf, Y., Rogozin, I. 
B. and Koonin, E. V. (2004) Coelomata and not Ecdysozoa: evidence from genome-wide phylogenetic analysis. Genome Res. 14, 29–36 155 Huerta-Cepas, J., Dopazo, H., Dopazo, J. and Gabald´on, T. (2007) The human phylome. Genome Biol. 8, R109 156 Glansdorff, N., Xu, Y. and Labedan, B. (2008) The Last Universal Common Ancestor: emergence, constitution and genetic legacy of an elusive forerunner. Biol. Direct 3, 29 157 Caetano-Anoll´es, G. (2002) Evolved RNA secondary structure and the rooting of the universal tree of life. J. Mol. Evol. 54, 333–345 158 Gough, J. (2005) Convergent evolution of domain architectures (is rare). Bioinformatics 21, 1464–1471 159 Forslund, K., Henricson, A., Hollich, V. and Sonnhammer, E. L. L. (2008) Domain tree-based analysis of protein architecture evolution. Mol. Biol. Evol. 25, 254–264 160 Winstanley, H. F., Abeln, S. and Deane, C. M. (2005) How old is your fold? Bioinformatics 21, i449-i458

Evolution of the protein world 161 Choi, I.-G. and Kim, S.-H. (2006) Evolution of protein structural classes and protein sequence families. Proc. Natl. Acad. Sci. U.S.A. 103, 14056–14061 162 Caetano-Anoll´es, G. and Caetano-Anoll´es, D. (2005) Universal sharing patterns in proteomes and evolution of protein fold architecture and life. J. Mol. Evol. 60, 484–498 163 Wang, M., Boca, S. M., Kalelkar, R., Mittenthal, J. E. and Caetano-Anoll´es, G. (2006) A phylogenomic reconstruction of the protein world based on a genomic census of protein fold architecture. Complexity 12, 27–40 164 Caetano-Anoll´es, G., Sun, F. J., Wang, M., Yafremava, L. S., Harish, A., Kim, H. S., Knudsen, V., Caetano-Anoll´es, D. and Mittenthal, J. E. (2008) Origins and evolution of modern biochemistry: insights from genomes and molecular structure. Front. Biosci. 13, 5212–5240 165 Caetano-Anoll´es, G., Kim, H. S. and Mittenthal, J. E. (2007) The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture. Proc. Natl. Acad. Sci. U.S.A. 104, 9358–9363 166 Pagel, M., Venditti, C. and Meade, A. (2006) Large punctuational contribution of speciation to evolutionary divergence at the molecular level. Science 314, 119–121 167 Sun, F.-J. and Caetano-Anoll´es, G. (2008) Evolutionary patterns in the sequence and structure of transfer RNA: early origins of Archaea and viruses. PLoS Comput. Biol. 4, e1000018 168 Xue, H., Tong, K. L., Mark, C., Grosjean, M. and Wong, J. T. (2003) Transfer RNA paralogs: evidence for genetic code–amino acid biosynthesis coevolution and an archaeal root of life. Gene 22, 59–66 169 Di Giulio, M. (2007) The tree of life might be rooted in the branch leading to Nanoarchaeota. Gene 401, 108–113 170 Di Giulio, M. (2008) The origin of genes could be polyphyletic. Gene 426, 39–46 171 Castresana, J. (2001) Comparative genomics and bioenergetics. Biochim. Biophys. Acta 1506, 147–162 172 Ranea, J. A. G., Sillero, A., Thornton, J. M. and Orengo, C. A. 
(2006) Protein superfamily evolution and the Last Universal Common Ancestor (LUCA). J. Mol. Evol. 63, 513–525 173 Ouzounis, C. A., Kunin, V., Darzentas, N. and Goldovsky, L. (2006) A minimal estimate for the gene content of the last universal common ancestor: exobiology from a terrestrial perspective. Res. Microbiol. 157, 57–68 174 Ma, B.-G., Chen, L., Ji, H.-F., Chen, Z.-H., Yang, F.-R., Wang, L., Qu, G., Jiang, Y.-Y., Ji, C. and Zhang, H.-Y. (2008) Characters of very ancient proteins. Biochem. Biophys. Res. Commun. 366, 607–611 175 Ji, H.-F., Kong, D.-X, Shen, L., Chen, L.-L., Ma, B.-G. and Zhang, H.-Y. (2007) Distribution patterns of small-molecule ligands in the protein universe and implications for origin of life and drug discovery. Genome Biol. 8, R176 176 Murzin, A. (1998) How far divergent evolution goes in proteins. Curr. Opin. Struct. Biol. 8, 380–387 177 Grishin, N. V. (2001) Fold change in evolution of protein structures. J. Struct. Biol. 134, 167–185 178 Ji, H.-F. and Zhang, H.-Y. (2007) Protein architecture chronology deduced from structures of amino acid synthases. J. Biomol. Struct. Dyn. 24, 321–323 179 White, S. H. (1994) Global statistics of protein sequences: implications for the origin, evolution, and prediction of structure. Annu. Rev. Biophys. Biomol. Struct. 23, 407–439 180 Taylor, W. R. (2006) Topological accessibility shows a distinct asymmetry in the folds of αβ proteins. FEBS Lett. 580, 5263–5267

637

181 Deane, C. M., Dong, M., Huard, F. P. E., Lance, B. K. and Wood, G. R. (2007) Cotranslational protein folding: fact or fiction? Bioinformatics 23, i142–i148 182 Chothia, C. (1976) The nature of the accessible and buried surfaces in proteins. J. Mol. Biol. 105, 1–12 183 Chow, C. C., Chow, C., Raghunathan, V., Huppert, T. J., Kimball, E. B. and Cavagnero, S. (2003) Chain length dependence of apomyoglobin folding: structural evolution from misfolded sheets to native helices. Biochemistry 42, 7090–7099 184 Dupont, C. L., Yang, S., Palenik, B. and Bourne, P. E. (2006) Modern proteomes contain putative imprints of ancient shifts in trace metal geochemistry. Proc. Natl. Acad. Sci. U.S.A. 103, 17822–17827 185 Raymond, J. and Segre, D. (2006) The effect of oxygen on biochemical networks and the evolution of complex life. Science 311, 1764–1767 186 Devos, D., Dokudovskaya, S., Williams, R., Alber, F., Eswar, N., Chait, B. T., Rout, M. P. and Sali, A. (2006) Simple fold composition and molecular architecture of the nuclear pore complex. Proc. Natl. Acad. Sci. U.S.A. 103, 2172–2177 187 Fuerst, J. A. (2005) Intracellular compartmentation in planctomycetes. Annu. Rev. Microbiol. 59, 299–328 188 Kurland, C. G., Collins, L. J. and Penny, D. (2006) Genomics and the irreducible nature of eukaryotic cells. Science 312, 1011–1014 189 Lazcano, A. and Miller, S. L. (1999) On the origin of metabolic pathways. J. Mol. Evol. 49, 424–431 190 Orgel, L. E. (2000) Self-organizing biochemical cycles. Proc. Natl. Acad. Sci. U.S.A. 97, 12503–12507 191 Orgel, L. E. (2000) Some consequences of the RNA world hypothesis. Origin Life Evol. Biosphere 33, 211–218 192 W¨achtersh¨auser, G. (2007) On the chemistry and evolution of the pioneer organism. Chem. Biodiversity 4, 584–602 193 Kim, H. S., Mittenthal, J. E. and Caetano-Anoll´es, G. (2006) MANET: tracing evolution of protein architecture in metabolic networks. BMC Bioinformatics 7, 351 194 Caetano-Anoll´es, G., Yafremava, L. 
S., Gee, H., Caetano-Anoll´es, D., Kim, H. S. and Mittenthal, J. E. (2008) The origin and evolution of modern metabolism. Int. J. Biochem. Cell Biol. 41, 285–297 195 Morowitz, H. (1999) A theory of biochemical organization, metabolic pathways, and evolution. Complexity 4, 39–53 196 Danchin, A., Fang, G. and Noria, S. (2007) The extant core bacterial proteome is an archive of the origin of life. Proteomics 7, 875–889 197 Kendrew, J. C., Bodo, G., Dintzis, H. M., Parrish, R. G., Wycoff, H. W. and Phillips, D. C. (1958) A three-dimensional model of the myoglobin molecule obtained by X-ray analysis. Nature 181, 662–666 198 Liolios, K., Tavernarakis, N., Huhenholtz, P. and Kyrpides, N. C. (2006) The Genomes On Line Database (GOLD) v2: a monitor of genome projects worldwide. Nucleic Acids Res. 34, D332–D334 199 Kabsch, W. and Sander, C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637 200 Wang, M. and Caetano-Anoll´es, G. (2009) The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world. Structure, in the press 201 Vogel, C. and Chothia, C. (2006) Protein family expansions and biological complexity. PLoS Comput. Biol. 2, e48

Received 10 October 2008/11 November 2008; accepted 17 November 2008 Published on the Internet 16 January 2009, doi:10.1042/BJ20082063



SUPPLEMENTARY ONLINE DATA

The origin, evolution, and structure of the protein world

Gustavo CAETANO-ANOLLÉS*1, Minglei WANG*, Derek CAETANO-ANOLLÉS*† and Jay E. MITTENTHAL†

*Department of Crop Sciences, University of Illinois at Urbana-Champaign, 1101 W. Peabody Drive, Urbana, IL 61801, U.S.A., and †Department of Cell and Developmental Biology, University of Illinois at Urbana-Champaign, 601 S. Goodwin Avenue, Urbana, IL 61801, U.S.A.

Table S1 Average length and number of secondary structures in all-α, all-β, α/β and α+β protein classes

Structures: H, α-helix; B, residues in isolated β-bridge; E, β-strands (they participate in β-sheets); G, 3₁₀-helix; I, π-helix; T, hydrogen-bonded turn; S, bend; C, coil. Values are means ± S.D.

Average length of segments

Structure   All-α            All-β              α/β              α+β
H           12.78 ± 3.75     6.54 ± 3.25        10.50 ± 1.40     10.78 ± 2.84
B           0.68 ± 0.48      0.98 ± 0.21        0.99 ± 0.17      0.84 ± 0.39
E           2.18 ± 2.20      5.67 ± 1.38        4.90 ± 0.72      5.66 ± 1.64
G           2.77 ± 1.38      2.97 ± 1.13        3.36 ± 0.65      2.87 ± 1.18
I           0.45 ± 1.45      0.39 ± 1.34        1.12 ± 2.14      0.33 ± 1.25
T           2.00 ± 0.28      2.18 ± 0.19        2.06 ± 0.16      2.07 ± 0.29
S           1.51 ± 0.26      1.66 ± 0.25        1.51 ± 0.13      1.60 ± 0.26
C           2.02 ± 0.42      1.84 ± 0.39        1.85 ± 0.21      1.93 ± 0.32

Average number of segments

Structure   All-α            All-β              α/β              α+β
H           6.82 ± 4.46      1.95 ± 1.96        9.00 ± 4.71      4.52 ± 2.81
B           1.15 ± 1.81      2.46 ± 1.76        2.95 ± 2.36      2.06 ± 2.07
E           1.60 ± 2.93      10.43 ± 5.42       9.41 ± 3.91      6.87 ± 3.92
G           1.52 ± 1.70      1.53 ± 1.21        3.11 ± 2.04      1.77 ± 1.53
I           0.013 ± 0.090    0.00070 ± 0.0036   0.021 ± 0.075    0.0090 ± 0.076
T           8.01 ± 6.40      8.69 ± 4.90        14.80 ± 7.40     8.53 ± 4.88
S           7.72 ± 6.48      10.56 ± 6.24       15.35 ± 7.91     9.48 ± 5.58
C           12.79 ± 10.31    19.51 ± 9.82       26.99 ± 12.86    17.03 ± 9.56

Average total length of segments

Structure   All-α            All-β              α/β              α+β
H           85.14 ± 54.96    18.80 ± 18.58      94.36 ± 49.95    48.19 ± 31.24
B           1.77 ± 2.05      3.08 ± 1.74        3.52 ± 2.29      2.57 ± 2.04
E           9.83 ± 18.90     59.13 ± 32.88      45.99 ± 18.33    38.80 ± 23.09
G           6.19 ± 5.56      6.38 ± 4.02        11.31 ± 6.92     6.82 ± 5.10
I           0.45 ± 1.45      0.39 ± 1.34        1.12 ± 2.15      0.33 ± 1.25
T           16.18 ± 13.14    19.16 ± 10.78      30.58 ± 15.83    17.95 ± 10.62
S           11.95 ± 10.22    17.68 ± 10.31      23.22 ± 11.86    15.16 ± 8.77
C           25.57 ± 21.00    36.31 ± 20.93      50.53 ± 25.66    32.99 ± 19.46
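The three quantities tabulated in Table S1 (number, average length and total length of segments per structure type) can be illustrated with a minimal sketch. It assumes the input is a per-residue string of one-letter DSSP codes for a single chain. The names `DSSP_CODES` and `segment_stats` are hypothetical, and scoring a structure type with no segments as average length 0 is an assumption, not necessarily the convention used for the table.

```python
from itertools import groupby

# One-letter codes as listed in the Table S1 legend.
DSSP_CODES = "HBEGITSC"


def segment_stats(dssp: str) -> dict:
    """Return {code: (n_segments, mean_segment_length, total_length)} for one chain.

    Illustrative only: this is not the pipeline used to build Table S1.
    """
    # DSSP output leaves coil residues blank (some tools write '-'); map both to 'C'.
    normalized = "".join("C" if ch in " -" else ch for ch in dssp)

    # Collect the length of every run of identical codes.
    runs = {code: [] for code in DSSP_CODES}
    for code, group in groupby(normalized):
        if code in runs:
            runs[code].append(sum(1 for _ in group))

    stats = {}
    for code, lengths in runs.items():
        n = len(lengths)
        total = sum(lengths)
        # Assumption: a structure type absent from the chain gets mean length 0.
        stats[code] = (n, total / n if n else 0.0, total)
    return stats


if __name__ == "__main__":
    # Toy string, not taken from any PDB entry.
    for code, values in segment_stats("HHHH  EEE-HHTT").items():
        print(code, values)
```

Averaging these per-chain tuples over all chains assigned to a fold, and then over folds in a class, would give values of the kind reported in the table.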

1 To whom correspondence should be addressed (email [email protected]).

Figure S1 Major protein classes of globular proteins grouped according to features of secondary structure

The DSSP program [1], which standardizes secondary structure assignment, was used to calculate the average number (A), average length (B) and average total length (C) of segments of secondary structure in a peptide chain. All PDB files in SCOP version 1.67 (61175 peptide chains) were included in the analysis, and features were calculated from chains belonging to the same SCOP fold, for all folds. Plots compared each feature of secondary structure with every other feature. The Figure shows only the comparison of the average total length of α-helical and β-strand segments for the all-α, all-β, α/β and α+β classes of globular proteins. Averages are given in Supplementary Table S1.

Figure S2 Universal phylogenomic trees of proteomes reconstructed from an analysis of protein domains at different architectural levels

Construction of these trees involved a structural census that assigns domain structure to sequences. Trees were obtained from a fold-usage distance-based analysis of the occurrence of 338 folds (F) (SCOP version 1.35) in eight [2] and 20 [3] genomes (A), and from a maximum parsimony analysis of the abundance of 507 F (SCOP version 1.59) in 32 genomes [4] (B) and 1259 fold superfamilies (FSFs) (SCOP version 1.67) in 185 genomes [5] (C) respectively. In some cases in (C), terminal leaves are not labelled with organismal names as they would not be legible. Arrowheads indicate the location of the root when using polarized characters. Organism abbreviations: Aaeo, Aquifex aeolicus; Aful, Archaeoglobus fulgidus; Aper, Aeropyrum pernix; Atha, Arabidopsis thaliana; Bbur, Borrelia burgdorferi; Bsub, Bacillus subtilis; Cace, Clostridium acetobutylicum; Cele, Caenorhabditis elegans; Cpne, Chlamydia pneumoniae; Ctra, Chlamydia trachomatis; Dmel, Drosophila melanogaster; Drad, Deinococcus radiodurans; Ecol, Escherichia coli; Halo, Halobacterium sp.; Hinf, Haemophilus influenzae; Hpyl, Helicobacter pylori; Mgen, Mycoplasma genitalium; Mjan, Methanococcus jannaschii; Mpne, Mycoplasma pneumoniae; Mthe, Methanobacterium thermoautotrophicum; Mtub, Mycobacterium tuberculosis; Ncra, Neurospora crassa; Phor, Pyrococcus horikoshii; Rpro, Rickettsia prowazekii; Saur, Staphylococcus aureus; Scer, Saccharomyces cerevisiae; Spom, Schizosaccharomyces pombe; Ssol, Sulfolobus solfataricus; Stok, Sulfolobus tokodaii; Syne, Synechocystis sp.; Taci, Thermoplasma acidophilum; Tmar, Thermotoga maritima; Tpal, Treponema pallidum.
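As a rough illustration of how a census of fold occurrence can feed a distance-based tree, the sketch below builds pairwise distances between genomes from presence/absence of folds. The Jaccard distance is swapped in here as an assumption; the cited studies [2,3] may define fold-usage distance differently. The function names and the toy census (which folds occur in which genome) are hypothetical and serve only to show the data flow.

```python
from itertools import combinations


def jaccard_distance(folds_a: set, folds_b: set) -> float:
    """1 - |shared folds| / |all folds seen in either genome| (an assumed metric)."""
    union = folds_a | folds_b
    if not union:
        return 0.0
    return 1.0 - len(folds_a & folds_b) / len(union)


def distance_matrix(census: dict) -> dict:
    """Pairwise distances keyed by (genome_a, genome_b), genomes in sorted order."""
    return {
        (a, b): jaccard_distance(census[a], census[b])
        for a, b in combinations(sorted(census), 2)
    }


if __name__ == "__main__":
    # Toy census: the fold assignments below are made up for illustration.
    toy_census = {
        "Ecol": {"c.37", "c.1", "a.4"},
        "Hinf": {"c.37", "c.1"},
        "Mjan": {"c.37", "d.58"},
    }
    for pair, dist in distance_matrix(toy_census).items():
        print(pair, round(dist, 3))
```

A matrix of this kind could then be passed to any distance-based tree-building method, such as neighbour joining. The maximum parsimony trees in (B) and (C) instead use fold abundance as character data, which this sketch does not cover.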

REFERENCES

1 Kabsch, W. and Sander, C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637
2 Gerstein, M. (1998) Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census. Proteins 33, 518–534
3 Lin, J. and Gerstein, M. (2000) Whole-genome trees based on the occurrence of folds and orthologs: implications for comparing genomes on different levels. Genome Res. 10, 808–818
4 Caetano-Anollés, G. and Caetano-Anollés, D. (2003) An evolutionarily structured universe of protein architecture. Genome Res. 13, 1563–1571
5 Wang, M., Yafremava, L. S., Caetano-Anollés, D., Mittenthal, J. E. and Caetano-Anollés, G. (2007) Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world. Genome Res. 17, 1572–1585
6 Wang, M., Boca, S. M., Kalelkar, R., Mittenthal, J. E. and Caetano-Anollés, G. (2006) A phylogenomic reconstruction of the protein world based on a genomic census of protein fold architecture. Complexity 12, 27–40

Figure S3 Universal phylogenomic trees of architectures reconstructed from a genomic census of protein domain structure

Trees of domain architectures at F (A) and FSF (B and C) levels were reconstructed from a protein domain census in 32, 185 and 584 genomes respectively ([5,6], and M. Wang, unpublished work). In all cases, the census involved identifying domains using PSI-BLAST or advanced HMMs of structural recognition, with different versions of SCOP as reference. The three evolutionary epochs of the protein world are overlaid on the trees and labelled with different shades (architectural diversification, light green; superkingdom specification, salmon; organismal diversification, yellow), following previous definitions [5]. Terminal leaves are not labelled since they would not be legible. Branches in red delimit the birth of architectures after the appearance of the first architecture unique to a superkingdom (broken line). The Venn diagrams show the occurrence of architectures in the three superkingdoms of life. Note the relative decrease in the number of FSF architectures corresponding to the organismal diversification epoch in the tree of (C), due to newly discovered FSFs described in the last release of SCOP.


Figure S4 Evolution of biological function in the protein world

The evolutionary timeline shows the discovery of protein FSF architectures associated with different functional SUPERFAMILY subcategories in each superkingdom, with time measured by a relative distance in nodes from a hypothetical ancestral architecture at the base of the tree of architectures (Supplementary Figure S3B). The number of architectures is given as a percentage of the total.
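The "relative distance in nodes" used as a time axis can be sketched as follows, assuming a rooted tree stored as a child-to-parent map. The function name, the toy tree and the choice to count internal nodes between the root and each leaf and then scale by the largest count are all assumptions made for illustration; the exact normalization used for the Figure is not stated here.

```python
def node_distances(parent: dict[str, str], leaves: list[str]) -> dict[str, float]:
    """Relative node distance (0 = closest to the root, 1 = most derived) per leaf."""

    def nodes_from_root(leaf: str) -> int:
        # Count internal nodes on the path from the leaf up to, but excluding, the root.
        count = 0
        node = parent[leaf]
        while node in parent:  # the root is the only node without a parent
            count += 1
            node = parent[node]
        return count

    raw = {leaf: nodes_from_root(leaf) for leaf in leaves}
    deepest = max(raw.values())
    # Assumption: rescale so the most derived leaf sits at distance 1.
    return {leaf: (n / deepest if deepest else 0.0) for leaf, n in raw.items()}


if __name__ == "__main__":
    # Toy rooted tree; A-D stand in for architectures and are not real FSFs.
    toy_parent = {
        "A": "root", "n1": "root",
        "B": "n1", "n2": "n1",
        "C": "n2", "D": "n2",
    }
    print(node_distances(toy_parent, ["A", "B", "C", "D"]))
```

Binning architectures by such a distance and counting those assigned to each functional subcategory would give a timeline of the kind plotted in the Figure.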
