Toward Automatic Reconstruction of a Highly Resolved Tree of Life

Science  03 Mar 2006:
Vol. 311, Issue 5765, pp. 1283-1287
DOI: 10.1126/science.1123061

This article has a correction.

We have developed an automatable procedure for reconstructing the tree of life with branch lengths comparable across all three domains. The tree has its basis in a concatenation of 31 orthologs occurring in 191 species with sequenced genomes. It revealed interdomain discrepancies in taxonomic classification. Systematic detection and subsequent exclusion of products of horizontal gene transfer increased phylogenetic resolution, allowing us to confirm accepted relationships and resolve disputed and preliminary classifications. For example, we place the phylum Acidobacteria as a sister group of δ-Proteobacteria, support a Gram-positive origin of Bacteria, and suggest a thermophilic last universal common ancestor.

Reconstructing the phylogenetic relationships among all living organisms is one of the fundamental challenges in biology. Numerous attempts to derive a tree of life using various methods have been published [for a review, see (1)], and whether such a tree exists in principle has recently been questioned (2, 3). Moreover, even under the assumption of a tree of life, numerous groupings and taxonomic entities remain heavily debated, and the advent of molecular and genomic data has increased the variety of classifications rather than reducing the problem (1). Theoretical and practical limits to reconstructing a tree of life have been put forward, such as the insufficient number of discriminating characters available, even in information-rich genomic data sets (4), and the computing resources required to cope with large numbers of species (1). Furthermore, some factors hamper accurate reconstruction of phylogenetic trees regardless of the methods used, such as sampling biases of the species included (5) and dilution of phylogenetic signal by horizontal gene transfer (HGT) (6), the extent of which is still extremely controversial (2, 3, 7). In addition to these difficulties, different data sets have been used with a variety of methods and parameter settings, making it almost impossible to compare the proposed results quantitatively. Hence, there is a need for a reproducible and updatable pipeline that reconstructs the tree of life from a commonly available data set, such as completely sequenced genomes. Here, we demonstrate the feasibility of such a tree construction and present a phylogeny based on an alignment of sufficient length and resolution to accurately calculate comparable branch lengths across all three domains of life. For this purpose, we have created a supermatrix of 31 concatenated, universally occurring genes with indisputable orthology in 191 species with completely annotated genomes (Fig. 1 and table S1). Although initial identification and analysis of these genes required considerable manual effort (8), the inclusion of additional species with completely annotated genomes can be automated as a pipeline (Fig. 1). Because the 31 universal genes are all involved in translation, we applied the same tree-building procedure to independent sets of domain-specific nontranslational genes (8).

Fig. 1.

Overview of the procedure. The white boxes represent the major steps for building the pan-domain phylogeny presented here. Steps in gray represent automatable parts of the procedure that need to be carried out for including further species. For the 31 clusters of orthologous groups (COGs) used in the analysis, we manually derived 1:1 orthologs by removing mitochondrial and chloroplast paralogs from corresponding multiple alignments. We built domain-specific alignments by using corresponding proteins encoded by the 31 orthologs and aligned the resulting profiles. With this procedure, we maximized the number of positions of the global alignment and reduced the number of misaligned residues. For a detailed description of the methods, see (8).

For the tree reconstruction, we mostly used standard approaches (Fig. 1) with the exception of a procedure for detection and selective exclusion of HGTs, which turned out to be essential for obtaining a highly resolved tree. We started with 36 genes universally present in all 191 species for which orthologs could be unambiguously identified (8) and eliminated five of them from the analysis (mostly tRNA synthetases) because they have undergone multiple horizontal transfers and/or were difficult to align (Fig. 1 and table S1). Although the 31 remaining genes are unlikely to have been subject to lateral transfers because they mainly encode ribosomal proteins (9), we systematically tested them for any HGT event not yet identified. We randomly allocated the 31 gene products into four groups, and for each group we derived the corresponding subsets of trees where each protein was in turn missing from the alignment (resampling with displacement: jackknife test). We subsequently checked for topological incongruence within each subset of trees and further tested candidate HGTs by two other independent measures (8). If at least one of these two measures could confirm the jackknife indication, the gene was considered horizontally transferred and removed from the corresponding alignment (Fig. 1 and table S2).
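The grouping and leave-one-out step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gene names are placeholders, the near-equal group sizes and the fixed random seed are our choices, and the tree building and incongruence checks themselves [see (8)] are not shown.

```python
import random


def allocate_groups(genes, n_groups=4, seed=0):
    """Randomly split the genes into n_groups groups of near-equal size.

    Near-equal sizes and the fixed seed are assumptions of this sketch.
    """
    shuffled = list(genes)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_groups] for i in range(n_groups)]


def jackknife_subsets(group):
    """Yield (left_out_gene, remaining_genes) for each gene in one group."""
    for i, gene in enumerate(group):
        yield gene, group[:i] + group[i + 1:]


# Placeholder names standing in for the 31 universal genes.
genes = [f"gene{i:02d}" for i in range(1, 32)]
groups = allocate_groups(genes)

# Map each left-out gene to the genes that would still be concatenated
# when building the corresponding jackknife tree.
subsets = {
    left_out: remaining
    for group in groups
    for left_out, remaining in jackknife_subsets(group)
}
```

Each entry of `subsets` corresponds to one tree that would be built without the left-out gene; a gene whose removal changes the topology within its group would then become an HGT candidate for further testing.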

Our approach [confirmed by single tree analysis (8)] detected a total of 7 HGT candidates [i.e., orthologous gene displacements (10)] among 31 orthologs from 191 species, with some species being involved in more than one HGT event (table S2). Three out of the four aminoacyl-tRNA synthetases (aa-RSs) used in this analysis have undergone HGT, including Valyl-RS (COG0525), which had been reported before (11), thus confirming the mobility of these enzymes (12). Clostridia is the only class that acquired ribosomal proteins by lateral transfer, likely in a single ancient event, because the displaced orthologs are present in all sequenced Clostridia species (table S2). To our knowledge, only one other horizontal transfer of ribosomal proteins has been reported so far (13). The identification of 7 HGTs in the 31 translation-related genes compares with the 30 (10 per domain) lateral transfers detected in domain-specific trees from 24 nontranslational genes (8).

Species-specific exclusion of HGTs and concatenation of all gene product alignments resulted in a supermatrix of 8090 positions for 191 species. This supermatrix was subsequently used to reconstruct the tree of life shown in Fig. 2 (14).
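A minimal sketch of how such a supermatrix could be assembled is given below. It assumes that each gene alignment is held as a mapping from species to an aligned sequence of equal length, and that an excluded gene product is represented by gap characters for the affected species only; both are illustrative choices, and the function name and toy sequences are hypothetical.

```python
def build_supermatrix(alignments, species, excluded):
    """Concatenate per-gene alignments into one row per species.

    alignments: {gene: {species: aligned_sequence}}
    excluded:   set of (gene, species) pairs flagged as HGT
    Excluded or missing entries are filled with gaps (an assumption).
    """
    rows = {sp: [] for sp in species}
    for gene in sorted(alignments):
        aln = alignments[gene]
        width = len(next(iter(aln.values())))  # alignment length of this gene
        for sp in species:
            seq = aln.get(sp)
            if seq is None or (gene, sp) in excluded:
                seq = "-" * width
            rows[sp].append(seq)
    return {sp: "".join(parts) for sp, parts in rows.items()}


# Toy data: two genes, three species, one species-specific exclusion.
alignments = {
    "geneA": {"sp1": "MKV", "sp2": "MRV", "sp3": "MKI"},
    "geneB": {"sp1": "GA", "sp2": "GS", "sp3": "GA"},
}
excluded = {("geneB", "sp2")}
matrix = build_supermatrix(alignments, ["sp1", "sp2", "sp3"], excluded)
```

Filling with gaps keeps every row the same length, so the excluded gene still contributes positions for all other species.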

Fig. 2.

Global phylogeny of fully sequenced organisms. The phylogenetic tree has its basis in a cleaned and concatenated alignment of 31 universal protein families and covers 191 species whose genomes have been fully sequenced (14). Green section, Archaea; red, Eukaryota; blue, Bacteria. Labels and color shadings indicate various frequently used subdivisions. The branch separating Eukaryota and Archaea from Bacteria in this unrooted tree has been shortened for display purposes.

The global tree topology was supported by two independent measurements. First, by using domain-specific subtrees from nontranslational genes, we could confirm the monophyly of all major divisions and reproduce most of their branching orders (8), albeit with weaker statistical support. The weaker support is due to lower sequence coverage and/or conservation as well as a higher number of excluded characters because of the higher incidence of HGTs (8). Second, three independent tests carried out on individual gene trees revealed that, although they are not identical, they share similarities with both the obtained tree of life and with each other (8). Although it may be possible to reject the null hypothesis of each of these tests without much difficulty, their combined evidence suggests that the gene trees have a cohesive phylogenetic signal.

Within the tree of life, as many as 65% of the branches are supported by a bootstrap proportion (BP) of 100%, and 81.7% have more than 80% BP support, enabling us to propose resolutions to debated classifications at both the root and the tips of the tree (Table 1). Although in Prokaryota statistical support for deeper branches is generally weaker than that for the recent ones, it is noteworthy that, within Bacteria, the Firmicutes appear to be the earliest branching phylum, in agreement with a proposed Gram-positive ancestor for all Bacteria (15) (Fig. 2 and Table 1). In our tree, Firmicutes are placed at the earliest division of Eubacteria with 66% BP support, and the remaining 33% of BP show at least a subclade of Firmicutes at the earliest division. This placement and the fact that the 15 slowest evolving taxa of the Bacteria are all Gram positive (8) support the theory of a Gram-positive origin of Bacteria. Furthermore, the thermophilic Firmicute Thermoanaerobacter tengcongensis is the taxon with the shortest overall phylogenetic distance to the root of Bacteria (Fig. 2) and as such is most likely to have retained ancestral states (16). Together with the fact that the slowest evolving, ancestral Archaea (table S7) are also (hyper)thermophilic (8), this lends support to the hypothesis that the last universal common ancestor was living at high temperatures.
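For readers unfamiliar with bootstrap proportions, the following sketch shows how such summary figures could be derived. It assumes that every bootstrap replicate tree has already been reduced to a set of clades (each clade a frozenset of taxon names); the function names and the toy replicates are hypothetical and do not reproduce the analysis in (8).

```python
def bootstrap_support(reference_clades, replicate_clade_sets):
    """Percentage of replicate trees that contain each reference clade."""
    n = len(replicate_clade_sets)
    return {
        clade: 100.0 * sum(clade in rep for rep in replicate_clade_sets) / n
        for clade in reference_clades
    }


def share_above(support, threshold):
    """Fraction of reference clades whose support exceeds the threshold."""
    return sum(bp > threshold for bp in support.values()) / len(support)


# Toy example with made-up taxa A to D and four replicate trees.
AB = frozenset({"A", "B"})
CD = frozenset({"C", "D"})
ABC = frozenset({"A", "B", "C"})
replicates = [{AB, CD}, {AB, CD}, {AB, ABC}, {AB, ABC}]

support = bootstrap_support([AB, CD], replicates)
well_supported = share_above(support, 80)
```

In this toy case the clade of A and B occurs in every replicate, whereas the clade of C and D occurs in only half of them, so half of the reference clades exceed the 80% threshold.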

Table 1.

Noteworthy selected features of the tree of life phylogeny that are novel, debated, or difficult to reproduce according to current literature. An extended version of the table is available as table S6. In the case of Firmicutes as the earliest branching bacterial phylum, it is noteworthy that the remaining 33% of the BP show at least a subclade of the Firmicutes at the earliest division.

Domain | Topological feature | BP (%)
Eukaryota | Coelomata hypothesis | 100
Eukaryota | Amoebozoa related to Opisthokonta | 41
Eukaryota | Deep branching of Diplomonadida | 100
Eubacteria | Relationships within phyla |
Eubacteria | Separation between β- and γ-Proteobacteria | 100
Eubacteria | Disruption of Chroococcales monophyly | 100
Eubacteria | Disruption of Actinomycetales monophyly | 100
Eubacteria | Acidobacteria-Proteobacteria clade | 98
Eubacteria | Cluster of F. succinogenes next to the Chlorobium-Bacteroidales (Sphingobacteria hypothesis) | 62
Eubacteria | Cluster of F. nucleatum with hyperthermophilic Bacteria | 36
Eubacteria | Relationships between phyla |
Eubacteria | Grouping of Chlamydiae, Spirochetes, Actinobacteria, and Bacteroidales-Chlorobi | 67
Eubacteria | Grouping of Cyanobacteria, hyperthermophilic, and Deinococcales-Chloroflexi | 51
Eubacteria | Relationships between super-phyla |
Eubacteria | Grouping of Proteobacteria with Cyanobacteria, hyperthermophilic, and Deinococcales-Chloroflexi | 74
Eubacteria | Deep branching of Firmicutes | 66
Archaebacteria | Relationship within phyla |
Archaebacteria | A. fulgidus with Halobacterium and Methanosarcina | 99
Archaebacteria | Relationship between phyla |
Archaebacteria | Nanoarchaea as a sister branch of Crenarchaea | 100

At the base of the Proteobacteria, the monophyletic Acidobacteria appear as a sister group to the δ-Proteobacteria (Fig. 2). The 64% BP support for this relationship indicates that the Acidobacteria may be a sixth divergent class within Proteobacteria. The Proteobacterial-Acidobacterial monophyly is supported with a BP of 98%, further raising the question whether Acidobacteria should indeed be an independent phylum (17).

Toward the tips of the tree, within Cyanobacteria, Synechococcus (sp. WH8102) groups with Prochlorococcales and Nostoc groups with Synechocystis, a result that has been supported by some ribosomal RNA (rRNA) studies (18) and challenges the classical order Chroococcales (19).

Within Archaea, the position of Nanoarchaeota remains debated [e.g., (20)]. We find (with 100% BP support) that they are a sister group of Crenarchaeota, without any indication of the reported HGTs from Crenarchaeota (20) in the core genes studied.

Within Eukaryota, our tree gives clear support for the classical Coelomata hypothesis that groups Arthropoda with Deuterostomia (chordates) in a monophyletic clade. This is in contrast to the “new animal phylogeny” that groups nematodes and arthropods into the monophyletic Ecdysozoa (21, 22). The ecdysozoan clade has been supported by small subunit rRNA and single-gene phylogenies [(23) and references therein] but has been rejected by a number of recent studies on the basis of genomic features and whole-genome phylogenies [(24) and references therein]. Current sampling biases and accelerated evolution of sequenced representatives of certain metazoan lineages (e.g., arthropods and nematodes) (Fig. 2) may factor in these results. This highlights the need for the sequencing of slow-evolving species (16), which may resolve such controversies in the tree.

Despite a highly resolved and robust tree, we cannot exclude a few uncertainties in tree topology due to biased species sampling or long branch attraction (LBA) (25). For example, the grouping of Thermotoga and Aquifex in our and other trees might be partially caused by their common thermophilic life-styles (26), whereas LBA might account for the placement of Diplomonadida (Giardia lamblia) as the most basal eukaryal taxon (Table 1).

The use of a common protein set across all three domains of life also ensures that the observed branch lengths are comparable across the entire tree. This enables, for example, an objective, quantitative analysis of the consistency of traditional taxonomic groupings (Fig. 3). As expected, the hierarchy of taxonomic groups correlates with phylogenetic diversity measured between and within them (e.g., species belonging to the same family have a shorter branch length distance than species belonging only to the same phylum). Within each taxonomic level, branch length distances vary considerably (27), apparently owing to factors that influence substitution rates, such as differences in life-style or population size. However, even when taking this effect into account, we observe a strong discrepancy between taxonomic divisions within Eukaryota and Prokaryota (Fig. 3A). Organisms that have been assigned to separate phyla in Eukaryota would clearly belong to the same phylum in the prokaryotic classification. Historically, eukaryotes have obviously been given more taxonomic resolution than prokaryotes, a testament to their greater morphological diversity.
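The branch length distance between two taxa is the sum of the branch lengths along the path that connects them in the tree. The sketch below computes it for a small rooted tree stored as a child-to-parent mapping; the representation, the function names, the taxon names, and the branch lengths are all made up for illustration and are not taken from Fig. 2.

```python
def path_to_root(tree, node):
    """Return {ancestor: distance from node}, including the node itself.

    tree maps each non-root node to (parent, branch_length).
    """
    distance = 0.0
    distances = {node: 0.0}
    while node in tree:
        parent, length = tree[node]
        distance += length
        distances[parent] = distance
        node = parent
    return distances


def branch_length_distance(tree, a, b):
    """Sum of branch lengths on the path between taxa a and b."""
    up_a = path_to_root(tree, a)
    up_b = path_to_root(tree, b)
    shared = set(up_a) & set(up_b)
    # The closest shared ancestor gives the smallest combined distance.
    return min(up_a[n] + up_b[n] for n in shared)


# Toy tree with made-up branch lengths.
tree = {
    "taxonA": ("inner", 0.125),
    "taxonB": ("inner", 0.125),
    "inner": ("root", 0.5),
    "taxonC": ("root", 1.0),
}
close_pair = branch_length_distance(tree, "taxonA", "taxonB")
distant_pair = branch_length_distance(tree, "taxonA", "taxonC")
```

Grouping such pairwise distances by the taxonomic rank that two taxa share is what makes the comparison between taxonomic levels and domains possible.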

Fig. 3.

Global analysis of branch length information. (A) Average sequence divergence within taxonomic classification units. Each data point denotes a pairwise comparison of two taxa, relating their intertaxa branch-length distance (i.e., sequence divergence) with their level of relatedness according to the National Center for Biotechnology Information taxonomy (“taxonomy distance”). Horizontal bars denote 95% intervals and medians of the data. Some minor taxonomy hierarchy levels have been omitted. Marked items: (Point a) Homo sapiens versus Pan troglodytes. The sequence divergence between human and chimp is low; they most likely would have been assigned to the same genus if they had been prokaryotes [see also (30) for a proposed revision]. (Point b) Synechococcus (sp. WH8102) versus Prochlorococcus marinus 9313. The two species are annotated as distinct orders, but they appear quite closely related, challenging the classical order of Chroococcales. (B) Evolutionary speed and genome size. For each taxon, the cumulative branch length from the tip to the root is plotted against genome size (measured here as number of genes). Dashed lines are linear regressions.

Another universal trend is that smaller genomes evolve faster [i.e., have longer branch lengths (Fig. 3B)]. This has been noted before for pathogenic or endosymbiotic organisms with reduced genomes, and it is easily explained because they have only limited capabilities to remove mutations by means of recombination or DNA repair (28, 29). However, we observe this trend also for genomes of larger sizes, including free-living prokaryotes and eukaryotes. Intriguingly, there is not a single organism sequenced that is fast-evolving and has a large genome (Fig. 3B). This suggests that the coupled processes of genome reduction and evolutionary acceleration may be irreversible: Genomes apparently do not grow again after a prolonged phase of genome reduction.
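The trend can be quantified with an ordinary least-squares fit of tip-to-root branch length against gene count. The sketch below uses invented numbers purely to show the calculation; whether the regressions in Fig. 3B were fitted on linear or transformed axes is not stated here, so plain linear axes are an assumption.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented values: gene counts and tip-to-root branch lengths.
gene_counts = [1000, 2000, 3000, 4000]
root_to_tip = [2.0, 1.5, 1.0, 0.5]

slope, intercept = linear_fit(gene_counts, root_to_tip)
```

A negative slope, as in this toy data set, corresponds to the reported pattern that taxa with fewer genes have longer cumulative branch lengths.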

The pan-domain phylogeny that resulted from the procedure presented here will increase in resolution with more species being sequenced. This updatable reference phylogeny of completely sequenced species allows accurate comparisons of branch lengths across domains. The resulting tree of life will be an invaluable tool in many areas of biological research, ranging from classical taxonomy, via studies on the rate of evolution, to environmental genomics where DNA fragments of unknown phylogenetic origin need to be assigned.

Supporting Online Material

Materials and Methods

SOM Text

Figs. S1 to S3

Tables S1 to S7

References and Notes
