Phylogenetic Signal in the Eukaryotic Tree of Life


Science  04 Jul 2008:
Vol. 321, Issue 5885, pp. 121-123
DOI: 10.1126/science.1154449


Molecular sequence data have been sampled from 10% of all species known to science. Although it is not yet feasible to assemble these data into a single phylogenetic tree of life, it is possible to quantify how much phylogenetic signal is present. Analysis of 14,289 phylogenies built from 2.6 million sequences in GenBank suggests that signal is strong in vertebrates and specific groups of nonvertebrate model organisms. Across eukaryotes, however, although phylogenetic evidence is very broadly distributed, for the average species in the database it is equivalent to less than one well-supported gene tree. This analysis shows that a stronger sampling effort aimed at genomic depth, in addition to taxonomic breadth, will be required to build high-resolution phylogenetic trees at this scale.

Reconstruction of the phylogenetic history of a large sample of life on Earth is nearly within reach. Molecular sequence data are available in GenBank for 10% of described species diversity (1); improvements in algorithms and high-performance computing technology have dramatically increased the scale of feasible phylogenetic inference (2); and unconventional sources of data, including whole genomes (3), expressed sequence tag libraries (4, 5), and barcode sequences (6), have altered the landscape of large-scale phylogenetics with an infusion of new evidence. The pace of phylogenetic discovery has accelerated to the point where nearly complete phylogenetic trees can be constructed for well-studied clades, such as mammals (7). Such high-resolution trees—those including all taxa for which data are available—permit strong inferences regarding problems ranging from conservation biology (8) to comparative biology (9) to reconstructing ancestral genomes (10). The phylogenetic distribution of species in GenBank is remarkably broad, as a visualization of the National Center for Biotechnology Information (NCBI) taxonomy tree (11, 12) shows (Fig. 1 and figs. S1 and S2). Construction of a high-resolution phylogenetic tree containing all eukaryotic species in the database is a grand challenge that is substantially more tractable than inferring the entire tree of life, but to succeed, strategies will have to overcome serious sampling impediments. Quantifying the distribution and strength of phylogenetic evidence currently in the database is a prerequisite for this effort.

Fig. 1.

Phylogenetic support across the NCBI taxonomy tree of eukaryotes. The tree displays 876 taxonomic orders. Not shown are 251 orders from the original selection of 1127 with fewer than four OTUs, which could not contain phylogenetically informative clusters by definition (13). Rectangles for each order are colored blue if they exhibit minimum phylogenetic support (1.5 support units or higher) or yellow if they do not. The radial length of these rectangles is proportional to the log of diversity (number of OTUs) within that order (black circles provide scale for diversity). Arcs are labeled with selected major eukaryotic clades. See fig. S1 for a high-resolution image of this figure showing all ordinal names. See the SOM for parallel results for a rank-free partition of the NCBI tree.

The NCBI taxonomy tree provides a convenient framework for organizing a series of phylogenetic analyses of the 181,992 eukaryotic taxa having sequence data, henceforth termed operational taxonomic units (OTUs) (13). I partitioned the NCBI tree into 1127 higher taxa at the rank of order for further analysis {table S1 [see (13) for an alternative rank-free partition, which yielded similar results]}. I then downloaded from release 1.01 of the PhyLoTA Browser database [(14, 15) on the basis of GenBank release 159] 14,289 potentially phylogenetically informative clusters of homologous sequences assembled for each higher taxon. Each cluster has a minimum of four OTUs, which is necessary to provide resolution in an unrooted tree. Unrooted phylogenetic trees were constructed for each cluster with a fast but conservative (16, 17) procedure taking both alignment and phylogenetic uncertainty into account. Any clade in the resulting tree will have had at least 50% bootstrap support in maximum parsimony “fast” bootstrap analyses (18) with two different sequence alignment algorithms (19, 20). Although this protocol biases the confidence assessment slightly downward, the bias is small (13). Of greater concern is that the sequence data used here are enriched for taxonomic diversity to the relative exclusion of some high-throughput genomics data (13), which, though presently available for only a small fraction of eukaryotic taxa, ultimately should enable stronger phylogenetic inferences.
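
The filtering logic in this protocol can be sketched in a few lines. The Python below is only an illustration under stated assumptions, not the pipeline used in the study. It assumes that each clade is represented as a bipartition (split) of the OTUs, that bootstrap frequencies have already been computed separately for each of the two alignments, and that a split is kept only if it reaches 50% under both. The names `canonical`, `merged_consensus`, and `max_internal_splits` are hypothetical; the actual merging procedure is described in (13).

```python
from typing import Dict, FrozenSet, Iterable, Set

Split = FrozenSet[str]  # one side of a bipartition of the OTUs (hypothetical representation)


def canonical(side: Iterable[str], taxa: Iterable[str]) -> Split:
    """Return the side of the split that contains the alphabetically first taxon,
    so that the same bipartition always gets the same key."""
    side_set = frozenset(side)
    taxa_set = frozenset(taxa)
    return side_set if min(taxa_set) in side_set else taxa_set - side_set


def max_internal_splits(n_otus: int) -> int:
    """Nontrivial splits in a fully resolved unrooted tree: n - 3.
    With fewer than four OTUs there is nothing to resolve."""
    return max(n_otus - 3, 0)


def merged_consensus(boot_a: Dict[Split, float],
                     boot_b: Dict[Split, float],
                     threshold: float = 0.5) -> Set[Split]:
    """Keep splits whose bootstrap frequency is >= threshold under BOTH alignments."""
    return {s for s, freq in boot_a.items()
            if freq >= threshold and boot_b.get(s, 0.0) >= threshold}


# Made-up bootstrap frequencies for a five-OTU cluster under two alignments.
taxa = frozenset("ABCDE")
boot_a = {canonical({"A", "B"}, taxa): 0.9, canonical({"D", "E"}, taxa): 0.6}
boot_b = {canonical({"C", "D", "E"}, taxa): 0.8, canonical({"D", "E"}, taxa): 0.4}
kept = merged_consensus(boot_a, boot_b)  # only the AB|CDE split survives both tests
```

Requiring support under both alignments is what makes the procedure conservative: a clade that depends on one particular alignment is dropped rather than counted.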

Phylogenetic support for each cluster was measured by the fraction of clades resolved on the final alignment-merged bootstrap consensus tree [its consensus fork index (13)]. Phylogenetic support for each OTU in the NCBI tree was measured by the sum of the support measures of all phylogenetically informative sequence clusters that contained it. Finally, phylogenetic support for a higher taxon, H, was measured by the mean support score of the OTUs contained within it, which is the weighted sum $\frac{1}{n_H} \sum_k w_k(H)\, n_k(H)$, where $w_k(H)$ is the support score for cluster k in higher taxon H, $n_k(H)$ is the number of OTUs in cluster k, and $n_H$ is the number of OTUs in H. This support score was selected among many possibilities in part because of its relative insensitivity to the size of the higher taxon [Fig. 2; see supporting online material (SOM) text]. For comparative purposes and to aid in the visualization of results, an arbitrary cutoff value of 1.5 was selected as minimal phylogenetic support. This is equivalent, for example, to the information content of two independent loci, each resolving three-quarters of clades to at least a bootstrap value of 51%.
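
The three levels of support defined above (cluster, OTU, higher taxon) can be illustrated with a short Python sketch. This is not the code used in the study. The function names are hypothetical, and normalizing the consensus fork index by n - 3 (the number of internal branches in a fully resolved unrooted tree) is an assumption; the exact definition is given in (13). The example data are invented to reproduce the 1.5 cutoff from two loci that each resolve three-quarters of clades.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Cluster = Tuple[float, Set[str]]  # (support score of the cluster, OTUs it contains)


def consensus_fork_index(resolved_clades: int, n_otus: int) -> float:
    """Fraction of clades resolved; assumes an unrooted tree with n - 3 possible clades."""
    if n_otus < 4:
        raise ValueError("a cluster needs at least four OTUs to be informative")
    return resolved_clades / (n_otus - 3)


def otu_support(clusters: List[Cluster]) -> Dict[str, float]:
    """Support of an OTU: sum of the support scores of all clusters containing it."""
    scores: Dict[str, float] = defaultdict(float)
    for w, members in clusters:
        for otu in members:
            scores[otu] += w
    return dict(scores)


def taxon_support(clusters: List[Cluster], n_h: int) -> float:
    """Mean OTU support in a higher taxon with n_h OTUs: (1 / n_h) * sum of w_k * n_k."""
    return sum(w * len(members) for w, members in clusters) / n_h


# Invented example: a higher taxon of 8 OTUs sampled for two loci,
# each locus covering all 8 OTUs and resolving three-quarters of clades.
otus = {f"otu{i}" for i in range(8)}
two_loci = [(0.75, otus), (0.75, otus)]
print(taxon_support(two_loci, n_h=len(otus)))  # 1.5
```

Because OTUs absent from every cluster contribute zero, dividing by the full count of OTUs in the taxon (rather than by the number of sampled OTUs) penalizes patchy taxon sampling as well as poorly resolved trees.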

Fig. 2.

Dependence of mean phylogenetic support on ordinal diversity. Mean phylogenetic support in taxonomic orders is not dependent on the diversity of these taxa [n = 876, P = 0.18 (not significant), R² = 0.002; the solid line is a linear regression, which is slightly curved in the graph's log scale], which suggests that the variation seen across eukaryotes is due not so much to the size of the order but to its phylogenetic position. The horizontal dotted line is at a support value of 1.5 units, the level corresponding to minimum phylogenetic support in this study. Vertical dotted lines are placed at diversities of 10, 100, and 1000 OTUs to correspond with the diversity scale in Fig. 1. Taxon names for the 10 highest-scoring orders are indicated. Two are land plants (Canellales, an angiosperm; and Haplomitriales, a liverwort order); the other eight are all vertebrates (two bird, one amphibian, and five mammalian orders).

Among individual OTUs, Homo sapiens had the maximum support value of 293.9, but the distribution of scores had a long tail leading to 6402 OTUs with no support at all (most of which, 6079, simply were not found in any phylogenetically informative clusters). The top 10 were all mammals; the top 25 were mammals, angiosperms (tomato, potato, tobacco, rice, and wheat), Drosophila melanogaster, and Drosophila simulans, all with support scores above 60 units. Of the 171,703 OTUs for which scores were calculated, only 12% achieved minimal phylogenetic support. The mean support was 0.84, less than the equivalent of each taxon being found in at least one well-resolved and -supported phylogenetic tree.

The mean support values of orders were skewed (tables S4 and S5), ranging from a maximum of 10.0 in primates to 0.0 in 75 other orders (mean = 0.93 among the 876 with at least four OTUs; mean number of clusters = 1.88). Fig. 1 shows the phylogenetic position of orders with minimal support, along with each order's species richness. A very similar picture emerges in the rank-free partition of the NCBI tree (fig. S2). In the ordinal partition, only 14% of higher taxa achieved minimal support. The support within those orders in the NCBI tree that happened to be species-rich (≥100 OTUs) provides a good indication of potential for high-resolution tree inference (tables S2 and S3). Their phylogenetic position appears nonrandom. In vertebrates, 16 of 57 species-rich orders had minimal support, but across arthropods only 1 out of 45 did, the acalyptrates (containing Drosophila), and none of the other species-rich metazoan orders did. Fungi, with 40 diverse orders, had only 1 order that achieved minimal phylogenetic support; and angiosperms, with 45 species-rich orders, had only 3, still far short of vertebrate performance. Many areas of the NCBI tree, including the vast diversity of metazoa that are neither vertebrate nor arthropod, as well as the diversity of microbial eukaryotes, have few orders, species-rich or -poor, that achieved minimal phylogenetic support. Some taxa with surprisingly low support exemplify how biological diversity can overwhelm substantial and sustained phylogenetic efforts. Examples include the legumes and grasses in angiosperms, two groups containing the bulk of the most economically important and well-studied plants on Earth (21, 22); and the huge subgroup of ascomycete fungi, the Pezizomycotina (23, 24), containing numerous plant and animal pathogens, sources of antibiotics such as penicillin, and organisms used in human food production.

The finding that the average eukaryotic species or higher taxon in GenBank has a phylogenetic support score of less than 1.0 units (10 times less than the best-supported vertebrate orders) has several implications. An accurate high-resolution phylogeny will require substantial increases in sequence data to bring that score to a level comparable to that of the best-supported higher taxa. Although improved phylogenetic inference tools, such as new methods of inferring species trees from collections of gene trees (25, 26), may ultimately extract more power from the same quantity of data, new sampling strategies will also be needed both to acquire and warehouse specimens for DNA work (such as in DNA banks at major natural history collections) and to survey the largest number of relevant genomic sequences per sample. Sampling efficiency can be improved dramatically by targeting the addition of new sequences to the right clusters. One target is the large number of currently phylogenetically uninformative clusters that would become informative with the addition of just a few sequences. Loci can also be targeted in newly acquired species by paying attention to the size and support value of clusters already constructed for related taxa, a practice followed informally by systematists but which can now be quantified with some precision (13). For example, in the angiosperm clade Solanales, the five clusters contributing most to support are, in decreasing order, nuclear ribosomal DNA internal transcribed spacers, plastid ndhF, nuclear GBSS, the plastid trnL spacer region, and plastid rbcL, so these are obvious targets for further sampling. Recent advances in applying information theory may make possible more nuanced sampling algorithms that take into account cluster sequence variation, number of taxa, and phylogenetic depth of the tree (27). None of these considerations address the difficult sampling issue of undescribed species, most of which lie in the regions of Fig. 1 that are already least well-supported. These are not just absent from GenBank and Fig. 1; they are unknown to science. They can only be added through more biodiversity surveys and alpha taxonomic work. In the meantime, sampling protocols guided by quantitative assessments of the phylogenetic distribution of data will improve the efficiency of emerging phylogenomic strategies for building the tree of life of known organisms.
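
The two targeting ideas discussed above, ranking existing clusters by their contribution to support and flagging uninformative clusters that are only a few sequences short of the four-OTU minimum, might be prototyped along the following lines. This is a minimal sketch with invented data. The `Cluster` record, the function names, and the use of support multiplied by cluster size as the contribution measure are assumptions made for illustration; the quantification actually used is described in (13).

```python
from typing import List, NamedTuple


class Cluster(NamedTuple):
    locus: str      # hypothetical locus label
    n_otus: int     # number of OTUs with a sequence in this cluster
    support: float  # cluster support score (0.0 if the cluster is uninformative)


MIN_OTUS = 4  # minimum cluster size for an informative unrooted tree


def rank_by_contribution(clusters: List[Cluster], top: int = 5) -> List[Cluster]:
    """Informative clusters ordered by support * size, largest contribution first."""
    informative = [c for c in clusters if c.n_otus >= MIN_OTUS]
    return sorted(informative, key=lambda c: c.support * c.n_otus, reverse=True)[:top]


def nearly_informative(clusters: List[Cluster], max_missing: int = 2) -> List[Cluster]:
    """Uninformative clusters that need at most max_missing more OTUs to reach MIN_OTUS."""
    return [c for c in clusters if 0 < MIN_OTUS - c.n_otus <= max_missing]


# Invented clusters for one higher taxon.
clusters = [
    Cluster("locus_A", 40, 0.6),
    Cluster("locus_B", 10, 0.9),
    Cluster("locus_C", 3, 0.0),
    Cluster("locus_D", 1, 0.0),
    Cluster("locus_E", 25, 0.2),
]
targets = rank_by_contribution(clusters, top=2)       # loci worth sequencing in new species
quick_wins = nearly_informative(clusters, max_missing=2)  # clusters close to the threshold
```

Ranking by support times size favors loci that are both widely sampled and well resolved, which mirrors the informal practice of choosing loci that have already worked in related taxa.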

Supporting Online Material

Materials and Methods

Figs. S1 to S3

Tables S1 to S5

References and Notes
