Research Article

Animal Evolution and the Molecular Signature of Radiations Compressed in Time

Science  23 Dec 2005:
Vol. 310, Issue 5756, pp. 1933-1938
DOI: 10.1126/science.1116759

This article has a correction.

Abstract

The phylogenetic relationships among most metazoan phyla remain uncertain. We obtained large numbers of gene sequences from metazoans, including key understudied taxa. Despite the amount of data and breadth of taxa analyzed, relationships among most metazoan phyla remained unresolved. In contrast, the same genes robustly resolved phylogenetic relationships within a major clade of Fungi of approximately the same age as the Metazoa. The differences in resolution within the two kingdoms suggest that the early history of metazoans was a radiation compressed in time, a finding that is in agreement with paleontological inferences. Furthermore, simulation analyses as well as studies of other radiations in deep time indicate that, given adequate sequence data, the lack of resolution in phylogenetic trees is a signature of closely spaced series of cladogenetic events.

Detailed knowledge of the phylogenetic relationships among Metazoa and their eukaryotic relatives is critical for understanding the history of life and the evolution of molecules, phenotypes, and developmental mechanisms. Currently, with the exception of the well-resolved phylogenetic history of the deuterostomes (1), the relationships between and within protostome and diploblastic metazoan phyla remain unresolved (2-5). The uncertainty surrounding metazoan relationships may result from analytical and biological factors such as insufficient amounts of available sequence data, mutational saturation, the occurrence of unequal rates of evolution between lineages, or the rapidity with which metazoan phyla diversified (3-7).

Recent investigations concerning two critical variables of phylogenetic experimental design—the number of taxa and amount of data used—have guided our approach to metazoan relationships. It has been shown that taxon number may not be as critical a determinant of phylogenetic accuracy (8, 9) as the choice of taxa (10). Thus, to investigate relationships among phyla at the base of the metazoan tree and within protostomes, we selected metazoans and closely related eukaryotes that included representatives from choanoflagellates, poriferans (one representative from each of the three poriferan classes), cnidarians (one representative from each of two of the three cnidarian classes), platyhelminths (two representatives), priapulids, annelids, mollusks, arthropods, nematodes, urochordates, and vertebrates (three representatives) (all taxa are listed in table S1).

The use of single or few genes is now recognized to be insufficient for the confident resolution of many clades (4, 11, 12). In contrast, analyses of larger amounts of data have robustly resolved relationships in many taxonomic groups (11-14), even after allowance for a high percentage of missing data (12-14). Thus, to increase resolution of metazoan relationships, we used experimental and bioinformatic approaches to assemble a data matrix composed of 50 genes from the 17 selected taxa (15). Gene sequences from five key taxa were obtained through an automated polymerase chain reaction and sequencing approach we devised for the systematic amplification of large amounts of gene sequence data from cDNA of any metazoan (15) (table S2). Gene sequences from the 12 other taxa were retrieved through bioinformatic means from public databases (15).

A 50-gene data matrix does not resolve relationships among most metazoan phyla. Despite the large amount of data from taxa spanning the Metazoa, analyses of this data matrix and subsets thereof under maximum likelihood (ML) and maximum parsimony (MP) (15) still failed to resolve most relationships (Fig. 1 and fig. S1). There was no significant support (defined as >70% bootstrap support) for the order of relationships between early branching metazoans or within protostomes. Resolution of well-established “superclades” (protostomes, bilaterians, vertebrates, and deuterostomes; and metazoans with choanoflagellates as an outgroup) was attained with moderate to high support, depending on the optimality criterion used. The recovery of these superclades suggests that our failure to resolve relationships within them is either due to aspects of our experimental design [such as systematic error artifacts resulting from violation of phylogenetic assumptions (16)] or may reflect a prevailing limit to the resolution of certain clades in deep time.
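The notion of bootstrap support used above can be illustrated with a minimal sketch for the simplest case of four taxa. It is not the analysis pipeline used in this study (15): the function names, the toy alignment, and the assumption of gap-free sequences are all hypothetical. For four taxa, the MP tree is the topology favored by the largest number of parsimony-informative sites, so bootstrap support is the fraction of column-resampled replicates in which a topology wins outright.

```python
import random

TOPOLOGIES = ("AB|CD", "AC|BD", "AD|BC")


def site_pattern(a, b, c, d):
    """Return the quartet topology favored by one column, or None if uninformative."""
    if a == b and c == d and a != c:
        return "AB|CD"
    if a == c and b == d and a != b:
        return "AC|BD"
    if a == d and b == c and a != b:
        return "AD|BC"
    return None


def bootstrap_support(seqs, replicates=1000, seed=0):
    """Nonparametric bootstrap for a gap-free four-taxon alignment (hypothetical helper).

    seqs: four equal-length strings for taxa A, B, C, D.
    Returns the fraction of replicates in which each topology wins strictly;
    tied replicates count for no topology.
    """
    rng = random.Random(seed)
    labels = [site_pattern(*column) for column in zip(*seqs)]
    n = len(labels)
    wins = dict.fromkeys(TOPOLOGIES, 0)
    for _ in range(replicates):
        counts = dict.fromkeys(TOPOLOGIES, 0)
        # Resample n columns with replacement.
        for _ in range(n):
            label = labels[int(rng.random() * n)]
            if label is not None:
                counts[label] += 1
        best = max(counts.values())
        winners = [t for t in TOPOLOGIES if counts[t] == best]
        if len(winners) == 1:
            wins[winners[0]] += 1
    return {t: wins[t] / replicates for t in TOPOLOGIES}


# Toy alignment: 12 sites favor AB|CD, 3 favor AC|BD, 2 favor AD|BC, 10 are constant.
columns = ["LLVV"] * 12 + ["LVLV"] * 3 + ["LVVL"] * 2 + ["KKKK"] * 10
toy = ["".join(col[i] for col in columns) for i in range(4)]
support = bootstrap_support(toy)
print(support)
```

Under the threshold adopted here, a grouping would count as significantly supported only if its value exceeded 0.70. When informative sites are spread almost evenly among the three topologies, no topology reaches that threshold, however many sites are sampled.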

Fig. 1.

The lack of resolution in phylogenetic relationships among major metazoan phyla. Values above internodes correspond to support values from ML and MP analyses, respectively. Only internodes with significant support in at least one of the two analyses (ML and MP) or internodes present in majority-rule consensus trees of both analyses are drawn. Analyses were also performed by Bayesian inference (15) (fig. S1). Although certain analyses provided strong support for particular clades, analyses of different subsets of taxa produced significantly different and conflicting results (table S3).

One recurrent problem for phylogenetic inference in deep time is the phenomenon of long branch attraction (17), under which unrelated taxa with long branches can artifactually be placed together. To test whether the inclusion of taxa with long branches [as visually identified on the ML tree (fig. S1)] had an effect on support values, we analyzed the data matrix, excluding long-branched taxa singly or in combination. For example, a clade joining platyhelminths and nematodes received moderate support, but all taxa in the clade are characterized by conspicuously long branches (fig. S1). The removal of long-branched taxa had a negligible effect on support for most internodes (table S3). For example, although support for protostomes and bilaterians did increase after the exclusion of nematode and platyhelminth taxa, the resolution of nodes within these superclades was not improved.

Clade support values can be very sensitive to the presence of “rogue” taxa whose placement on the tree may be unstable (17). To test whether the presence of rogue taxa could be responsible for the low support in many internodes of the metazoan phylogeny, the least stable metazoan taxa in this data matrix were identified using leaf stability indices (15). Removal of these taxa had a negligible effect on support (table S3). Furthermore, tests of additional parameters, such as deviations in amino acid composition, did not account for the lack of resolution (15). These results suggest that the low support values obtained are not due to the instability (or deviation) of a small subset of taxa but are the result of a systemic lack of support for relationships among most taxa.

Because the choice of taxa did not account for the lack of resolution in many key branches of the metazoan tree, we then considered two potential analytical explanations: the amount of missing data contained in the data matrix and the total amount of data used. By necessity of experimental design, the data matrix lacked, on average, 20% of the potential data per taxon (table S4). However, large data sets can be surprisingly tolerant to a high fraction of missing data (13, 14, 18), and reanalyses of the data matrix excluding the priapulid and the mollusk (the two taxa with the highest percentages of missing data; 68 and 54%, respectively) did not lead to noticeable changes in support (table S3). A second potential explanation may be that the data matrix still contains too few informative characters to robustly resolve phylogenetic relationships among protostomes and early branching metazoans. However, sequence variation is abundant between metazoan taxa, with 56% of the 12,060 amino acid sites being variable and 31% of the sites being parsimony-informative (Table 1). Furthermore, MP site pattern and ML mapping (15) analyses suggest that the differences in resolution between clades do not result from the number of informative sites per se, but in how these sites are distributed among alternative topologies (fig. S2). In agreement with these results, two- to eightfold increases in the number of characters resampled by bootstrapping (15) led to small improvements in the resolution of most internodes (fig. S3 and table S3). These data suggest that neither the percentage of potential data missing nor the total amount of data in this data matrix can explain the lack of resolution among protostomes and early branching metazoans.

Table 1.

Statistical attributes of the amino acid sequence data matrix. Numbers of variable, parsimony-informative, and singleton sites for the 50-gene data matrix are shown, including 16 metazoan, 1 choanoflagellate, and 15 fungal taxa. Percentages are reported in parentheses. All statistical attributes for the metazoan taxon set were calculated with choanoflagellates included. The mean observed distance (± standard deviation) corresponds to the average proportion of amino acid sites that are different in all pairwise sequence comparisons in a taxon set. The mean estimated distance (± standard deviation) corresponds to the ML-estimated average proportion of amino acid sites that are different in all pairwise sequence comparisons in a taxon set (15).

Taxon set  Number of sites  Variable sites  Informative sites  Singleton sites  Observed distance (%)  Estimated distance (%)
All taxa   12060            8257 (68%)      6669 (55%)         1588 (13%)       29.2 ± 6.7             35.7 ± 11.1
Metazoa    12060            6782 (56%)      3701 (31%)         3080 (26%)       21.8 ± 4.6             23.4 ± 6.2
Fungi      12060            6533 (54%)      5015 (42%)         1518 (13%)       27.1 ± 5.8             31.7 ± 8.2
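
The site categories and the observed distance in Table 1 can be sketched as follows. This is a minimal illustration, not the software used to build the table (15). It assumes that a singleton site is a variable site that is not parsimony-informative, which is consistent with the "All taxa" row (6669 + 1588 = 8257), and that the characters "-", "?", and "X" mark missing data; both assumptions, the function names, and the toy sequences are hypothetical.

```python
from collections import Counter
from itertools import combinations

MISSING = set("-?X")  # assumed codes for gaps and unknown residues


def classify_sites(seqs):
    """Count constant, parsimony-informative, singleton, and variable columns."""
    counts = {"constant": 0, "informative": 0, "singleton": 0}
    for column in zip(*seqs):
        states = Counter(ch for ch in column if ch not in MISSING)
        if len(states) < 2:
            counts["constant"] += 1
        elif sum(1 for n in states.values() if n >= 2) >= 2:
            # At least two states, each present in at least two taxa.
            counts["informative"] += 1
        else:
            counts["singleton"] += 1
    counts["variable"] = counts["informative"] + counts["singleton"]
    return counts


def mean_observed_distance(seqs):
    """Mean proportion of differing sites over all pairwise comparisons.

    Pairs that share no scored site are skipped; at least one comparable
    pair is required.
    """
    distances = []
    for s, t in combinations(seqs, 2):
        pairs = [(a, b) for a, b in zip(s, t)
                 if a not in MISSING and b not in MISSING]
        if pairs:
            distances.append(sum(a != b for a, b in pairs) / len(pairs))
    return sum(distances) / len(distances)


toy = ["MKVLA", "MKVLG", "MRVIG", "MRTIG"]
print(classify_sites(toy))
print(round(100 * mean_observed_distance(toy), 1), "% mean observed distance")
```

The mean estimated distance in Table 1 is a different quantity: it is obtained under an ML model of amino acid substitution (15) and is not reproduced by this sketch.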

A remarkable contrast in phylogenetic resolution between two kingdoms. Given the time since the origin of Metazoa, another hypothesis is that mutational saturation (19-21) may have erased the phylogenetic signal originally contained in proteins' variable sites. Alternatively, the lack of resolution may be the signature of a closely spaced series of cladogenetic events occurring early in the evolution of Metazoa (7). One means of testing these alternative explanations is by comparing the phylogeny of Metazoa to that of their natural sister kingdom, the Fungi (22). The validity of this comparison rests on the inference that both lineages originated within approximately the same geological time frame, which is supported by the fossil record of both Fungi (23) and Metazoa (24, 25), particularly recent finds in the Doushantuo Formation (551 to 635 million years old) (23, 26, 27), as well as molecular clock analyses in which multiple representatives of both kingdoms are included (28-30).

The availability of genome sequence data from many species spanning the fungal kingdom enabled us to sample exactly the same type and amount of data across Fungi as we did for Metazoa. We generated a data matrix containing 49 of the same 50 genes used for the metazoan phylogeny from a select set of 15 taxa representing most major taxonomic groups within Ascomycetes and Basidiomycetes (table S1). Examination of evolutionary distances and models of amino acid evolution for the 49 orthologs in each of the two kingdoms suggests that the tempo and mode of molecular evolution in this set of 49 genes has remained similar across the two kingdoms (table S5). Furthermore, comparisons of evolutionary distances within this set of fungi and within their metazoan counterparts suggest that both clades have undergone similar amounts of evolutionary change, with Fungi exhibiting slightly higher mean distances [mean observed/estimated distances ± standard deviation: Metazoa, 21.8 ± 4.6%/23.4 ± 6.2%; Fungi, 27.1 ± 5.8%/31.7 ± 8.2% (Table 1)], a finding consistent with a similar date of origin (table S6).

Phylogenetic analyses of the data matrix containing both Metazoa and Fungi showed a remarkable contrast in the resolution obtained within each of the two kingdoms. The fungal clade was robustly resolved, with the overwhelming majority of fungal internodes (11 out of 13) being significantly supported, irrespective of optimality criterion used (Fig. 2 and fig. S4). These relationships are generally in agreement with previous studies (31). In contrast, again only 4 of 14 metazoan internodes were significantly supported under both optimality criteria.

Fig. 2.

The contrast in phylogenetic resolution between the clades of Metazoa and Fungi. Values above internodes are as in Fig. 1. Eleven out of 13 internodes in the fungal clade are significantly supported by both optimality criteria (ML and MP), whereas only 4 out of 14 internodes in the metazoan clade are significant. Analyses were also performed by Bayesian inference (15) (fig. S4 and table S3).

The early history of Metazoa as a radiation compressed in time. The contrast in the resolution of the fungal and metazoan trees shows that neither the type nor the amount of data is a limit to the resolution of relationships within metazoan superclades. Therefore, the explanation for the sharp contrast in resolution may lie in differences in the tempo and pattern of cladogenesis within the kingdoms. One explanation for the contrasting resolution observed in the metazoan and fungal trees may be differences in “stemminess” (32): a measure of the relative length of internal versus external branches. Theoretical work indicates that the accuracy of reconstruction is higher for trees exhibiting high stemminess (that is, trees with longer internodes and shorter terminal branches) (32). In agreement with these studies, phylogenetic resolution is higher in the fungal tree, which is characterized by long internodes (Fiala and Sokal's stemminess index F = 0.201), and poorer in the metazoan clade, where internodes are much shorter (F = 0.121) (15). These differences in degree of stemminess between the two kingdoms are also reflected in the distributions of parsimony-informative sites (Fungi/Metazoa = 5015/3701 sites) and singleton sites (Fungi/Metazoa = 1518/3080 sites) along the branches of the two trees (Table 1).
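
A stemminess calculation can be sketched as follows. The sketch assumes the commonly described form of Fiala and Sokal's index: for every internal node other than the root, the length of the stem leading to the subclade is divided by the total branch length of the subclade (stem included), and the ratios are averaged. Whether the values reported above (F = 0.201 and F = 0.121) were computed in exactly this way is an assumption; the tree encoding, function names, and toy trees are hypothetical.

```python
def subtree_length(node):
    """Total branch length of a subclade, including the stem leading to it."""
    stem, children = node
    return stem + sum(subtree_length(child) for child in children)


def stemminess(root):
    """Mean over internal, non-root nodes of stem length / subclade length.

    A node is a (stem_length, children) tuple; a leaf has an empty child list.
    The tree must contain at least one internal node besides the root.
    """
    ratios = []

    def visit(node, is_root):
        stem, children = node
        if children and not is_root:
            ratios.append(stem / subtree_length(node))
        for child in children:
            visit(child, False)

    visit(root, True)
    return sum(ratios) / len(ratios)


def leaf(length):
    """Terminal branch of the given length."""
    return (length, [])


# Same terminal branches, long versus short internodes (illustrative values only).
stemmy = (0.0, [(2.0, [leaf(1.0), leaf(1.0)]), (2.0, [leaf(1.0), leaf(1.0)])])
bushy = (0.0, [(0.1, [leaf(1.0), leaf(1.0)]), (0.1, [leaf(1.0), leaf(1.0)])])
print(stemminess(stemmy), stemminess(bushy))
```

In this toy comparison, shortening the internodes while holding the terminal branches fixed lowers the index, which is the direction of the difference reported between the fungal and metazoan trees.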

These contrasts in resolution depth, stemminess, and distribution of site categories between the two kingdoms are consistent with a history of major metazoan lineages characterized by closely spaced (tempo) series of cladogenetic events (pattern). Paleontological evidence also suggests a rapid tempo of cladogenesis near the origin of Metazoa approximately 600 million years ago, with poriferans (26), cnidarians (33), and at least certain bilaterians (34) making their first appearance within 50 million years. Thus, inferences from these two independent lines of evidence (molecules and fossils) support a view of the origin of Metazoa as a radiation compressed in time.

Identifying the limits to resolution of cladogenetic events in deep time by simulation analysis. It has been proposed that, given adequate data, phylogenetic resolution for cladogenetic events of Cambrian age occurring as close as 1 million years apart will be achieved (35). If true, with the amount of data used here, the lack of observed resolution would indicate extreme compression of the metazoan radiation. Alternatively, the limit of resolution for internodes in deep time may be much larger than previously suggested.

To better understand the potential limits to the resolution of series of closely spaced cladogenetic events in deep time, and to explore how to interpret the lack of resolution when large amounts of data are available, we conducted a simulation analysis. The radiation of mammalian orders is particularly well suited for addressing this issue because it occurred within a small window of time (42 million years) an estimated 107 million years ago (36), with many internodes estimated to span between 1 and 10 million years in length. We simulated the effect of increasing the elapsed time since the radiation on the phylogenetic accuracy of internodes within this 42-million-year window, given adequate data and a rigorous model of sequence evolution (15, 20, 37). If the proposed limits of resolution are in fact very small, the degree of resolution for all internodes should not be affected, because all internodes are dated as 1 million years in length or longer. However, if the limits of resolution are greater than has been postulated, then the degree of resolution for several internodes should decrease as we move deeper in time.

The results of the simulations show a negative correlation between the amount of time elapsed and the accuracy with which internodes are resolved (Fig. 3 and fig. S5). For example, whereas almost all internodes in simulations assuming a 107-million-year time span are resolved near 100% accuracy (Fig. 3A), the accuracy of several internodes in data sets simulating the lapse of a 600-million-year time span is low (Fig. 3, B to E), contrary to predictions that data matrices of this size and properties should attain accuracy levels of 95% across all internodes (35). These results suggest that the actual limits of resolution for closely spaced events in deep time are larger than previously thought (38). To estimate the actual limit of resolution, we plotted the phylogenetic accuracy for each internode against the internode's length (in million years), assuming a 600-million-year time span. Results suggest that, even when very large data matrices are used, and under simulation assumptions that most likely represent the best of circumstances when compared to biological data, many internodes with lengths much larger than 1 million years are resolved with accuracies well below 50% (Fig. 4). Thus, the limit of resolution of large data sets in deep time may differ by an order of magnitude from previous estimates (35).
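
The logic of this simulation can be conveyed with a much-reduced sketch. It is not the procedure used here (15), which was based on the mammalian tree and a more rigorous model of sequence evolution. The sketch assumes a four-taxon tree ((A,B),(C,D)), a Poisson model with 20 equally frequent states, and branch lengths in expected substitutions per site; all parameter values and function names are hypothetical. The internode is held fixed while the terminal branches are lengthened to mimic additional elapsed time, and accuracy is the fraction of simulated data matrices for which MP recovers the true grouping.

```python
import math
import random

K = 20  # number of character states (amino acids), equal frequencies assumed


def change_probability(branch_length):
    """Probability that the end state differs from the start state (Poisson model)."""
    return (K - 1) / K * (1.0 - math.exp(-K / (K - 1) * branch_length))


def evolve(state, p_change, rng):
    """Evolve one site along one branch."""
    if rng.random() < p_change:
        other = int(rng.random() * (K - 1))  # uniform over the K - 1 other states
        return other if other < state else other + 1
    return state


def resolution_accuracy(internal, terminal, n_sites=1000, replicates=40, seed=1):
    """Fraction of replicates in which MP strictly favors the true AB|CD split."""
    rng = random.Random(seed)
    p_int, p_term = change_probability(internal), change_probability(terminal)
    correct = 0
    for _ in range(replicates):
        ab_cd = ac_bd = ad_bc = 0
        for _ in range(n_sites):
            left = int(rng.random() * K)      # ancestor of A and B
            right = evolve(left, p_int, rng)  # ancestor of C and D
            a, b = evolve(left, p_term, rng), evolve(left, p_term, rng)
            c, d = evolve(right, p_term, rng), evolve(right, p_term, rng)
            # Tally parsimony-informative patterns for each quartet topology.
            if a == b and c == d and a != c:
                ab_cd += 1
            elif a == c and b == d and a != b:
                ac_bd += 1
            elif a == d and b == c and a != b:
                ad_bc += 1
        if ab_cd > ac_bd and ab_cd > ad_bc:
            correct += 1
    return correct / replicates


shallow = resolution_accuracy(internal=0.05, terminal=0.1)
deep = resolution_accuracy(internal=0.05, terminal=3.0)
print("accuracy with short terminal branches:", shallow)
print("accuracy with long terminal branches:", deep)
```

With the internode unchanged, lengthening the terminal branches lets later substitutions overwrite the changes that occurred on the internode, so accuracy declines even though the number of sites is the same.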

Fig. 3.

Phylogenetic accuracy is inversely correlated with the length of time elapsed since a closely spaced series of cladogenetic events. A simulation analysis of increasing the age of origin of the 42-million-year window of mammalian order diversification is shown. (A) The best estimate of the mammalian phylogenetic tree at present under the molecular clock assumption (36) (time span of 107 million years). (B) The mammalian phylogenetic tree, assuming a 600-million-year time span. Branch lengths are shown in million-year time units. The topology and branch lengths within the 42-million-year window (left of the dashed grey line) of trees in (A) and (B) are identical. There is a compression in the lengths of internodes in the 42-million-year window of the tree in (B), due to the longer time span elapsed. (C) Graph showing the relationship between phylogenetic accuracy of internodes in the 42-million-year window and total time span simulated, after MP analysis of 100 simulated data matrices, each containing 16,000 characters (of which roughly 6000 are variable). (D) The same graph as in (C), but with simulated data matrices, each containing 73,000 characters (of which roughly 28,000 are variable). Similar results were obtained by neighbor-joining (NJ) analyses (fig. S5). Only 10 exemplar internodes are shown (all the internodes are shown in fig. S5). The numbers of internodes in all panels are according to (36).

Fig. 4.

The limit for resolution of cladogenetic events of Cambrian age, under the best of circumstances, may be an order of magnitude higher than previously thought. The phylogenetic accuracy with which internodes are resolved (the ordinate) is plotted against the length of each internode in million years (the abscissa). Data sets of two different lengths (data set in blue, 16,000 characters, of which 6000 are variable; data set in red, 73,000 characters, of which 28,000 are variable) were generated by simulation, assuming a tree with a 600-million-year time span. For many internodes with lengths much higher than 1 million years, resolution accuracy values are low, irrespective of data set size. Not all internodes of a given age exhibit the same resolution accuracy. For example, certain 3-million-year internodes are resolved with 100% accuracy, whereas other internodes of similar (or greater) age exhibit much lower values of resolution accuracy. Results are shown for analyses using MP [similar results are obtained by NJ analyses (fig. S7)].

The lack of resolution as a signature of events compressed in time. The resolution of other clades of the metazoan tree, in which cladogenetic events are thought to be much further apart than 1 million years, has also proved challenging, despite the use of large amounts of data. For example, fossil evidence suggests that the three major lineages of lobe-limbed vertebrates (lungfish, coelacanths, and tetrapods) first appeared within a time span of 20 to 30 million years approximately 390 million years ago (39). However, resolution of the relationships among these three lobe-limbed vertebrate lineages has not been obtained, despite analyses of more than 40 gene sequences from key taxa (40). The lack of resolution of lobe-limbed vertebrates, of metazoan phyla here, and of other problematic groups that diverged in deep time, such as the arthropods, coupled with the simulation studies, suggests that, given adequate sequence data, the lack of phylogenetic resolution is a positive signature of closely spaced cladogenetic events.

Of course, the ultimate objective of phylogenetics is to resolve the true branching order within such important groups. So what are the prospects for doing so? It has been argued that the use of even more gene sequences will increase the resolution of such radiations compressed in deep time (35, 40). However, the number of genes identifiable as orthologs, or usable in taxa that diverged in deep time, may actually turn out to be on the order of the number of genes currently being used in some studies (13, 40). If the maximum number of genes that could further be added to existing data matrices is not much greater, even the use of all conserved gene sequences across metazoan phyla or lobe-limbed vertebrates may not suffice for accurate reconstruction of certain clades. Furthermore, although increasing the gene number greatly reduces sampling error (11), the vulnerability to systematic error artifacts also increases (16), perhaps explaining how different phylogenetic analyses can reach contradicting inferences with absolute support (41-43). In such cases, the use of alternative types of molecular characters, such as rare genomic changes (44-46), and the development of more realistic models of character evolution (47) may hold the key to further progress in resolving closely spaced ancient diversification events.
