The origin and evolution of animals have remained hotly debated issues ever since Darwin drew attention to the relative paucity of fossils from the Precambrian, which ended 543 million years ago (Mya) (1). On the one hand, a growing collection of exquisitely preserved fossils of soft-bodied animals from the Cambrian has highlighted the existence of Cambrian representatives of most of the living animal phyla (2). This has given rise to the “Cambrian Explosion” hypothesis (3) that most animal phyla arose ~543 Mya within a short period (see the figure). On the other hand, studies of increasingly large molecular data sets suggest that many of the same phyla arose in the Precambrian (4-6), leading to considerable and occasionally intemperate debate on the timing of divergence of animal phyla. The article by Rokas et al. (7) on page 1933 in this issue presents a new aspect to this controversy.
Motivated by the desire to resolve the early evolution of animals, while accounting for insufficient data from some taxa, Rokas et al. used an alignment of 12,060 amino acids, encoded by 50 genes, to infer the phylogeny of 16 representatives from nine animal phyla. Eleven of these were from Porifera, Cnidaria, Platyhelminthes, Mollusca, Annelida, and Priapulida. These six phyla have not been adequately represented in recent molecular studies of early animal evolution (4-6). Given the aligned data, Rokas et al. inferred a phylogeny with several distinct speciation events that were consistently supported by the data, but with a conspicuous polytomy (multiple, concurrent divergence events) involving the protostome phyla, and another involving the bilaterian, cnidarian, and poriferan lineages. Recognizing that a polytomy may be due to rapid speciation, or to poor or insufficient data, Rokas et al. then explored whether convergence, “rogue” sequences, compositional heterogeneity, missing or inadequate data, or mutational saturation could have affected the phylogenetic estimate, and found that none of these was a likely cause. In the absence of other explanations, they concluded that their data support rapid speciation during the early evolution of animals.
This conclusion is significant because it is consistent with the Cambrian Explosion hypothesis. Whereas previous molecular studies have concluded that the divergence of animal phyla occurred gradually over a period stretching hundreds of millions of years into the Precambrian (4-6), the present study suggests that two episodes of multiple divergence events occurred, each over just a few million years. This conclusion is consistent with a phylogeny based on fossil data (see the figure), thus reconciling a key difference in opinion between evolutionary biologists and paleontologists. However, Rokas et al. did not date these episodes, so corroboration of the Cambrian Explosion hypothesis is conditional on future estimates of the divergence dates not falling well within the Precambrian.
Is the conclusion drawn by Rokas et al. sound? At first glance, it appears so, but can their conclusion stand up to closer scrutiny? Even if we accept that the genes analyzed by the authors evolved without gene duplication and that the amino acids are aligned correctly, most phylogenetic methods still assume that the evolutionary dynamics of the 12,060 amino acid sites are independently and identically distributed, and that the sites evolved under the same stationary, reversible, and homogeneous conditions (8). These assumptions arise from the need to render phylogenetic methods tractable and easy to use, and they are unlikely to be realistic. To account for the observation that the sites in a gene may evolve at different rates, some phylogenetic methods are able to model rate heterogeneity across sites using a Γ distribution (9). Rokas et al. used this approach for the whole alignment but did not consider that different parts of the alignment may require different Γ distributions. Nor did they consider that some sites may vary nonindependently (10) and that the distribution of variable sites may vary across lineages and through time, an issue that is notoriously difficult to resolve (11). A logical extension to the work would be to partition the alignment and estimate the evolutionary rates for different genes separately.
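To make the point about gene-specific Γ distributions concrete, the following is a minimal sketch of how the shape parameter of a mean-one Γ distribution controls the spread of site rates. It approximates equal-probability rate categories by Monte Carlo sampling rather than by the exact quantile calculations that phylogenetic software would normally use. The function name, the gene labels, and the shape values are hypothetical and chosen only for illustration; they are not taken from Rokas et al.

```python
import random


def discrete_gamma_rates(alpha, n_categories=4, n_draws=100_000, seed=0):
    """Approximate the mean rate of each equal-probability category of a
    Gamma distribution with shape `alpha` and scale 1/alpha (mean rate 1).

    This is a Monte Carlo approximation: draw many rates, sort them, cut the
    sorted list into equal-sized bins, and average each bin. Any draws left
    over when n_draws is not divisible by n_categories are ignored.
    """
    rng = random.Random(seed)
    draws = sorted(rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_draws))
    size = n_draws // n_categories
    return [
        sum(draws[k * size:(k + 1) * size]) / size
        for k in range(n_categories)
    ]


# Hypothetical shape parameters for two genes (assumed values, not estimates).
gene_alphas = {"gene_A": 0.5, "gene_B": 2.0}

for gene, alpha in gene_alphas.items():
    rates = discrete_gamma_rates(alpha)
    print(gene, [round(r, 2) for r in rates])
```

A small shape value concentrates most sites at very low rates with a few fast sites, whereas a larger value gives rates that are more nearly equal. Fitting one shape to a concatenated alignment averages over such differences; a partitioned analysis would instead estimate a separate shape for each gene.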
Violation of the assumed stationary, reversible, and homogeneous conditions may lead to compositional differences in the aligned amino acid sequences and hence to errors in phylogenetic estimates (12). Rokas et al. recognized this potential source of error but used a test that is known to be flawed, even though better tests are known (13). Furthermore, they chose a phylogenetic method that, while it accounts for compositional variation in the sequence alignment, is unsuitable: It assumes that the sites are independently and identically distributed, which they have already shown not to be the case. Moreover, they used a single Markov (probabilistic) model to analyze the alignment of amino acids, in effect using a “one size fits all” approach, where it would have been better to use several Markov models to capture gene-specific differences in the evolutionary processes (14).
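One way to probe whether two aligned sequences could have evolved under the same stationary, reversible, and homogeneous conditions is a matched-pairs test of symmetry (Bowker's test), which compares the two sequences site by site instead of comparing their overall compositions. The sketch below is an illustration only; whether this is the specific test cited above is an assumption, and the function name and the set of characters treated as missing data are hypothetical.

```python
from collections import Counter


def bowker_symmetry(seq_a, seq_b, skip_chars="-?X"):
    """Return (statistic, degrees_of_freedom) for Bowker's test of symmetry.

    Sites are tallied into a divergence matrix n[x, y] = number of sites with
    state x in seq_a and state y in seq_b. Under symmetry, n[x, y] and
    n[y, x] have equal expectations, and the statistic is approximately
    chi-square distributed with the returned degrees of freedom.
    Sites containing any character in `skip_chars` are ignored (an assumed
    convention for gaps and unknown residues).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    counts = Counter(
        (a, b)
        for a, b in zip(seq_a, seq_b)
        if a not in skip_chars and b not in skip_chars
    )
    states = sorted({state for pair in counts for state in pair})

    statistic = 0.0
    dof = 0
    for i, x in enumerate(states):
        for y in states[i + 1:]:
            n_xy = counts[(x, y)]
            n_yx = counts[(y, x)]
            if n_xy + n_yx > 0:  # only informative off-diagonal pairs count
                statistic += (n_xy - n_yx) ** 2 / (n_xy + n_yx)
                dof += 1
    return statistic, dof


# Toy example: every difference goes in one direction (A in seq_a, G in seq_b),
# so the off-diagonal counts are strongly asymmetric.
print(bowker_symmetry("AAAAAAAA", "GGGGGGAA"))
```

The statistic would be compared against a chi-square distribution with the returned degrees of freedom; a large value for a pair of sequences suggests that the pair did not evolve under the same stationary and reversible process.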
Rokas et al. used nonparametric bootstrap and posterior probabilities to gauge support for the pattern and order of speciation events (branches in their phylogenetic tree). The former is widely recognized to be statistically unwise. Bootstrap values are estimates of the expected frequency with which speciation events (internal branches) occur in the optimal tree, using data constructed from the original alignment by sampling sites with replacement (15). They are not a measure of accuracy or confidence, but of data consistency. Further, the increase in bootstrap value when more genes are included may be misleading, because longer sequences naturally tend to have higher bootstrap values (see the figure). The posterior probabilities of speciation events being correctly identified are also prone to error when the phylogenetic assumptions are violated in the sense described above.
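The resampling step behind a nonparametric bootstrap can be sketched in a few lines. In the example below, alignment columns are drawn with replacement to build pseudo-replicate alignments, and "support" is the fraction of replicates that recover a grouping seen in the original data. Real analyses re-estimate a full tree for every replicate; here a deliberately crude stand-in (the pair of sequences with the smallest Hamming distance, with ties broken alphabetically) is used so that the example stays self-contained. The toy alignment and all function names are hypothetical.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one pseudo-replicate by sampling alignment columns with replacement.

    `alignment` maps taxon name -> sequence; all sequences have equal length.
    """
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }


def hamming(a, b):
    """Number of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def closest_pair(alignment):
    """Toy stand-in for tree inference: the pair of taxa with the fewest
    differences (ties broken alphabetically, which is a simplification)."""
    names = sorted(alignment)
    pairs = [
        (hamming(alignment[p], alignment[q]), p, q)
        for i, p in enumerate(names)
        for q in names[i + 1:]
    ]
    return min(pairs)[1:]


def bootstrap_support(alignment, n_replicates=200, seed=0):
    """Fraction of pseudo-replicates that recover the closest pair found in
    the original alignment."""
    rng = random.Random(seed)
    target = closest_pair(alignment)
    hits = sum(
        closest_pair(bootstrap_alignment(alignment, rng)) == target
        for _ in range(n_replicates)
    )
    return hits / n_replicates


# Hypothetical toy alignment of three taxa.
toy = {"sp1": "AACCGGTT", "sp2": "AACCGGTA", "sp3": "AACCGCAA"}
print(closest_pair(toy), bootstrap_support(toy))
```

The returned frequency reflects only how consistently the resampled sites point to the same grouping. Adding more sites that carry the same signal, whether that signal is correct or biased, will typically push the frequency toward 1, which is why a high value cannot be read as a measure of accuracy.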
In light of these concerns, are the conclusions of Rokas et al. justified? Should we ignore their study? Most certainly not, because they have produced a wealth of data and have shown that it might just be possible that the fossil record can be reconciled with molecular data. This, in itself, should be cause for celebration and an incentive to acquire sequence data from the remaining 26 animal phyla. Likewise, it should encourage development of methods that assess when data violate phylogenetic assumptions, and that cope with such data. To achieve these goals, we need to know more about the structure and function of gene products before we can develop models that appropriately address the early evolution of animals.