Report

Rapid and Accurate Large-Scale Coestimation of Sequence Alignments and Phylogenetic Trees

See allHide authors and affiliations

Science  19 Jun 2009:
Vol. 324, Issue 5934, pp. 1561-1564
DOI: 10.1126/science.1171243

Abstract

Inferring an accurate evolutionary tree of life requires high-quality alignments of molecular sequence data sets from large numbers of species. However, this task is often difficult, slow, and idiosyncratic, especially when the sequences are highly diverged or include high rates of insertions and deletions (collectively known as indels). We present SATé (simultaneous alignment and tree estimation), an automated method to quickly and accurately estimate both DNA alignments and trees with the maximum likelihood criterion. In our study, it improved tree and alignment accuracy compared to the best two-phase methods currently available for data sets of up to 1000 sequences, showing that coestimation can be both rapid and accurate in phylogenetic studies.

Phylogeny estimation from molecular sequences typically has two phases: An alignment is estimated, and then a tree is produced for the alignment. Alignment methods like MAFFT (1), Probcons (2), Probtree (3), Prank (4), and Muscle (5) provide more accurate alignments than earlier methods (3, 4, 6), and maximum likelihood (ML) methods of phylogeny estimation [e.g., RAxML (7, 8), GARLI (9), and Phyml (10)] produce more accurate trees for large data sets than other phylogeny estimation methods (10). Still, estimation of phylogenies and alignments for large data sets is difficult and highly inaccurate when the sequences examined for phylogenetic reconstruction have evolved with many substitutions and insertions and deletions (indels).

Alignment estimation methods typically estimate guide trees on which an alignment is then produced. The methods can be highly sensitive to the guide tree (3), and automated alignments often require manual realignment, sometimes necessitating months of intensive analysis of the sequences and taxa (11, 12). Manual realignment can be error prone due to limitations in alignment editors; the inability of an individual to view all the sequences simultaneously; or idiosyncratic, unstated realignment criteria. Also, hard-to-align DNA regions that might contain phylogenetically useful information may be rejected due to a lack of confidence in the alignments. Finally, regions that are hard to align are typically aligned differently by different programs, and phylogenies estimated on these different alignments can differ substantially (13). Consequently, molecular phylogenies often are produced using slowly evolving DNA regions that can be aligned relatively easily. Restriction to such regions likely has resulted in less fully resolved evolutionary histories for many groups due to underutilization of data.

Current methods for estimating trees directly from unaligned sequences include POY (14), POY* (15), ALIFRITZ (16), BAli-Phy (17), and the method of Lunter et al. (18), while SATCHMO (19) can be used for estimating trees directly from amino acid sequences. These methods potentially circumvent some of the aforementioned issues, but have limitations of their own. POY and POY* can be run on large data sets but have not produced trees more accurate than those of the best two-phase methods (15, 20). The other methods are based on parametric statistical models of sequence evolution including substitutions and indels. ALIFRITZ and BAli-Phy use models that allow multinucleotide indel events, engendering large computational burdens. A study assessing their performance (21) found that ALIFRITZ could analyze ~30 sequences, and BAli-Phy was limited to 10. Our studies of these methods on 100 taxon data sets (tables S28 and S29) showed that after 2 weeks of analysis, although BAli-Phy and ALIFRITZ were able to produce trees and alignments on some data sets, their trees were not as accurate as those produced by the best two-phase methods. A method that includes only single-nucleotide indel events (18) was reported to analyze data sets with 50 sequences (21); however, our attempts to test this method (fig. S4) revealed that the current implementation is not operational.

We present SATé (simultaneous alignment and tree estimation), a program designed to address current speed and accuracy limitations in phylogenetic analysis (available at www.cs.utexas.edu/users/tandy/science-paper.html). To make large-scale coestimation of trees and alignments feasible, we used a maximum likelihood rather than a Bayesian approach, with gaps treated as missing data. These choices improved scalability over ALIFRITZ and BAli-Phy, while improving accuracy over two-phase methods on large, hard-to-align data sets.

SATé searches for a tree/alignment pair with an optimal ML score by performing hill-climbing searches from a collection of starting tree/alignment pairs. For each starting alignment, SATé estimates an ML tree (general time reversible + gamma model) with RAxML. The “second stage” then uses an iterative, greedy search heuristic to find tree/alignment pairs with better ML scores (Fig. 1). In each iteration, a new alignment is proposed by our divide-and-conquer method: Center-Tree-i (CT-i) decomposition (Fig. 2). The default setting runs stage 2 for 24 hours, allowing the final iteration begun before the 24-hour limit to complete. The user can specify the stopping criterion, e.g., time limits or until no improvement in the ML score has occurred for 24 hours. If two or more starting tree/alignment pairs are used, the final tree/alignment pairs from each run are compared, and the one with the best ML score is returned.

Fig. 1

SATé’s second stage. Beginning with the current best tree/alignment pair, SATé iterates between realigning on the current tree and estimating a RAxML tree for each new alignment. At the end of each iteration, the tree/alignment pair optimizing ML under the GTR+Gamma model of evolution is saved.

Fig. 2

SATé’s divide-and-conquer strategy, illustrated with a CT-2 decomposition. A branch in the initial tree is selected, and the subtrees—A, B, C, D—around the branch are determined. The sequences for each of the four subtrees are realigned by MAFFT. These realigned subproblems are then aligned with one another, two at a time, using Muscle, until an alignment on the full data set is obtained. RAxML then computes a tree on the alignment. SATé iterates this process, as shown in Fig. 1.

For a CT-i decomposition, SATé computes a bifurcating tree with maximum diameter 2i − 1 around a branch selected in the current best tree. The branch is typically either a midpoint branch (the middle branch on a longest path in the tree) or a centroid branch, which splits the tree into two subtrees with roughly equal numbers of taxa. A CT-i decomposition produces at most 2i subsets, which are clades in the current tree (see Fig. 2 and Supporting Online Material for examples), where i is user-specified (default: i = 5), allowing the number and size of subproblems to be tuned for the number of taxa in the full data set. Sequences in each subset are realigned (default = MAFFT), and the alignments are progressively merged (default = Muscle), using the current tree as a guide tree, into an alignment on the full data set. An ML tree is estimated on the new alignment with RAxML. The tree/alignment pair with the better ML score (either the original pair or the new pair) is then used for the next iteration.

To determine the best defaults and options for SATé, we explored multiple combinations of the algorithmic parameters (22). We observed that CT-5 decompositions performed reliably well, and that starting with a tree/alignment pair having a good ML score improved SATé’s convergence rate, but that SATé was robust to the choice of starting alignment/tree pairs (figs. S7 to S10 and tables S20, S21, and S28).

Two variants of SATé performed particularly well. The first, SATé24, was fast and produced highly accurate alignments and trees. For its starting alignment/tree pair, SATé24 selects among four tree/alignment pairs by running RAxML on four alignments [ClustalW (23), Muscle, MAFFT, and Prank+GT, which is Prank provided with a RAxML-on-MAFFT guide tree] and picks the pair with the best ML score on its tree. SATé24 then uses its hill-climbing strategy for 24 hours to search for better solutions. The second variant, SATéBML, was slower but produced more accurate alignments and trees than the first. It uses two starting alignment/tree pairs (the ClustalW alignment and its RAxML tree, and the pair used by SATé24), lets each of the hill-climbing SATé analyses run until no change has occurred for 24 hours, and returns the solution with the best ML score.

We tested these SATé variants on both simulated and biological data sets, relative to the alignment methods ClustalW, MAFFT, Muscle, and Prank+GT, and to RAxML, run to completion.

For the simulation studies, we compared SATé24 to the two-phase methods, using ROSE (24) to simulate sequence evolution under 30 models (20 replicates per model, root sequences of 1000 base pairs, trees of 500 and 1000 taxa) representing a wide range of conditions (tables S1 to S3). We used ROSE’s output to determine the true alignment for each data set and the model-tree branches on which no changes occurred (zero-event branches). We used the true alignment to calculate SP-FN alignment error rates. SP-FN is the sum of pairs for the false-negative error rate: the percentage of evolutionarily homologous pairs of nucleotides missing in the estimated alignment and the complement of the SP accuracy measure (25). We contracted zero-event branches of the model trees—because reconstruction of these branches is effectively random (26)—producing potentially inferable model trees (PIMTs). We used the PIMTs to quantify the percentage of branches missing from an estimated tree but present in the PIMT (the missing branch rate).

RAxML runs on the true alignment consistently produced a low proportion of missing branches (averaging 7.4% overall), even when run on models whose true alignments contained high proportions of indels and high substitution rates (Fig. 3 and fig. S6), indicating that treating indels as missing data introduced minimal error in tree estimation, even for data sets containing many indels. SATé24’s missing branch rates (averaging 9.1% overall) were only somewhat higher than those obtained by RAxML on the true alignment (Table 1). By comparison, the overall average missing branch rates for RAxML trees produced on estimated alignments were 12.6% for MAFFT, 13.0% for Prank+GT, 15.3% for Muscle, and 21.6% for ClustalW (Table 1). On these data sets, runtimes for SATé24’s second stage were at most 28 hours for 500 taxa and 34 hours for 1000 taxa (table S9).

Fig. 3

The 1000-taxon model results. All x axes show the 15 1000-taxon models, from easy to hard, based on missing branch rates. From top-to-bottom panels, the y axes are missing branch rate, alignment SP-FN error, true alignment setwise statistics, and Spearman rank correlation coefficients (ρ). All data points include SE bars. For the top two panels, models on the x axis followed by an asterisk indicate that SATé’s performance was significantly better than the nearest two-phase method (paired t tests, setwise α = 0.05, n = 40 for each test).

Table 1

Missing branch rates and alignment SP-FN error rates. For each statistic and model, the best-performing methods (within 1%) are in boldface. The best overall is indicated with an asterisk (*). For each “All models” entry, n = 600; for each easy entry, n = 120; for each moderate-to-difficult entry, n = 180.

View this table:

We labeled some models “easy” because all methods produced trees with accuracy close to that of RAxML on the true alignment (Fig. 3 and fig. S6). For the remaining “moderate-to-difficult” models (Table 1, Fig. 3, and fig. S6), SATé24 produced the most accurate trees and alignments, ClustalW produced the least, and the other methods were intermediate. SATé24’s missing branch rates on the moderate-to-difficult models were significantly better than those produced by the closest two-phase method in all cases (P ≤ 0.004; paired t test) except that of model 1000M3, whose performance was similar to that obtained by running RAxML on a MAFFT alignment (Fig. 3 and table S13). For all methods, missing branch rates and alignment error rates tended to increase as the number of taxa (Table 1) or rate of evolution increased (Fig. 3); however, SATé24’s error rates generally increased much less than those of the other methods. Thus, SATé24 provided the largest advantage on the hardest data sets. SATé24’s worst average missing branch rate (17.4%) was on model 1000M1, on which all other methods had error rates >30% (Fig. 3). Furthermore, SATé24 achieved these performance gains on standard desktop computers with 4 GB of RAM.

Because tree error may increase when unreliable sites are included, we estimated ML trees on data sets from which sites were eliminated using four techniques [including two variants of Gblocks (27)]. These techniques either had little impact on tree estimation accuracy or made the estimations worse (tables S31 to S34).

SATé24’s improved tree/alignment accuracy arose from its divide-and-conquer strategy and use of the ML criterion. The CT-5 decomposition improved starting alignments and trees on the moderate-to-difficult models (table S11) and modified (without necessarily improving) them on the easy models (table S12). The ML score was positively correlated with alignment and tree accuracy across all models, especially the moderate-to-difficult models (Fig. 3, fig. S6, and tables S14 and S15). Furthermore, even when SATé24 proposed alignments that were accepted but resulted in trees of reduced accuracy, the average reduction in alignment or tree accuracy was at most 0.5% (table S11).

SATé sometimes found better trees and alignments with different starting tree/alignment pairs (figs. S7 and S9 and table S28). Therefore,we recommend multiple runs of SATé using more than one starting pair, e.g., using SATéBML if time permits.

We compared SATéBML to the two-phase methods on six biological ribosomal RNA data sets (117 to 1028 taxa; table S7) for which highly reliable, curated alignments are available. SATéBML and MAFFT alignments were of comparable accuracy and better than the other alignments (table S18), and SATéBML trees were more accurate than those produced by any two-phase method (table S19).

Thus, SATé is a fast, effective, and fully automated method for simultaneous estimation of trees and alignments for nucleotide sequences on large numbers of taxa. For rapidly evolving sequences, it dramatically speeds and improves tree and alignment estimations compared to the best current two-phase methods, and it removes or significantly reduces the need for hand realignment of data sets. Finally, it makes possible the use of DNA regions previously rejected for alignment difficulty, potentially leading to significantly improved resolution of species’ relationships.

Supporting Online Material

www.sciencemag.org/cgi/content/full/324/5934/1561/DC1

Materials and Methods

SOM Text

Figs. S1 to S10

Tables S1 to S34

References

References and Notes

  1. Materials and methods are available as supporting materials on Science Online.
  2. This research was supported by the U.S. NSF under grants DEB 0733029, ITR 0331453, ITR 0121680, EIA 0303609, and IGERT 0114387. The simulated and biological data sets, true and curated alignments, PIMTs, and all software are available at www.cs.utexas.edu/users/tandy/science-paper.html.
View Abstract

Navigate This Article