## Abstract

The use of molecular phylogenies to examine evolutionary questions has become commonplace with the automation of DNA sequencing and the availability of efficient computer programs to perform phylogenetic analyses. The application of computer simulation and likelihood ratio tests to evolutionary hypotheses represents a recent methodological development in this field. Likelihood ratio tests have enabled biologists to address many questions in evolutionary biology that have been difficult to resolve in the past, such as whether host-parasite systems are cospeciating and whether models of DNA substitution adequately explain observed sequences.

Evolutionary biology is founded on the concept that organisms share a common origin and have subsequently diverged through time. Phylogenies represent our attempts to reconstruct this evolutionary history, and there is probably more interest in phylogenetic reconstruction today than at any time in the past. For years phylogenetics played a relatively minor role in evolutionary biology, and it is only in the past decade that the importance of phylogeny in most branches of biology has been fully recognized (1, 2). Today it is not uncommon to see phylogenies applied in fields far removed from evolutionary biology. For example, they have found a practical use in tracing routes of infectious disease transmission and in identifying the relationship of pathogens, such as the New Mexico hantavirus (3).

With the realization that phylogeny can provide answers to many questions of interest in evolutionary biology, there has been an explosion in the number of statistical tests that take phylogeny into account. In part, this is because an essentially infinite number of possible tests can be applied to any biological question. A hypothesis test involves calculating a test statistic from the data and then determining the probability of the observed statistic if the hypothesis were true; the probability is obtained from the null distribution of the test statistic (that is, the distribution if the hypothesis is true). For hypothesis tests involving phylogeny, the null distribution is usually generated by either permuting data matrices or resampling from the original data. However, the statistical properties of many tests based on such procedures are known to be poor, and although permutation of data matrices is a common procedure, the null hypothesis for many such tests is often not well defined (4). Similarly, although nonparametric bootstrapping is widely used to evaluate the support of the data for a particular phylogeny, the statistical interpretation of bootstrap values remains problematic (5).

The past 5 years have seen remarkable advances in the use of parametric statistical tests of questions involving phylogeny. In particular, increased computing speed, more realistic models of DNA substitution, and improved computer programs have led to practical statistical tests using likelihood ratios and Monte Carlo simulation procedures. Although statistical tests can be constructed in many different ways (1,6), we concentrate in this review on likelihood ratio tests (LRTs) for several reasons. First, LRTs have the same status in hypothesis testing as does maximum likelihood in parameter estimation. That is, just as maximum likelihood estimates (MLEs) are known to have desirable statistical properties such as consistency, LRTs are known to outperform other hypothesis tests under many conditions. For example, LRTs are known to be optimal (uniformly most powerful) when comparing simple hypotheses, and LRTs often perform well for cases in which no optimal test is known (7). Second, many applications of LRTs do not assume that the phylogeny is known. This is an advance over tests that assume that the phylogeny is known without error (1) because all existing methods of phylogeny reconstruction are subject to both systematic and random errors. In many cases, the error in phylogeny estimation can be large (8). Third, LRTs provide a unified framework for testing hypotheses.

## Maximum Likelihood and Hypothesis Testing

Maximum likelihood estimation of phylogenetic trees was first introduced by Edwards and Cavalli-Sforza in the early 1960s (9). Felsenstein (10) implemented the method for DNA sequence data, and most recent advances have focused on the analysis of DNA sequences. Stated simply, the MLE of phylogeny is the tree for which the observed data are most probable. For the present purposes, the data are aligned DNA sequences for *s* species. The first step in a likelihood analysis is to calculate the probability of the observed sequences; this probability depends on an explicit mathematical model of evolution (11). The model consists of two parts: (i) a phylogenetic tree with branch lengths defined in terms of the expected number of substitutions per site, and (ii) a model of the process of DNA substitution (that is, specifying the probability of the occurrence of a nucleotide substitution at a particular site over the length of a branch). For many studies the phylogenetic tree is the only parameter of interest, but in the course of finding the maximum likelihood tree, other parameters are estimated that may also be of importance (such as the transition rate–transversion rate bias).
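As a minimal sketch of this first step, consider the simplest possible case: two aligned sequences separated by a single branch of length *v* (expected substitutions per site), under the Jukes-Cantor model, in which all substitutions are equally probable. The choice of model, the function names, and the two short sequences are assumptions made for illustration only; they are not the data or the software discussed in this review.

```python
import math

def jc69_log_likelihood(seq_a, seq_b, v):
    """Log likelihood of two aligned sequences separated by branch length v
    (expected substitutions per site) under the Jukes-Cantor model."""
    decay = math.exp(-4.0 * v / 3.0)
    p_same = 0.25 + 0.75 * decay   # P(site ends in the same base)
    p_diff = 0.25 - 0.25 * decay   # P(site ends in one specific other base)
    lnl = 0.0
    for x, y in zip(seq_a, seq_b):
        # 0.25 is the equilibrium frequency of the base in the first sequence.
        lnl += math.log(0.25 * (p_same if x == y else p_diff))
    return lnl

def jc69_ml_distance(seq_a, seq_b):
    """Closed-form maximum likelihood estimate of v (valid when fewer than
    three quarters of the sites differ)."""
    n_diff = sum(x != y for x, y in zip(seq_a, seq_b))
    p = n_diff / len(seq_a)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Hypothetical 20-site alignment with three differences.
seq_a = "ACGTACGTACGTACGTACGT"
seq_b = "ACGTACGAACGTCCGTACGG"
v_hat = jc69_ml_distance(seq_a, seq_b)
lnl_hat = jc69_log_likelihood(seq_a, seq_b, v_hat)
print(round(v_hat, 4), round(lnl_hat, 3))
```

For a full tree the same site probabilities are combined over all branches and summed over unobserved ancestral states, but the principle is unchanged: the estimate is the parameter value that makes the observed sequences most probable.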

Much attention has focused on the accuracy of the phylogenetic trees reconstructed by maximum likelihood. Simulation studies suggest that maximum likelihood is typically more accurate (that is, more likely to predict the actual evolutionary tree) and robust (that is, less sensitive to incorrect models and assumptions) than other methods of phylogenetic inference (12, 13). Moreover, likelihood provides a natural means of hypothesis testing (14). The LRT statistic for comparing two hypotheses (Λ) is defined as

$$
\Lambda = \frac{\max\left[L(\text{Null Hypothesis} \mid \text{Data})\right]}{\max\left[L(\text{Alternative Hypothesis} \mid \text{Data})\right]} \tag{1}
$$

The likelihood *L* is maximized under both the null and alternative hypotheses. The likelihood ratio provides a measure of the support of the data for one hypothesis versus another. If Λ > 1, the data are more probable under the null hypothesis, and this is favored; the alternative hypothesis is favored if Λ < 1. When nested hypotheses are examined (that is, the null hypothesis is a special case of the more general, alternative hypothesis), Λ can never exceed 1 and −2 log Λ is approximately χ^{2} distributed under the null hypothesis with *q* degrees of freedom, where *q* is the difference in the number of free parameters between the null and alternative hypotheses. Alternatively, the probability of observing a given Λ if the null hypothesis were correct (the significance level) can be calculated by using Monte Carlo simulations, as explained below (15).

Although LRTs have a long history in statistics, they have had only a limited application in phylogenetics, with the first application of an LRT (a test of the molecular clock) proposed in 1981 (10). Why has it taken so long for LRTs to be applied in phylogenetic analysis? One problem concerns the use of topology as a model parameter. It is known that many of the standard results for LRTs do not apply to phylogenetic trees (16). For example, in considering nested phylogenetic hypotheses, the usual χ^{2} approximation to the distribution of the test statistic often cannot be used to determine the significance of the LRT statistic (16). This problem can be avoided, however, by generating null distributions using computer simulation (16, 17). In this procedure, known as parametric bootstrapping or Monte Carlo simulation, the null distribution of the test statistic is calculated by simulating many data sets (Fig. 1). Monte Carlo simulation has been widely used in statistics since the early 1960s (15). Model parameters for the simulations are estimated from the original data under the null hypothesis. The likelihood ratio is calculated for each simulated data set, and the significance level of the test is the proportion of replicates in which the simulated likelihood ratio is at least as extreme as the one calculated from the original data.
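The simulation procedure can be sketched on a deliberately small problem. The example below is an assumption-laden toy, not the procedure of Fig. 1: it compares two sequences only (so topology plays no role), takes the Jukes-Cantor model as the null hypothesis and the Kimura two-parameter model (separate transition and transversion probabilities) as the alternative, and summarizes the data as hypothetical counts of identical, transition, and transversion sites. It also assumes that the observed proportions lie in the range where they are the maximum likelihood estimates under the alternative.

```python
import math
import random

def xlogy(count, prob):
    """count * log(prob), with the convention that 0 * log(0) = 0."""
    return 0.0 if count == 0 else count * math.log(prob)

def lrt_statistic(n_same, n_ts, n_tv):
    """-2 log(Lambda) for JC69 (null) versus K80 (alternative), two sequences."""
    n = n_same + n_ts + n_tv
    p = (n_ts + n_tv) / n  # proportion of differing sites
    # Null: each of the 12 differing site patterns has probability p / 12.
    lnl_null = (xlogy(n_same, (1 - p) / 4) + xlogy(n_ts, p / 12)
                + xlogy(n_tv, p / 12))
    # Alternative: transitions (4 patterns) and transversions (8 patterns)
    # have their own probabilities, estimated by the observed proportions.
    lnl_alt = (xlogy(n_same, (1 - p) / 4) + xlogy(n_ts, n_ts / n / 4)
               + xlogy(n_tv, n_tv / n / 8))
    return 2.0 * (lnl_alt - lnl_null)

def parametric_bootstrap(n_same, n_ts, n_tv, replicates=1000, seed=1):
    """Return the observed statistic and its simulated significance level."""
    rng = random.Random(seed)
    n = n_same + n_ts + n_tv
    p = (n_ts + n_tv) / n          # parameter estimated under the null
    observed = lrt_statistic(n_same, n_ts, n_tv)
    exceed = 0
    for _ in range(replicates):
        s = t = v = 0
        for _ in range(n):         # simulate one data set under the null
            u = rng.random()
            if u < 1 - p:
                s += 1
            elif u < 1 - p + p / 3:   # one third of changes are transitions
                t += 1
            else:
                v += 1
        if lrt_statistic(s, t, v) >= observed:
            exceed += 1
    return observed, exceed / replicates

# Hypothetical counts: 500 sites, 40 transitions, 10 transversions.
observed, p_value = parametric_bootstrap(450, 40, 10)
print(round(observed, 2), p_value)
```

The statistic here is −2 log Λ, so large simulated values count as "at least as extreme". In a phylogenetic application each replicate would require simulating sequences on the null tree and maximizing the likelihood again under both hypotheses, which is where most of the computational cost arises.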

Table 1 lists several hypotheses involving phylogeny for which LRTs are available. LRTs have been applied to problems such as the relative fit of models of DNA substitution to sequence data and the evaluation of evidence for the monophyly of a taxonomic group. For many of the questions posed in Table 1, alternative tests are available, some of which are claimed to be nonparametric. However, all statistical tests involving phylogeny require assumptions about the evolutionary process, even though an explicit model may not be used. Assumptions about the process of evolution are required, for example, when estimating a phylogenetic tree. One of the advantages of LRTs is that model assumptions can themselves be tested and potentially improved.

## Tests of Models of DNA Substitution

All phylogenetic methods make assumptions, whether explicit or implicit, about the process of DNA substitution. Systematists are in an awkward situation in that they know the assumptions of a phylogenetic method are imperfect. Yet they also know that the match between the process of nucleotide substitution generating the sequence variation and the substitution model assumed may be critical. The realism of substitution models is important because methods for inferring phylogeny may be less accurate, or may be inconsistent (that is, converge to an incorrect tree with increased amounts of data), in situations where the model is incorrect (8, 13, 18). Evolutionary biologists also have an intrinsic interest in accurately modeling the processes that produce variation in DNA sequences and thereby improving our understanding of molecular evolution. Molecular systematists interested in phylogenetic inference have long been troubled by the question of how to choose the optimal substitution model for a particular data set. Maximum likelihood provides a rational method for choosing substitution models for phylogenetic analysis through the use of LRTs.

Current models implemented in phylogenetic inference using maximum likelihood (and several other methods as well) assume that DNA substitutions follow a Poisson process. The most general model allows each type of nucleotide substitution to have an independent rate parameter (there are 12 rate parameters in total) (19). Also, rate heterogeneity among sites can be accommodated by assuming that rates are distributed among different sites according to some probability distribution (usually a gamma, Bernoulli, or log-normal distribution), or by assigning sites to different rate classes (for example, first, second, and third codon positions) and then estimating the substitution rate for each class (20). The models implemented in likelihood have also been modified to allow parameters to be estimated separately for different data partitions or for different branches of the phylogenetic tree (21, 22). In short, the substitution models used in a phylogenetic analysis can be made arbitrarily complex by the addition of parameters, each of which can be estimated using likelihood methods.
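For reference, the most general model described above can be written as an instantaneous rate matrix over the four nucleotides. The notation is an assumption introduced here for illustration: $r_{ij}$ is the rate of change from nucleotide $i$ to nucleotide $j$, rows and columns are ordered A, C, G, T, and each diagonal entry (shown as a dot) is the negative sum of the other entries in its row.

$$
Q = \begin{pmatrix}
\cdot & r_{AC} & r_{AG} & r_{AT} \\
r_{CA} & \cdot & r_{CG} & r_{CT} \\
r_{GA} & r_{GC} & \cdot & r_{GT} \\
r_{TA} & r_{TC} & r_{TG} & \cdot
\end{pmatrix},
\qquad
P(t) = e^{Qt}
$$

The twelve off-diagonal entries are the twelve rate parameters, and the matrix $P(t)$ gives the substitution probabilities over an elapsed time $t$. Simpler models arise by constraining entries to be equal; setting all twelve rates equal, for example, gives the Jukes-Cantor model.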

One approach to the choice of models in phylogenetic analysis is to use a very complicated (parameter-rich) model for which a large number of free parameters will result in a high likelihood. However, this approach has several disadvantages. First, because a large number of parameters must be estimated for complicated models, the analysis becomes computationally difficult. Second, the error associated with each parameter estimate is higher for more complicated models than for simple ones. This decrease in accuracy appears to apply to all parameters of the phylogenetic model, including the topology; in certain cases, the accuracy of the estimated phylogeny may be improved by using a simpler model (although this is not universal) (12, 13). Finally, an overly complicated model may not be needed to account for the observed data. Occam’s razor provides a principle for choosing among hypotheses that explain a set of observations equally well; the simplest (most parsimonious) hypothesis is preferred. Although a complicated model may make the observed data more probable, it will not necessarily provide a significant improvement in the likelihood over a model with fewer parameters.

How can the model be chosen that best fits the data without introducing superfluous parameters? One approach is to compare the likelihoods of different models using an LRT (10, 16, 23). The significance of the LRT statistic (Λ) can be approximated using simulation or, if the models are nested, by comparing −2 log Λ to a χ^{2} distribution with *q* degrees of freedom, where *q* is the difference in the number of free parameters between the null and alternative models of DNA substitution.
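For nested models with a fixed tree, the χ^{2} comparison takes only a few lines. The sketch below uses the Python standard library alone; the function names and the log likelihoods in the usage example are hypothetical and chosen purely for illustration.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with a
    positive integer number of degrees of freedom."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Start from a closed form (df = 2 or df = 1), then step the shape
    # parameter a upward by one until it reaches df / 2.
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < df / 2.0 - 1e-9:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q

def likelihood_ratio_test(lnl_null, lnl_alt, extra_params):
    """Return -2 log(Lambda) and its approximate chi-square significance.

    lnl_null and lnl_alt are maximized log likelihoods; extra_params is the
    number of additional free parameters in the alternative model (q)."""
    statistic = 2.0 * (lnl_alt - lnl_null)
    return statistic, chi2_sf(statistic, extra_params)

# Hypothetical values: the alternative model adds three free parameters.
statistic, p_value = likelihood_ratio_test(-1250.40, -1244.10, 3)
print(round(statistic, 2), round(p_value, 4))
```

The approximation is appropriate only when the models are nested and the standard regularity conditions hold; where they do not (for example, when topology is one of the parameters being tested), the null distribution should be obtained by simulation instead.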

For illustrative purposes, we applied this procedure to mitochondrial cytochrome oxidase I (COI) DNA sequences gathered by Hafner *et al.* (24) for 13 species of gophers and their associated lice (Table 2). First, we examined the molecular clock hypothesis (10). This hypothesis is satisfied if DNA substitutions follow a Poisson process and the mean rate of substitution has remained constant in different lineages. The log likelihood calculated under the clock hypothesis is log *L* = –2243.26 for the gophers and log *L* = –2782.23 for the lice when a simple model of DNA substitution is used. A more general model assumes that each branch of the phylogenetic tree has a unique unconstrained rate of substitution. This introduces *s* – 2 additional parameters; the likelihood for this latter model is therefore higher than that under the molecular clock hypothesis (log *L* = –2227.98 for the gophers and log *L* = –2776.18 for the lice). Because the models are nested (that is, equal rates among lineages are a special case of the unrestricted model) and the phylogenetic tree is held constant, the statistic –2 log Λ can be compared with a χ^{2} distribution with *s* – 2 degrees of freedom to determine the significance of the test (10). In this case, the molecular clock hypothesis cannot be rejected for either the gophers or the lice. The same LRT procedure applied to the models of DNA substitution shows that the best-fitting model for the gophers and the lice allows for different rates for transitions and transversions, unequal base frequencies, and among-site rate heterogeneity (25).

The ability to choose among models in performing a phylogenetic analysis is one of the great strengths of a likelihood approach. For many widely used phylogenetic methods, there are no generally accepted criteria for choosing among possible evolutionary models [but see (26)]. For example, the maximum parsimony method allows many types of data to be analyzed under a large class of substitution models or “weighting schemes,” but few criteria exist for choosing among weighting schemes. Methods for choosing models are important because different models may lead to different conclusions about phylogeny. Much of the arbitrary nature of model choice is eliminated by using a likelihood framework; when different substitution models provide different estimates of phylogeny, the tree associated with the best-fitting model is preferred.

The study of substitution models using LRTs has also provided molecular evolutionists with insights about how the process of DNA substitution operates. Application of LRTs indicates that some of the parameters of models of DNA substitution, which reflect the biology, are very important. For example, accounting for among-site rate heterogeneity almost always provides an improved fit of the model to the data [there is not as significant an improvement for pseudogenes, for which selection has been relaxed (27)]. The improvement in the likelihood obtained by adding among-site rate heterogeneity is usually so great that formal consideration of the significance level is unnecessary. However, LRTs also allow tests of much more subtle hypotheses, such as the way in which the process of substitution differs across the genome (21).

## Tests for Phylogenetic Association

One of the most innovative and useful applications of phylogenies involves the comparison of topologies estimated for different partitions of a data set (for example, different genes) or for different species. If the partitioned data share a common evolutionary history, then the topologies estimated from each should be congruent. A comparison of topologies from different data partitions has been used to identify horizontal gene transfer in bacteria and fungi (28); horizontal gene transfer may be suspected if the tree estimated using one gene is different from the tree estimated using another gene for the same set of species. Similarly, comparison of tree topologies has been used to examine the rate of reassortment of the RNA segments in the hantavirus (29). The hantavirus has three negative sense RNA segments; when more than one virus infects a cell, the opportunity exists for reassortment of the infecting viral segments among the progeny. If genetic reassortment plays an important evolutionary role in the hantavirus, then the trees estimated for the same set of viruses from different segments should be [and are (29)] different. Finally, a comparison of the phylogenies for hosts and parasites is a critical step in determining whether they have cospeciated. Cospeciation of hosts and associated parasites is invoked if the branching patterns and speciation times of the host and parasite trees agree (30).

Although many important questions can be addressed in the areas of evolutionary biology and epidemiology by comparing phylogenetic trees for different species or different genes, until recently there have been few statistical criteria for deciding whether the trees are in agreement. A likelihood approach uses an LRT of the hypothesis that trees estimated for different data partitions, or different species, are congruent [that is, the phylogenetic history is the same (31)]. The null hypothesis for the LRT of congruence is that the same topology underlies different data partitions; the likelihood is maximized under this constraint, but other parameters of the evolutionary model (such as the branch lengths or the transition rate–transversion rate ratio) are estimated independently for each data partition. The likelihood under the alternative hypothesis relaxes the constraint that the same topology underlies all data partitions, although all other aspects of the model are the same.
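The structure of the two hypotheses can be sketched as follows. The example assumes that the log likelihood of each data partition has already been maximized (over branch lengths and substitution parameters) for each of a small set of candidate topologies; the topology labels `T1` to `T3`, the partition names, and all numerical values are hypothetical, and a real analysis would search the full space of topologies rather than a short list.

```python
def congruence_lrt(lnl_by_partition):
    """LRT of congruence from per-partition, per-topology log likelihoods.

    lnl_by_partition maps a partition name to a dict that maps a topology
    label to the log likelihood maximized for that partition on that topology.
    Returns (-2 log Lambda, best shared topology)."""
    partitions = list(lnl_by_partition.values())
    shared_labels = set.intersection(*(set(scores) for scores in partitions))
    # Null: a single topology shared by all partitions, with the remaining
    # parameters estimated separately for each partition.
    lnl_null, best_shared = max(
        (sum(scores[label] for scores in partitions), label)
        for label in shared_labels
    )
    # Alternative: each partition is free to take its own best topology.
    lnl_alt = sum(max(scores.values()) for scores in partitions)
    return 2.0 * (lnl_alt - lnl_null), best_shared

# Hypothetical maximized log likelihoods for three candidate topologies.
lnl = {
    "hosts":     {"T1": -2230.0, "T2": -2236.5, "T3": -2241.0},
    "parasites": {"T1": -2790.2, "T2": -2779.0, "T3": -2781.4},
}
statistic, shared_topology = congruence_lrt(lnl)
print(statistic, shared_topology)
```

Because the hypotheses differ in topology, the significance of the resulting statistic should be judged against a null distribution generated by simulation under the best shared topology, not against a χ^{2} distribution.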

The LRT of congruence has been successfully used to explore questions of host-parasite cospeciation (25). In closely associated host-parasite systems, an allopatric speciation event in a host lineage might be expected to isolate parasite populations associated with each incipient host species, thereby producing a simultaneous allopatric speciation event among parasites. A history of cospeciation in host and parasite lineages should then be reflected by congruent phylogenies for hosts and their associated parasites. What does application of the LRT of congruence indicate about cospeciation in the gopher-louse system? The LRT statistic for the null hypothesis (that the phylogenies for gophers and lice are congruent) is much smaller than would be expected if the null hypothesis were true (32). Hence, although the trees for the gophers and lice are similar (24, 25, 33), the gophers and lice did not strictly cospeciate; host-switching by the lice, persistence of multiple ancestral louse lineages, or both must be invoked to explain the differences between the phylogenetic trees.

Are there any portions of the gopher-louse tree that are congruent and suggest cospeciation? Analysis of a subset of the associated gopher and louse species (the top five gopher and louse species of Fig. 2) suggests that these gopher and louse species have cospeciated. A more refined LRT suggests that the speciation times of the associated gopher and louse species are also identical. The null hypothesis for an LRT of “temporal cospeciation” assumes that the tree and the relative branch lengths for host and parasite phylogenies are the same but that the overall rate of substitution for the two trees may differ (25). The alternative hypothesis relaxes the constraint that the branch lengths for the host and parasite trees are proportional. The null hypothesis that the branching times are identical cannot be rejected, which is consistent with a model of cospeciation for five of the associated gopher and louse species. Because these species appear to have cospeciated, we can also examine whether the substitution rate differs between gophers and lice (24, 25, 33). An LRT of the null hypothesis that the substitution rates are identical in hosts and parasites reveals that the substitution rate is much higher in lice than in gophers [3.01 ± 0.53 times the rate for gophers (25)]. This rate difference may have several biological explanations, including a higher mutation rate in lice or a shorter generation time (24).

## Prospects for Likelihood Ratio Tests in Phylogenetics

The field of phylogenetics has seen remarkable advances in the past 40 years; the principal aim has progressed from reconstructing phylogenies, with little concern for sources of error, to evaluating the reliability of trees and (more recently) addressing biological questions using phylogenies. Maximum likelihood and LRTs have played an important role in the development of phylogenetics and should continue to provide a source for advances. In many ways, testing evolutionary hypotheses that are dependent on phylogeny presents an unusual and difficult statistical problem to the evolutionary biologist. However, it appears that standard statistical approaches may be applied successfully. We have shown that LRTs can be used to study a wide range of biological questions, such as the fit of a substitution model to sequence data and the agreement of phylogenies estimated from different data sets. However, the application of LRTs in phylogenetics is a relatively recent phenomenon, and the range of questions that can be addressed by LRTs is currently limited (Table 1). For example, several questions of general interest in biology, such as whether two or more characters are correlated (1), can be addressed using LRTs only in restricted circumstances (34). Moreover, questions concerning morphological evolution are difficult to address using LRTs because realistic models of morphological evolution are generally lacking.

Although LRTs have proven useful for studying a variety of biological hypotheses, several unresolved questions remain concerning the general utility of the approach. Few studies have examined the power of LRTs for testing particular phylogenetic hypotheses, or whether such tests are biased (16, 35). Another problem involves the computational expense of the hypothesis testing procedure; the likelihood is repeatedly maximized for many simulated data sets, and this can quickly stress the computer resources of most research laboratories. A potential solution to this problem is to perform a small number of replicates and then fit a probability distribution, such as a χ^{2} or gamma, to the simulated likelihoods. Also, simple LRTs may not be appropriate in all situations. Methods of sequential analysis are needed when a hypothesis is originally tested using one data set and later reexamined using additional data (36).

Explicit model-based methods are a recent innovation in phylogenetics. One advantage of these approaches is that the exact hypothesis being tested is clear if the test is properly formulated. These methods also offer the possibility that evolutionary models may be gradually improved as new biological processes are discovered and incorporated into the models used for phylogenetic analysis. Statistical approaches to phylogenetic inference have led to many improvements in our understanding of the process of DNA substitution over the past decade, allowing a much broader range of biological questions to be examined in a rigorous way.

\*To whom correspondence should be addressed. E-mail: johnh@mws4.biol.berkeley.edu