Phylogenetic MCMC Algorithms Are Misleading on Mixtures of Trees

See allHide authors and affiliations

Science  30 Sep 2005:
Vol. 309, Issue 5744, pp. 2207-2209
DOI: 10.1126/science.1115493


Markov chain Monte Carlo (MCMC) algorithms play a critical role in the Bayesian approach to phylogenetic inference. We present a theoretical analysis of the rate of convergence of many of the widely used Markov chains. For N characters generated from a uniform mixture of two trees, we prove that the Markov chains take an exponentially long (in N) number of iterations to converge to the posterior distribution. Nevertheless, the likelihood plots for sample runs of the Markov chains deceivingly suggest that the chains converge rapidly to a unique tree. Our results rely on novel mathematical understanding of the log-likelihood function on the space of phylogenetic trees. The practical implications of our work are that Bayesian MCMC methods can be misleading when the data are generated from a mixture of trees. Thus, in cases of data containing potentially conflicting phylogenetic signals, phylogenetic reconstruction should be performed separately on each signal.

Bayesian inference is one of the most popular methods in phylogeny reconstruction (1). Many widely used software packages, such as MrBayes (2), BAMBE (3), and PAML (4), rely on Markov chain Monte Carlo (MCMC) methods. These algorithms are often known as BMCMC. Part of the appeal of BMCMC is that they are supposed to be more robust and faster than standard maximum likelihood approaches. Our results show that these appealing features are overly optimistic in some settings.

The basis of the MCMC algorithms is a Markov chain whose stationary distribution is the desired posterior distribution. Reliable phylogenetic estimates depend on the Markov chain converging to the posterior distribution before any phylogenetic inferences are made. Typically it is elementary to establish that the Markov chains eventually converge to the posterior distribution. However, convergence after an infinite number of iterations is not of practical use. The chains need to converge quickly to the posterior distribution in order to be considered useful. Unfortunately, it is notoriously difficult to rigorously analyze the convergence rate of the Markov chains used for phylogeny. In practice, heuristics [such as multiple starting states and convergence of log-likelihood plots (5)] are commonly used to determine when the Markov chains have converged to the posterior distribution.

The major difficulty in analyzing these Markov chains is our poor understanding of the geometry of the space of tree topologies weighted by the likelihood function (this geometric space is often referred to as the tree space). Work has been done on the likelihood function for fixed trees on three and four leaves (6, 7) [abstract properties of tree space on any number of leaves also have been analyzed (811)].

In our work, we consider N characters generated from a uniform mixture of two trees on n ≥ 5 leaves and show that BMCMC takes an exponential number of iterations to converge. Our proofs also yield detailed information on the geometry of the likelihood function on the tree space for five or more leaves. It has its basis in combinatorial analysis techniques that are further discussed in (12).

The bases of our results are the following two trees of five taxa: T1 is given by ((12),3),(45) in the standard Newick format, whereas T2 is given by ((14),3),(25); see the yellow trees in Fig. 1 for an illustration. Our results apply to cases where the data are generated from a uniform mixture of two trees, Math and Math, on n ≥ 5 leaves, where for some subset S of leaves, the subtree of Math on S is Math, and the subtree of Math on S is Math. Moreover, the branch lengths on subtrees Math and Math must lie in the following zone: The branch lengths of terminal branches (i.e., edges incident to the leaves) are between a and a2; the branch lengths of internal branches are between a and 2a; and the number a satisfies 0 < a < b, where b is some small positive constant. The assumptions on the branch lengths of T1 and T2 are essential in the details of the proof.

Fig. 1.

The space of phylogenetic trees on five taxa connected by NNI transitions. For the mixture distribution on T1 and T2 in the CFN model with internal branch lengths a = 0.1 and terminal branch lengths a2 = 0.01, the yellow trees, with optimal branch lengths, had expected log likelihood J(T) ≈ –1.887, the green trees had J(T) ≈ –1.934, and the blue trees had J(T) ≈ –1.986. For the Jukes-Cantor model with a = 0.1, the yellow trees had J(T) ≈ –4.293, the green trees had J(T) ≈ –4.338, and the blue trees had J(T) ≈ –4.544. Note, for SPR and TBR transitions, each yellow tree is connected to 12 other trees but not to each other. Thus, to travel between the two yellow trees by NNI, SPR, or TBR transitions, we need to traverse through a valley (i.e., trees with lower log-likelihood).

For each of the trees Math and Math, the character data are generated by using any of the standard mutation models, such as the Cavender-Farris-Neyman (CFN) model, the Jukes-Cantor model, and Kimura's two parameter model [see (13) for an introduction to these models]. Moreover, our results hold for almost any prior distribution on branch lengths used in BMCMC including those discussed in (14, 15); see (12) for more details.

Our results are valid for two families of BMCMC. In the first family, the MCMC performs a random walk on the discrete set of tree topologies. The transition probabilities are determined by the Metropolis rule (16) using the Bayesian probability of tree topology given the data (14). In the second family, we consider MCMC performing a random walk on the continuous space of tree topologies and branch lengths (3, 17). For both families the moves that change the tree topology may be nearest-neighbor interchanges (NNI), subtree pruning and regrafting (SPR), or tree bisection and reconnection (TBR) moves; see (14) for an introduction to these transitions.

In order to measure convergence of the Markov chain, we use the notion of mixing time, Tmix, which is standard in probability theory (18). The mixing time is, for the worst initial state T0, the first time that the total variation distance between the distribution of Tt (i.e., the chain at time t) and the stationary distribution is at most 1/4. (The constant 1/4 is somewhat arbitrary and simply needs to be <1/2.) The above definition of mixing time implies that for any ϵ > 0, after ≤Tmix iterations, the Markov chain is variation distance ≤ϵ from the stationary distribution.

We can now state our main theoretical result: There exists a constant c > 0 such that in the setting described above, given N characters, Math, generated from the mixtures of Math and Math on n ≥ 5 taxa, with probability at least 1 – exp(–cN) over the data generated, the mixing time of MCMC algorithms with NNI, SPR, or TBR transitions is at least exp(cN).

The formal proof of this statement appears in (12). We follow with a heuristic argument. The algorithmic computations below were performed for both the binary CFN model (1921) and the Jukes-Cantor model. In both of these models, the branch length a = αt where α is the rate from state i to j, for ij, and t is time.

The convergence properties of a Markov chain requires a detailed understanding of the weighted geometry of the state space. One aspect of the geometry is depicted in Fig. 1. In this figure, two trees are joined by an edge if they are connected by a single NNI transition. One can see that our two generating trees T1 and T2 have maximum distance.

The second aspect of the geometry are the posterior probabilities of tree topologies. The posterior probability of tree topology T is denoted by w(T). It is natural to expect that for long sequences, w(T) is essentially determined by the branch lengths that maximize the expected log-likelihood. In other words, Math where the expectation is over the probability distribution μ generating the data.

Figure 1 considers the expected log-likelihood J(T) for data generated by taking independent samples from our mixture μ of the two trees T1 and T2 on five taxa, where all internal branches have length a = 0.1 and all terminal branches have length a2 = 0.01. In Fig. 1, our generating trees have maximum expected log-likelihood J(T), and the intermediate trees in this space have smaller expected log-likelihood. Thus, to traverse between the two maxima trees, we need to traverse a valley. Such a valley implies slow convergence, via mathematical techniques known as conductance and isoperimetric inequalities (18).

A similar picture holds for SPR and TBR transitions. On trees on five taxa TBR and SPR moves are identical. To traverse between the trees with maximum expected likelihood, T1 and T2, we need to pass through trees with smaller expected likelihood. This is the key property that implies slow convergence. For SPR and TBR transitions, tree T1 and T2 are connected to 12 other trees but not to each other.

In Fig. 2 we show how the maximum expected log-likelihood J(T) varies with the NNI distance from the generating trees T1 and T2 for varying values of internal branch length a in the generating trees, where the terminal branch lengths are given by a2.

Fig. 2.

The depth of the valley between the generating trees T1 and T2 with five taxa. The chart shows the maximum expected log-likelihood for different classes of trees for varying generating parameter a where the internal edges have length a and the terminal edges have length a2. For the green trees and blue trees from Fig. 1, we plot the value of J(T) – J(T1). For the yellow trees (i.e., the generating trees T1 and T2), this value is 0.

The implications of mixture distributions to phylogeny has recently received considerable theoretical attention (22, 23) and has clear practical importance. A simple example that often contains characters from multiple trees is molecular data consisting of DNA sequences for more than one gene. It is well known that phylogenetic trees can vary between genes [for example, see (24) for a discussion].

The numerical values of the constants a and c are not explicit in our results. However, simulations suggest that even moderate values such as a = 0.1 and N = 1000 have very slow convergence, and in fact starting at the tree T1 or T2 it will never visit any other tree topology. Moreover, in both cases, the likelihood plot suggests quick convergence. For these parameters, the behavior of the chain on data coming from mixtures or from data generated from a single tree is indistinguishable for as long as we run our experiments (millions of iterations). See fig. S1 (12) for log-likelihood plots illustrating these examples.

For small trees one can hope to overcome the slow convergence by using multiple starting states. However, mixtures coming from large trees may contain multiple species subsets where one tree has T1 as an induced subtree and the other has T2. If there are k such subsets, then about 15k random starting points will be needed. Thus if there are 10 disagreement subsets, then 1510 random starting points will be needed in order to sample from the posterior distribution.

A popular MCMC program for phylogeny, MrBayes (2), uses Metropolis-coupled MCMC (25), which is designed to avoid bottlenecks in the state space. A key open question is to understand whether Metropolis-coupled MCMC is successful in avoiding bottlenecks created by mixtures. Resolving this question will require more delicate and detailed mathematical analysis. It is known that Metropolis-coupled MCMC converges exponentially slowly in some settings but avoid bottlenecks in others (26), but even in these simple cases a very detailed understanding of the state space is needed. If Metropolis-coupled MCMC successfully avoids the bottlenecks created by mixtures, then it may serve as a useful tool for identifying data generated from mixtures.

In our setting, BMCMC methods fail in a clearly demonstrable manner. We expect that there is a more general class of mixtures where BMCMC methods fail in more subtle ways. These subtle failures may occur for many real-world examples where the Markov chains quickly converge to some distribution other than the desired posterior distribution. Users of BMCMC methods should ideally avoid mixture distributions that are known to produce degenerate behavior in various phylogenetic settings (27, 28). A good practice is to decompose the data into nonconflicting signals and perform phylogenetic reconstruction separately on each signal. Our work highlights important unresolved questions: how to verify homogeneity of genomic data and what phylogenetic methods can efficiently deal with mixtures.

Supporting Online Material

Materials and Methods

SOM Text

Figs. S1 and S2

References and Notes

View Abstract

Navigate This Article