Research Article

Ancient hybridizations among the ancestral genomes of bread wheat

See allHide authors and affiliations

Science  18 Jul 2014:
Vol. 345, Issue 6194, 1250092
DOI: 10.1126/science.1250092

Abstract

The allohexaploid bread wheat genome consists of three closely related subgenomes (A, B, and D), but a clear understanding of their phylogenetic history has been lacking. We used genome assemblies of bread wheat and five diploid relatives to analyze genome-wide samples of gene trees, as well as to estimate evolutionary relatedness and divergence times. We show that the A and B genomes diverged from a common ancestor ~7 million years ago and that these genomes gave rise to the D genome through homoploid hybrid speciation 1 to 2 million years later. Our findings imply that the present-day bread wheat genome is a product of multiple rounds of hybrid speciation (homoploid and polyploid) and lay the foundation for a new framework for understanding the wheat genome as a multilevel phylogenetic mosaic.

The rise of modern agriculture and wheat domestication in the Fertile Crescent ~10,000 years ago (14) was pivotal in shaping modern human history. Early farming practices made use of wild diploid wheat species (i.e., Aegilops and Triticum species), but as agriculture evolved, wild crops were gradually substituted with domesticated diploid and polyploid wheat varieties (3, 4). Presently, the allohexaploid bread wheat (Triticum aestivum, 2n = 6x = 42 chromosomes; genomic code AABBDD) dominates global wheat production. Because of its economic value and the desire for its genetic improvement, questions concerning the evolution and domestication of wheat have been under intense scientific scrutiny (5, 6).

The bread wheat subgenomes A, B, and D were originally derived from three diploid (2x; 2n = 14) species within tribe Triticeae [see figure 1 in (7)]: Triticum urartu (AA), an unknown close relative of Aegilops speltoides (BB), and Ae. tauschii (DD) (4, 8). The initial allopolyploidization event is hypothesized to have involved the A and B genome donors, resulting in the extant tetraploid emmer wheat (T. turgidum; AABB). This species subsequently hybridized with the D genome donor to form modern hexaploid bread wheat (AABBDD) (4, 8).

Tetraploid emmer wheat is believed to have originated within the past few hundred thousand years (9), whereas hexaploid bread wheat is thought to have originated with modern agriculture ~10,000 years ago (4). The time of origin for hexaploid bread wheat is currently supported solely by archeological evidence (2, 3) and the apparent absence of hexaploid wheats in wild populations (4). Although the relatedness between the bread wheat subgenomes and diploid wheat species has been well documented (8, 10), a clear understanding of the phylogenetic history and divergence times among the three A, B, and D genome lineages is still lacking (9, 1113). This knowledge gap is mainly a consequence of the paucity of Triticeae fossils (14), which has prevented investigations of diversification through time; extensive topological discordance between wheat gene trees (15); and, most importantly, the lack of genome sequences of the hexaploid bread wheat and its close diploid relatives. Improved understanding of the phylogenetic relationships among the diploid species of wheat and the bread wheat subgenomes is important for understanding genome function and for future agricultural crop improvement in light of a changing global climate (16).

Gene tree topology analyses

We used the genome sequences of hexaploid bread wheat subgenomes (denoted TaA, TaB, and TaD) and five diploid relatives (T. monococcum, T. urartu, Ae. sharonensis, Ae. speltoides, and Ae. tauschii) (7, 17, 18) to generate a genome-wide sample of 275 gene trees and to estimate the phylogenetic history of the A, B, and D genome lineages. Barley (Hordeum vulgare), Brachypodium distachyon, and rice (Oryza sativa) were used as outgroup species. To generate multiple alignments of ortholog genes, we employed a phylogeny-aware strategy (19), which simultaneously filters alignments for unreliably aligned codon sites and putative erroneously predicted ortholog sequences (fig. S1 and supplementary materials and methods). Finally, we used BEAST (20) to calculate gene trees topologies.

We found that the basal relatedness among the three lineages A, B, and D varied substantially among the 275 gene trees, with the lineage topologies A(B,D) and B(A,D) each being about twice as common as D(A,B) (Fig. 1A and Table 1). Stochastic population genetic processes typically cause incomplete lineage sorting (ILS), which results in topological discordance (i.e., variation in topology) among individual gene trees. For three taxa under ILS alone, the gene tree topology that equals the species tree topology is expected to be more common than the other two. However, for our data the observed lineage topology ratios differed significantly [P < 0.01; likelihood ratio test; df = 1 (Table 1)] from this expectation. This suggests the presence of phylogenetic signals additional to ILS in the data. Except in rare instances of deep coalescence, bread wheat homeologs consistently formed monophyletic clades with orthologs of their close diploid relatives (Fig. 1A), and never with each other. This rules out nonhomologous gene conversion and nonhomologous recombination as explanations for the observed topological discordance. To enable analyses of individual chromosomes, we also made use of a considerably larger data set of 2269 maximum likelihood gene tree topologies taken from (7), which did not include the diploid Triticum and Aegilops species. These gene trees support genome-wide signals of significantly skewed topology frequencies, with B(A,D) and A(D,B) topologies being most common (Table 1). Taken together, these results suggest that lineages A and B are more closely related to D individually than to each other, which agrees with a model of hybrid origin of the D lineage (Fig. 1B).

Fig. 1 Analyses of gene tree topologies.

(A) Superimposed ultrametric gene trees in a consensus DensiTree plot. The branch color changes for every 100 trees plotted. (B) Topology-based species phylogeny, assuming incomplete lineage sorting using a data set of 2269 gene trees inferred by PhyloNet. The results presented represent analyses of all gene trees (2269). The numbers on the branches represent estimates of parental contributions to the hybrid. Range estimates of parental contribution are extrapolated from results reported in Table 1. Species names are abbreviated as follows: Ash, Aegilops sharonensis; Asp, Ae. speltoides; At, Ae. tauschii; Tm, Triticum monococcum; Tu, T. urartu; Hv, Hordeum vulgare; TaA, T. aestivum A subgenome; TaB, T. aestivum B subgenome; TaD, T. aestivum D subgenome; Bd, Brachypodium distachyon; Os, Oryza sativa.

Table 1 Distribution of ABD lineage topologies in gene trees.

Analyses were made on different data subsets: diploid genomes only (2x), hexaploid genomes only (6x), and either whole genomes or gene trees from individual chromosomes (Chr.). Interrelationships within the A, D, and B clades are not considered in the data set that includes diploids. Topologies including diploids were estimated with Bayesian MCMC sampling using the HKY+G nucleotide substitution model, whereas topologies excluding diploids were taken from IWGSC (7) and represent maximum likelihood topologies under the GTR+I+G model. Bold numbers represent the largest topology group. The likelihood ratio test was used to test the probability (P) of observing the data under the model of multispecies coalescent and the (conservative) assumption that the most common observed tree topology equaled the species tree topology.

View this table:

Under the assumption of ILS and a hybrid origin of the D lineage from the A and B lineages, analysis of tree topology frequencies revealed roughly equal contributions of each parental lineage to the D lineage (Table 1). There was considerable stochasticity among the different data subsets, and between 65 and 87% of the genes displayed deep coalescence for the target nodes. Similar results were obtained with a topology-based parsimony approach to estimate phylogenetic genome networks under the assumption of a single homoploid hybridization at the genome-wide level (Fig. 1B) and for the majority of the chromosome-specific analyses (table S4). Finally, ~80% of the genes could be anchored to a chromosome position in the hexaploid genome using the in silico gene order predictions from the bread wheat genome sequence (7). Such positional information can be used to investigate whether different regions of the genome have distinct phylogenetic signals—that is, conserved chromosome blocks from the parental genomes. Homeologs within gene trees showed highly conserved syntenic relationships (fig. S3); however, anchoring of gene tree topologies to chromosome positions in bread wheat did not support the presence of larger chromosome blocks with a single parental origin (7), indicating a relatively homogeneous hybrid signal throughout the D subgenome.

Genome divergence times

Because topology-based phylogenetic analyses do not consider the temporal scale, we used a Bayesian hierarchical model to further estimate the genome divergence times under the multispecies coalescent (21) from pairwise ortholog coalesent distributions. This approach does not assume a treelike species phylogeny and can handle very large data sets.

Concurrent with the topology-based analyses (Fig. 2, A and B, and fig. S4), genome divergence estimates showed no signs of being affected by nonhomologous gene conversion or recombination within the hexaploid genome, which would have resulted in shallow coalescence of bread wheat subgenomes (figs. S4 and S5). The basal divergence in Aegilops/Triticum was estimated to have occurred between the A and the B genome lineages ~7 million years ago (Ma) (Table 2). Both A-D and B-D divergence times overlap and are estimated to be 1 to 2 million years younger than the A-B divergence (Fig. 2, A and B). This contradicts a treelike phylogeny for the three subgenome lineages (Table 2 and Fig. 2A) and favors a model of hybridization between the A and B genomes, giving rise to the D genome. Genome divergence times did not support the more complex models of hybridization patterns, as suggested by the topology analyses assuming two hybridization events (table S4). Furthermore, the majority of the analyses produced slightly younger divergence of A and D lineages compared with B and D lineages (Fig. 2A and Table 2), indicating that gene flow from A to D may have persisted after gene flow from B to D had ceased.

Fig. 2 Coalescent-based genome divergence analyses.

Coalescence times were estimated as the median of Bayesian MCMC sampling in BEAST. Genome divergence estimates were inferred with a Bayesian hierarchical model through WinBUGS and the R2OpenBUGS R package (35). (A) Divergence times (mean, 95% credibility interval) for the genome lineages A, B, and D for 2269 gene trees, excluding diploid species. A-B, blue; A-D, red; B-D, green. (B) Genome divergence network including diploid and hexaploid wheat genomes. Node age is given as mean genome divergence time, estimated independently for each pair of species representing that node. For nodes with more than two decendant tips, age is given as the mean for all relevant pairwise species comparisons, and bars span from the lowest minimal to the highest maximal 95% bound for their credibility intervals. Due to evidence of recent interlineage hybridizations (both in topology and coalescence analyses) in the Ae. sharonensis and Ae. speltoides genomes, these species are not considered in the estimation of the ancestral A, B, and D lineage divergence.

Table 2 Estimated genome divergence times.

All age estimates are given in units of million years ago as 95% credibility intervals (CIs). The CI of the Tm-TaA divergence and the CI of the At-TaD divergence represent the summarized CI ranges of two hierarchical Bayesian models using median plus median. The Tm-TaA divergence and the At-TaD divergence are expected to be overestimates of the actual polyploidization times due to the fact that the true ancestral populations to the A and D subgenomes in bread wheat were not sampled. Species names are abbreviated as follows: At, Aegilops tauschii; Tm, Triticum monococcum; TaA, T. aestivum A subgenome; TaD, T. aestivum D subgenome. Dashes indicate no data.

View this table:

The identification of hybridization events in phylogenies strongly depends on taxon sampling. Nevertheless, given that the hybridization event happened basally in the Triticum/Aegilops clade and that the 15 extant diploid species all seem to fall within one of the three lineages A, B, and D (fig. S6), the hybridization pattern is likely to remain unaltered, even with a denser sampling of wheat species. Beyond the phylogenetic evidence presented here, support for a homoploid hybrid origin of the D genome is found in independent analyses using the genome sequence of bread wheat. Both at the base-pair level (7, 22) as well as in gene content (7), the A and B lineages are more similar to the D genome lineage than they are to each other.

Although the existence of homoploid hybrid speciation has been acknowledged for well more than a century (23), it has only recently been recognized as a relatively common phenomenon (2426). Our data support a homoploid hybrid origin of the bread wheat D genome lineage ancestor more than 5 Ma, which is among the oldest cases of homoploid hybrid speciation reported to date.

Divergence of polyploid genomes from diploid relatives

Genome divergence estimates showed that T. monococcum and T. urartu are successive sisters to TaA, Ae. speltoides is sister to TaB, and Ae. sharonensis and Ae. tauschii are successive sisters to TaD (Fig. 2B and table S5). To elucidate the timing of the polyploid speciations giving rise to emmer and bread wheat, we analyzed the pairwise coalescence distributions of the TaA and TaD genomes and the closest related diploid species, T. urartu and Ae. tauschii, respectively (Fig. 2B). We estimated the T. urartu-TaA divergence to 0.58 to 0.82 Ma and the Ae. tauschii–TaD divergence to 0.23 to 0.43 Ma (Fig. 2B and Table 1), suggesting that the age of these two polyploidization events could be older than previously suggested (4, 9). It is however important to note that this study includes only single-genome samples from T. urartu and Ae. tauschii and that these samples are not likely to represent plants from the actual ancestor populations that gave rise to the A and D genome constituents. Hence, our genome divergence estimates between diploid and hexaploid genomes are likely to reflect population divergence events before the actual polyploidization times, and additional analyses of broad samples of both T. urartu and Ae. tauschii populations are needed to further improve the accuracy of the polyploidization time estimates (27).

Conclusions

Our study exemplifies how analyses of whole-genome data sets can aid in resolving convoluted patterns of genome evolution caused by ancient hybridization events. We elucidate genome-wide signatures of hybrid ancestry of the wheat D lineage from A and B lineage ancestors (Fig. 3). Not only is bread wheat a product of hybridization and allopolyploidization involving the A, B, and D genomes, but also the ancestral lineages of these three genomes are the result of ancestral hybridization events among themselves. Our findings could have broad implications for understanding genome function and, thus, cultivar improvement in bread wheat.

Fig. 3 Model of the phylogenetic history of bread wheat (Triticum aestivum; AABBDD).

Approximate dates for divergence and the three hybridization events are given in white circles in units of million years ago. Differentiation of the wheat lineage (Triticum and Aegilops) from a common ancestor into the A and B genome lineages began ~6.5 Ma. The first hybridization occurred ~5.5 Ma between the A and B genome lineages and led to the origin of the D genome lineage by homoploid hybrid speciation. The second hybridization, between a close relative (BB) of Ae. speltoides and T. urartu (AA), gave rise to the allotetraploid emmer wheat (T. turgidum; AABB) by polyploidization. Bread wheat originated by allopolyploidization from a third hybridization, between emmer wheat and Ae. tauschii (DD). The three diploid lineages are indicated with color and labels. Inflorescences (spikes) illustrate extant species closely related to those involved in the polyploidizations.

Methods

Sequence data for T. urartu and Ae. tauschii were downloaded from GenBank (accession numbers AOTI00000000 and AOCO000000000.1, respectively). All other sequences were downloaded from the International Wheat Genome Sequencing Consortium (IWGSC) sequence repository at Institut National de la Recherche Agronomique (http://wheat-urgi.versailles.inra.fr/Seq-Repository/Genes-annotations). Preliminary ortholog predictions were carried out with OrthoMCL (28). Only orthologous gene sets containing single-gene copies from all species were used. A phylogeny-aware sequence alignment pipeline modified from (19) was implemented in R language using MAFFT (29) to construct multiple sequence alignments, GUIDANCE (30) to assess sequence alignment quality, and phangorn (31) to construct maximum likelihood topologies, which were used to evaluate ortholog relationships within gene trees. Ultrametric gene trees were estimated in BEAST (20), based on a secondary calibration on the Brachypodium crown node (fig. S2). All analyses of topologies were carried out with the APE R package (32). Estimations of parental genome contributions and species networks from gene tree topologies were calculated using parsimonious inference of hybridization in the presence of ILS (33) implemented in PhyloNet (34) and with a coalescent-based method (described in the supplementary materials). Topology-independent genome divergence times were estimated under the multispecies coalescent model with hierarchical modeling using BUGS (Bayesian inference Using Gibbs Sampling) (www.openbugs.net/w/FrontPage) through the R2OpenBUGS R package (35).

Supplementary Materials

www.sciencemag.org/content/345/6194/1250092/suppl/DC1

Materials and Methods

Supplementary Text

Figs. S1 to S6

Tables S1 to S5

References (3645)

IWGSC Author List

References and Notes

  1. 1.
  2. 2.
  3. 3.
  4. 4.
  5. 5.
  6. 6.
  7. 7.
  8. 8.
  9. 9.
  10. 10.
  11. 11.
  12. 12.
  13. 13.
  14. 14.
  15. 15.
  16. 16.
  17. 17.
  18. 18.
  19. 19.
  20. 20.
  21. 21.
  22. 22.
  23. 23.
  24. 24.
  25. 25.
  26. 26.
  27. 27.
  28. 28.
  29. 29.
  30. 30.
  31. 31.
  32. 32.
  33. 33.
  34. 34.
  35. 35.
  36. 36.
  37. 39.
  38. 40.
  39. 41.
  40. 42.
  41. 43.
  42. 44.
  43. 45.
  44. Acknowledgments: This work was financed by Norwegian Research Council grant 199387 to Norwegian University of Life Sciences and Graminor AS to O.-A.O. The phylogenetic analyses were run, in part, on the Bioportal supercomputing facility at the University of Oslo. We thank C.-P. Antoine and two anonymous reviewers for valuable comments on the manuscript and S. Sæbø for help on implementation of the OpenBUGS analyses in R. Ortholog alignments, gene trees, and the OpenBUGS model used for multispecies coalescent modeling can be downloaded from Dryad (data available from the Dryad Digital Repository: http://doi.org/10.5061/dryad.f6c34).
View Abstract

Navigate This Article