Research Article

Genomic architecture and introgression shape a butterfly radiation

See allHide authors and affiliations

Science  01 Nov 2019:
Vol. 366, Issue 6465, pp. 594-599
DOI: 10.1126/science.aaw2090

Following gene flow in butterfly genomes

The role of hybridization in evolution and species radiations has long been debated. In Heliconius butterflies, introgression was a major factor in their radiation, and the genetic variation it imparted into species is variable across the genome. Edelman et al. developed a new sequencing strategy and produced 20 Heliconius genomes (see the Perspective by Rieseberg). They also developed a means by which to identify genetic variation that originates from incomplete lineage sorting versus hybridization. Applying this model to their newly developed genomes, they investigated the evolutionary history of the genus and, in particular, the impact of introgression.

Science, this issue p. 594; see also p. 570


We used 20 de novo genome assemblies to probe the speciation history and architecture of gene flow in rapidly radiating Heliconius butterflies. Our tests to distinguish incomplete lineage sorting from introgression indicate that gene flow has obscured several ancient phylogenetic relationships in this group over large swathes of the genome. Introgressed loci are underrepresented in low-recombination and gene-rich regions, consistent with the purging of foreign alleles more tightly linked to incompatibility loci. Here, we identify a hitherto unknown inversion that traps a color pattern switch locus. We infer that this inversion was transferred between lineages by introgression and is convergent with a similar rearrangement in another part of the genus. These multiple de novo genome sequences enable improved understanding of the importance of introgression and selective processes in adaptive radiation.

Adaptive radiations play a fundamental role in generating biodiversity. Initiated by key innovations and ecological opportunity, radiation is fueled by niche competition that promotes rapid diversification of species (1). Reticulate evolution may enhance radiation by introducing genetic variation, enabling rapidly emerging populations to take advantage of new ecological opportunities (2, 3). Diverging from its sister genus Eueides ~12 million years (My) ago, Heliconius radiated in a burst of speciation in the last ~5 My (4). Introgression is well known in Heliconius, with widespread reticulate evolution across the genus (5), although this has been disputed (6). Nonetheless, how introgression varies across the genome is known only in one pair of sister lineages (7, 8). Here, we use multiple de novo whole-genome assemblies to improve the resolution of introgression, incomplete lineage sorting (ILS), and genome architecture in deeper branches of the Heliconius phylogeny.

Phylogenetic analysis

We generated 20 de novo genome assemblies for species in both major Heliconius subclades and three additional genera of Heliconiini. We then aligned the 16 highest-quality Heliconiini assemblies to two Heliconius reference genomes and seven other Lepidoptera genomes, resulting in an alignment of 25 taxa (9). De novo assembly provides superior sequence information for low-complexity regions, allows for discovery of structural rearrangements, and improves alignment of evolutionarily distant clades (10). Other studies in Heliconius have shown a high level of phylogenetic discordance, arguably a result of rampant introgression (4, 5). We attempted to reconstruct a bifurcating species tree by estimating relationships using protein-coding genes, conserved coding regions, and conserved noncoding regions. We generated phylogenies with coalescent-based and concatenation approaches using both the full Lepidoptera alignment and a restricted, Heliconiini-only subalignment. These topologies were largely congruent among analytical approaches, but weakly supported nodes were resolved inconsistently. These approaches therefore failed to resolve the phylogeny of Heliconius as a simple bifurcating tree (Fig. 1A and fig. S20).

Fig. 1 Phylogeny and phylogenetic networks of Heliconius do not support a bifurcating tree.

(A) All nodes resolved in a majority of species trees are shown in this cladogram (heavy black lines), whereas the poorly resolved silvaniform clade is collapsed as a polytomy (fig. S20). The 500 colored trees were sampled from 10-kb nonoverlapping windows and constructed with maximum likelihood. (B and C) High-confidence tree structure (black) and introgression events (red) are shown as solid lines. Dashed red lines indicate weakly supported introgression events. Gray branch ends are cosmetic. The melpomene-silvaniform clade is shown in (B) and the erato-sara clade in (C). Euclidean lengths of solid black lines are proportional to genetic distance along the branches. Scale bars are in units of substitutions per site. Breaks at the base in (B) indicate that the branch leading to H. doris has been shortened for display.

To determine whether hybridization was a cause of the species tree uncertainty, we calculated Patterson’s D statistics (11) for every triplet of the 13 Heliconius species using a member of the sister genus, Eueides tales, as the outgroup. In 201 of 286 triplets, we observed values significantly different from zero based on block-jackknifing, demonstrating strong evidence for introgression (fig. S53). However, these tests alone yield little quantitative information about admixture. We therefore used phyloNet (12) to infer reticulate phylogenetic networks of these species based on random samples of 100 10-kb windows across the alignment. For each sample, we coestimated all 100 regional gene trees and the overall species network in parallel (12). To improve alignments, we analyzed the melpomene-silvaniform group with respect to the Heliconius melpomene Hmel2.5 assembly (13) and the erato-sara group with respect to the H. erato demophoon v1 assembly (9, 14). Most species exhibited an admixture event at some point in their history using this method; we confirmed extensive reticulation among silvaniform species and discovered major gene-flow events in the erato-sara clade. On the basis of these results, we propose the reticulate phylogenies shown in Fig. 1, B to C.

Correlation of local ancestry with genome architecture

We next analyzed the distribution of tree topologies across the genome, again treating each major clade separately and using its respective reference genome. The melpomene-silvaniform group lacked topological consensus, unsurprisingly because introgression, especially of key mimicry loci, is well known in this clade (15). The most common tree topology was found in only 4.3% of windows, with an additional 14 topologies appearing in 1.0 to 3.4% of windows (fig. S19 to S21). By contrast, we here focus on the erato-sara group, in which two topologies dominate (Fig. 2). One (Fig. 2B, Tree 2) matched our bifurcating consensus topology (Fig. 1A) and a recently published tree (4), whereas the other (Tree 1) differs in that it places H. hecalesia and H. telesiphe as sisters.

Fig. 2 Local evolutionary history in the erato-sara clade is heterogeneous across the genome.

(A) Each bar represents a chromosome, in terms of the H. erato reference (14). Colored bands represent tree topologies of each 50-kb window; colors correspond to the topologies in (B), with black regions showing missing data. (B) The eight most common trees. The value in the top left corner is the percentage of all 50-kb windows that recover that topology. (C) Each histogram corresponds to the topology of the same color in (B) and shows the distribution of the number of consecutive 50-kb windows with that topology. Arrows indicate long blocks in inversions.

Regions with local topologies discordant from the species tree may have arisen through introgression or ILS. To make within-topology locus-by-locus inferences, we developed a statistical test to distinguish between ILS and introgression based on the distribution of internal branch lengths among windows for a given three-taxon subtree, conditional on its topology. We call this method “quantifying introgression via branch lengths” (QuIBL). In the absence of introgression, we expect internal branch lengths of triplet topologies discordant with the species tree (due to ILS) to be exponentially distributed. However, if introgression has occurred, then their distribution should have that same exponential component but also include an additional component with a nonzero mode corresponding to the time between the introgression event and the most recent common ancestor of all three species (9). Like other tree-based methods, QuIBL is potentially sensitive to the assumption that each tree is inferred from loci with limited internal recombination (fig. S75). We therefore chose small (5-kb) windows to reduce the probability of intralocus recombination breakpoints.

For every triplet in the erato-sara clade, we calculated the likelihood that the distribution of internal branch lengths is consistent with introgression or with ILS only. We formally distinguished between these two models using a Bayesian information criterion (BIC) test with a strict cutoff of ∆BIC > 10. Consistent with our results from D statistics, we found that 13 of 20 triplets have evidence for introgression (table S13). For example, using QuIBL on the triplet H. eratoH. hecalesiaH. telesiphe, we infer that 76% of discordant loci, or 38% of all loci genome-wide, are introgressed. Averaging over all triplets, we infer that 71% (67% with BIC filtering) of loci with discordant gene trees have a history of introgression, or 20% (19% with BIC filtering) of all triplet loci, indicating a broad signal of introgression throughout the clade [Eq. 7.7, table S13; see (9) for additional discussion].

In hybrid populations, individuals have genomic regions that originate from different species and may be incompatible with the recipient genome or with their environment (16). Linked selection causes harmless or even beneficial introgressed loci to be removed along with these deleterious loci if they are tightly linked; this effect depends on the strength of selection and the local recombination rate (17, 18). We therefore expect introgressed loci to be enriched in regions where selection is likely to be weak, such as gene deserts or regions of high recombination, where harmless introgressed loci more readily recombine away from linked incompatibility loci.

In Heliconius, even distant species such as H. erato and H. melpomene have the same number of broadly collinear chromosomes (13), facilitating direct comparisons among species. Furthermore, each chromosome in Heliconius has approximately one crossover per meiosis in males (there is no crossing over in female Heliconius) (14, 19). Chromosomes vary in length, and chromosome size is inversely proportional to recombination rate per base pair (8, 13). We found a strong correlation between the fraction of windows in each chromosome that show a given topology and physical chromosome length (Fig. 3A). Such relationships exist for all eight trees in Fig. 2B (9), but we focus here on the two most common trees: Tree 1 has a strongly negative correlation with chromosome size (r2 = 0.883, t = 11.7, 18 df, p < 0.0001), whereas Tree 2 (concordant with our inferred species tree) has a positive correlation (r2 = 0.726, t = 6.9, 18 df, p < 0.0001). Results from QuIBL indicate that 94% of windows that recover a Tree 1 triplet topology are consistent with introgression (fig. S70 and table S13). The Z (sex) chromosome 21 is strongly enriched for Tree 2, suggesting that it may harbor more incompatibility loci than autosomes. Interspecific hybrid females in Heliconius are often sterile, conforming to Haldane’s rule, and sex chromosomes have been implicated as being particularly important in generating incompatibilities (8, 2024).

Fig. 3 Chromosomal architecture is strongly correlated with local topology.

Tree 1 is shown in red and Tree 2 is shown in blue, as in Fig. 2. (A) Tree 1 shows a negative relationship with chromosome size, whereas Tree 2 shows a positive relationship. Lines are linear regressions with chromosome 21 excluded. Numbers along the top indicate chromosome number. (B) Each chromosome was divided into 10 equally sized bins, and the occupancy of each topology in each bin was calculated as the number of windows that recovered the topology in the bin divided by the number of windows that recovered the topology in the chromosome. (C) Windows are binned by recombination rate, and boxes show the fraction of each tree in each bin for each chromosome separately. Numbers above boxes are the numbers of windows in each bin. (D) Boxes showing the relationship of tree topology with coding density. Asterisk denotes significance at the 5% level (paired t test, p < 0.025). In all boxplots, the central line is the median, box edges are first and third quartile, and whiskers extend to the largest value no farther than 1.5 × (interquartile range).

To test whether the pattern that we observed among chromosomes is related to differences in recombination, we investigated the relationship between recombination rate and tree topology within chromosomes. The recombination rate declines at the ends of chromosomes (fig. S85), and the species tree (Tree 2) is more abundant in those regions (Fig. 3B). In addition, when windows were grouped by local recombination rate calculated from population genetic data (9, 14), we observed a strong relationship with the recovered topology (Fig. 3C). Finally, we observed a minor enrichment of Tree 1 in regions of very low gene density, but this effect was weak (Fig. 3D) compared with that of recombination. Taken together, these results show that tighter linkage on longer chromosomes and in lower recombination regions within chromosomes leads to removal of more introgressed variation in those regions. This very strong correlation is consistent with a highly polygenic architecture of incompatibilities between species.

Introgression of a convergent inversion

The topology block size distribution in the erato clade generally decayed exponentially (Fig. 2C), but two unusually long blocks contained minor topologies: one on chromosome 2 (Tree 3, composed of three sub-blocks) and the other on chromosome 15 (Tree 4). Our study of the ~3-Mb topology block on chromosome 2 confirms an earlier finding of an inversion in H. erato (13), and we show here that its rare topology is most likely explained by ILS, including a long period of ancestral polymorphism (fig. S95).

The topology block on chromosome 15 is of particular interest because it spans cortex, a genetic hotspot of wing color pattern diversity in Lepidoptera (25, 26). We hypothesized that this block could be an inversion, as in H. numata, where the P1 “supergene” inversion polymorphism around cortex controls color pattern switching among mimicry morphs (27). This block recovers H. telesiphe and H. hecalesia as a monophyletic subclade, which together are sisters to the sara clade (Fig. 2B, Tree 4). We searched our de novo assemblies for contigs that mapped across topology transitions. Taking H. melpomene as the standard arrangement, we found clear inversion breakpoints in H. telesiphe, H. hecalesia, H. sara, and H. demeter. Conversely, H. erato, H. himera, and E. tales all contain contigs that map in their entirety across the breakpoints (Fig. 4A), implying that they have the ancestral H. melpomene arrangement.

Fig. 4 Parallel evolution of a major inversion at the cortex supergene locus.

(A) Map of 1.7-Mb region on chromosome 15. Coordinates are in terms of Hmel 2.5 and ticks are in Mb. Tree topology colors correspond to those in Fig. 2. Genes are shown as black rectangles; cortex is highlighted in yellow. Each line shows the mapping of a single contig. Aligned sections of each contig are shown as thick bars, whereas unaligned sections are shown as dotted lines. Arrows indicate the strand of the alignment. The H. erato group breakpoints are shown with red vertical lines and the H. numata breakpoints are shown with green vertical lines. (B) Evolutionary hypotheses consistent with the topology observed in this inversion in the context of the previously estimated phylogenetic network. The three species used in the triplet gene tree method, H. erato, H. telesiphe, and H. sara, are shown as black lines; lineages not included are shown as gray lines. (C) Histogram of internal branch lengths (T2) in windows with the topology H. erato (H. telesiphe, H. sara). The inferred ILS distribution is shown as a dashed line, and the inferred introgression distribution is shown as a dotted line. The average internal branch length in the inversion is shown as a green vertical line. (D) Histogram of normalized DXY (T3) between H. telesiphe and H. sara. Mean normalized DXY in the inversion is shown as a green vertical line.

This chromosome 15 inversion covers almost exactly the same region as the 400-kb P1 inversion in H. numata (25, 27, 28). However, de novo contigs from our H. numata assembly show that the breakpoints of P1 are close to but not identical to those of the inversion in the erato clade (Fig. 4A). Furthermore, in topologies for H. numata, H. telesiphe, H. erato, and E. tales across chromosome 15, not a single window recovered H. numata and H. telesiphe as a monophyletic subclade, as would be expected if the erato group inversion were homologous to P1 in H. numata.

We used QuIBL with the triplet H. eratoH. telesipheH. sara to elucidate the evolutionary history of this inversion. A small internal branch would suggest ILS, whereas a large internal branch would be more consistent with introgression (Fig. 4B). The average internal branch length in the inversion was much longer than the genome-wide average, corresponding to a 79% probability of introgression (Fig. 4C). If the inversion were polymorphic in the ancestral population for some time, then we could also recover a similarly long internal branch (Fig. 4B, center). We distinguished between this longer-term polymorphic scenario and introgression by comparing the genetic distance (DXY) between H. telesiphe and H. sara, represented by T3 in Fig. 4B. Normalized DXY (as in fig. S95) within the inversion is ~25% less than in the rest of the genome. Given that this is a large genomic block, introgression is therefore the most parsimonious explanation for the evolutionary history of the inversion (Fig. 4D) (29).


Species involved in rapid radiations are prone to hybridization because of frequent geographical overlap with closely related taxa. In both the melpomene and erato clades of Heliconius, introgression has overwritten the original bifurcation history of several species across large swathes of the genome, a pattern also observed in Anopheles mosquitos (30). This observation is also consistent with genomic analysis of other rapid radiations characterized by widespread hybridization and introgression, including Darwin’s finches (2) and African cichlids (31). In other radiations, the role of introgression is less clear: in Tamias chipmunks, widespread introgression of mitochondrial DNA was identified, in contrast to an absence of evidence for nuclear gene flow (32). With few genomic comparisons available to date, it is perhaps too early to say whether introgression is a major feature of adaptive radiations in general, but evidence thus far suggests this to be the case.

Our results raise the question of why some genomic regions cross species boundaries and others do not. In the erato clade, we found a strong correlation between recombination rate and introgression probability. Similar associations with topology also exist between sister species in the melpomene clade (8). Associations between recombination and introgression in hybridizing populations of fishes and monkey flowers (Mimulus spp.) support the role of linked selection on a highly polygenic landscape of interspecific incompatibilities (18, 33, 34). Our results establish that this relationship persists and may indeed be strengthened with time since introgression. While hybridization is ongoing, many introgressed blocks are constantly reintroduced into the population. If linked to weakly deleterious alleles, introgressed loci will finally be purged by linked selection only long after introgression ceases.

Recombination rate alone cannot account for differential introgression, so we must delve into specific regions to elucidate their function and relevance to speciation. It is critical, therefore, to have tools that can confidently identify introgressed loci, and much effort has gone into developing such methods (11, 35). Our test using internal branch lengths in triplet gene trees is based in coalescent theory and takes advantage of the discriminatory power of a property of gene trees not explicitly accounted for by other methods. QuIBL allows us to assess probability of introgression for each locus in each species triplet (9). Here, we used this method to identify the evolutionary origin of a convergent inversion that has undergone multiple independent introgression events and to show that genomic regions with discordant topologies arose mostly through hybridization. Just as sex aids adaptation within species, occasional introgression and recombination among species can have major long-term effects on the genome, contributing variation that could fuel rapid adaptive divergence and radiation.

Supplementary Materials

Materials and Methods

Figs. S1 to S95

Tables S1 to S14

References (3991)

References and Notes

  1. Materials and methods are available as supplementary materials.
Acknowledgments: We thank the Harvard FAS Research Computing team, the BYU Fulton Supercomputing Lab, and the Smithsonian Institution High Performance Cluster (SI/HPC) for their support. Conversations with S. Martin were instrumental in developing our thinking around recombination rate. We also thank J. Edelman for assistance with the Scaffolding with DISCOVAR pipeline and J. Zhu and L. Nakhleh for their guidance with the PhyloNet toolkit. We thank C. Frandsen for assistance with figure design, E. Harney for assistance with qpGraph, as well as N. Rosser, F. Seixas, T. Xiong, P. Muralidhar, and C. Veller for illuminating discussions. Funding: This project was funded by a SPARC Grant from the Broad Institute of Harvard and MIT and startup and studentship funds from Harvard University to J.M. and N.B.E. Additional funding was received through NSF grant DGE1745303 to M.M.; BBSRC Core Strategic Programme Grant BB/CSP17270/1 at the Earlham Institute to B.C., G.G.-A., and F.D.P.; a Herchel Smith Postdoctoral Research Fellowship and a Smithsonian Tropical Research Institute Fellowship to J.D.; NSF grant OIA 1736026 to B.C. and R.P.; Brazilian National Council for Scientific and Technological Development” (CNPq) grant 309853/2014-1 to G.R.P.M.; Fondos Concursables Universidad del Rosario 2016-PIN-2017-001 to C.S.; Marie Sklodowska-Curie fellowship (FITINV, N 655857) and NSERC fellowship to M.C.; NERC award UKSBS PR18037 to M.B.; NSF grant IOS-1656514 to R.D.R.; NSF grant IOS-1656389 to R.P.; NERC NE/K012886/1 to K.K.D.; and NSF grant IOS-1452648 and NIH grant GM108626 to M.K. M.J. was supported by French National Research Agency: ANR-12-JSV7-0005-HybEvol and ANR-18-CE02-0019-Supergene. A.J.B. was supported by NIH grants 5U54CA193313 and GG010211-R01-HIV and AFOSR grant FA9550-18-1-0415. D.E.N. was supported in part by federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under grant U19AI110818 to the Broad Institute. D.J.’s work on DISCOVAR de novo was funded in part with federal funds from the National Human Genome Research Institute, U.S. National Institutes of Health, U.S. Department of Health and Human Services, under grants R01HG003474 and U54HG003067. LepBase development was supported by BBSRC grants BB/K020161/1 and BB/R015325/1. The Smithsonian Tropical Research Institute provided funding for W.O.M. R.P. and S.M.V.B. were additionally supported by the University of Puerto Rico (Puerto Rico INBRE grant P20 GM103475 from the National Institute for General Medical Sciences). Author contributions: N.B.E., N.P., D.E.N., R.C., S.K., M.B., R.D.R., K.K.D., M.K., M.J., C.D.J., W.O.M., D.J., and J.M. conceived the project; N.B.E., P.B.F., M.M., B.C., D.J., and J.M. designed experiments; J.D., G.R.P.M., C.S., M.C., B.A.C., R.D.R., K.K.D., M.J., C.D.J., W.O.M., and J.M. supplied specimens and extracted DNA; D.J., B.C., G.G.-A., and F.D.P. assembled genomes; P.B.F. inferred species trees; N.B.E. performed reference gap filling, phylogenetic network, local gene tree, recombination rate, and genome structure analyses; J.D. and S.M.V.B. generated linkage maps; M.M., J.W., and A.J.B. developed the branch length test; and N.B.E. and J.M. wrote and edited the manuscript with input from all authors. Competing interests: The authors declare no competing interests. Data and materials availability: Reads generated are available in the SRA, BioProject PRJNA531399. Genome assemblies are available on LepBase ( The full multispecies, whole-genome alignment, as well as subalignments for windows and their respective inferred phylogenies, are available on Dryad (36). Also on Dryad are configuration files for progressiveCactus, the inferred LD recombination map for H. erato, loci used for phyloNet, and repeatMasker output for all assemblies. All code used to analyze data is available on Zenodo (37, 38).
View Abstract

Stay Connected to Science

Navigate This Article