Report

Assembly and Validation of the Genome of the Nonmodel Basal Angiosperm Amborella

Science  20 Dec 2013:
Vol. 342, Issue 6165, pp. 1516-1517
DOI: 10.1126/science.1241130

Shaping Plant Evolution

Amborella trichopoda is understood to be the most basal extant flowering plant and its genome is anticipated to provide insights into the evolution of plant life on Earth (see the Perspective by Adams). To validate and assemble the sequence, Chamala et al. (p. 1516) combined fluorescent in situ hybridization (FISH), genomic mapping, and next-generation sequencing. The Amborella Genome Project (p. 10.1126/science.1241089) was able to infer that a whole-genome duplication event preceded the evolution of this ancestral angiosperm, and Rice et al. (p. 1468) found that numerous genes in the mitochondrion were acquired by horizontal gene transfer from other plants, including almost four entire mitochondrial genomes from mosses and algae.

Abstract

Genome sequencing with next-generation sequencing (NGS) technologies can now be applied to organisms that are pivotal to addressing fundamental biological questions but whose genomes were previously considered intractable or too expensive to sequence. However, for species with large and complex genomes, extensive genetic and physical map resources have, until now, been required to direct the sequencing effort and sequence assembly. As these resources are unavailable for most species, assembling high-quality genome sequences from NGS data remains challenging. We describe a strategy that uses NGS, fluorescence in situ hybridization, and whole-genome mapping to assemble a high-quality genome sequence for Amborella trichopoda, a nonmodel species crucial to understanding flowering plant evolution. These methods are applicable to many other organisms with limited genomic resources.

Amborella (1, 2) has been identified as the single sister species to all other living angiosperms and is a pivotal reference for comparison to other angiosperms (3). However, Amborella is not a genetic model and has no existing genetic map, genetic resources, or genome sequence. Although next-generation sequencing (NGS) provides deep genomic sequence coverage at low cost, short-read assembly remains difficult, and assessing assembly accuracy is problematic without independently derived genomic maps. We produced a whole-genome assembly for Amborella from a mixed data set of 454, Illumina, and Sanger bacterial artificial chromosome (BAC)–end sequences, evaluated the assembly using fluorescence in situ hybridization (FISH), and improved contiguity using whole-genome mapping. FISH has broad utility (4), but has not been used in de novo genome assembly. Likewise, whole-genome mapping has been used to assemble bacterial genomes (5, 6), but has only recently been applied to complex genomes of model organisms (7, 8) to assist with scaffolding and correction of well-advanced genome assemblies.

More than 23 Gb of quality-filtered (9) DNA sequence comprising single-end (SE) 454-FLX, SE 454-FLX+ reads, 11-kb paired-end (PE) 454-FLX, 3-kb PE Illumina HiSeq, and Sanger-sequenced BAC-end reads (10) were combined and assembled (table S1). Assembly (9) resulted in 5745 scaffolds totaling 706 Mb with a mean scaffold size of 123 kb and an N50 size of 4.9 Mb; the N90 scaffold metrics indicate that 90% of our assembled sequence resides within 155 scaffolds greater than 1.1 Mb in length (table S5).
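The N50 and N90 statistics quoted above can be illustrated with a minimal sketch. It assumes the usual definition (the length of the shortest scaffold in the smallest set of longest scaffolds that together reach x% of the assembled bases); the function name `nx` and the toy lengths are hypothetical and are not taken from the Amborella assembly.

```python
def nx(lengths, x):
    """Return (Nx, Lx) for a list of scaffold lengths.

    Nx is the length of the scaffold at which the running total of
    lengths, taken from longest to shortest, first reaches x percent
    of the assembly. Lx is the number of scaffolds needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length, count
    raise ValueError("lengths must contain at least one scaffold")


# Toy scaffold lengths in arbitrary units (hypothetical, 30 units in total).
toy_lengths = [9, 7, 5, 3, 2, 2, 1, 1]

print(nx(toy_lengths, 50))  # (7, 2)
print(nx(toy_lengths, 90))  # (2, 6)
```

In the same sense, the report's statement that 90% of the assembled sequence resides within 155 scaffolds corresponds to an L90 of 155.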

Flow cytometry was used to estimate the size of the Amborella genome at ~870 Mb (11), while our sequence-based size assessments (9, 10, 12, 13) suggest that the Amborella genome size is closer to 748 Mb. Our high-quality sequence represents an average depth of coverage of ~31×, and the assembly covers >94% of the genome.
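The depth and completeness figures above follow from simple ratios, sketched here using only the rounded values quoted in the text (23 Gb of sequence, a 706-Mb assembly, and the 748-Mb and ~870-Mb genome size estimates). The function names are illustrative, and because "more than 23 Gb" is treated as exactly 23 Gb, the depth is a lower bound.

```python
MB = 1_000_000


def depth_of_coverage(total_bases, genome_size):
    """Average number of times each genomic base is sequenced."""
    return total_bases / genome_size


def fraction_assembled(assembly_size, genome_size):
    """Share of the estimated genome represented in the assembly."""
    return assembly_size / genome_size


sequenced = 23_000 * MB        # "more than 23 Gb", treated as 23 Gb
assembly = 706 * MB            # total scaffold length
seq_based_genome = 748 * MB    # sequence-based size estimate
flow_genome = 870 * MB         # flow cytometry estimate

depth = depth_of_coverage(sequenced, seq_based_genome)
frac_seq = fraction_assembled(assembly, seq_based_genome)
frac_flow = fraction_assembled(assembly, flow_genome)

print(f"depth ~{depth:.1f}x")                      # depth ~30.7x
print(f"vs sequence-based size: {frac_seq:.1%}")   # 94.4%
print(f"vs flow cytometry size: {frac_flow:.1%}")  # 81.1%
```

The >94% figure therefore depends on accepting the sequence-based estimate; against the flow cytometry estimate the same assembly would cover roughly 81% of the genome.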

Long contig and scaffold assemblies are required to understand genome structure, enable gene identification, and support subsequent comparative, structural, and population genomics studies. We sought long continuous stretches of assembled sequence that represent all, or a major fraction of, the Amborella genome. Coverage of two finished BAC contigs (10) by assembled sequence contigs suggests that these two regions were faithfully represented in the assembly (figs. S9 and S10) (9), and all 155 of our N90 scaffolds incorporate physically mapped BAC-end sequences.

The accuracy of the genome assembly was further assessed by FISH analysis (9). BACs assembled in 104 scaffolds containing 430 Mb (68%) of the genome assembly were cytogenetically localized by FISH to assess scaffold integrity (Fig. 1, fig. S11, and table S8). This analysis confirmed contiguity across major regions (56%) of 66 scaffolds containing 306 Mb (44%) of the genome assembly. Notably, co-assembled BACs that were cytogenetically mapped to different chromosomes indicated potential misassemblies in only two scaffolds (table S8). A karyotyping cocktail differentially labeled all 13 Amborella chromosome pairs and anchored major sections of 35 FISH-validated scaffolds to the karyotype (Fig. 2). In total, the cytogenetic cocktail directly placed 101 Mb (58%) of scaffolds with a total length of 176 Mb (~25%) of the assembly onto chromosomes (table S8). However, multiple BACs from 37 scaffolds containing 154 Mb produced inconclusive genome-wide centromeric signals. Sequence alignments associated with the promiscuous probes indicate extensive sequence similarity and the presence of tandem repeats associated with the centromeric regions of the Amborella chromosomes.
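The misassembly test described above (co-assembled BACs that map cytogenetically to different chromosomes) can be sketched as a simple grouping step. This is only an illustration of the logic, not the authors' pipeline; the scaffold names, BAC names, and chromosome numbers below are hypothetical.

```python
from collections import defaultdict


def flag_conflicts(bac_hits):
    """Return scaffolds whose BACs localize to more than one chromosome.

    bac_hits is an iterable of (scaffold, bac, chromosome) tuples, one
    per BAC probe with an unambiguous FISH signal.
    """
    chromosomes = defaultdict(set)
    for scaffold, _bac, chromosome in bac_hits:
        chromosomes[scaffold].add(chromosome)
    # A scaffold is suspect only if its BACs disagree on the chromosome.
    return {s: sorted(c) for s, c in chromosomes.items() if len(c) > 1}


# Hypothetical FISH results: scaffold_B has BACs on two chromosomes.
hits = [
    ("scaffold_A", "bac_1", 3),
    ("scaffold_A", "bac_2", 3),
    ("scaffold_B", "bac_3", 5),
    ("scaffold_B", "bac_4", 2),
    ("scaffold_C", "bac_5", 11),
]

conflicts = flag_conflicts(hits)
print(conflicts)  # {'scaffold_B': [2, 5]}
```

Probes that give genome-wide centromeric signals, like those noted above, would need to be excluded before this step, because they do not assign a BAC to a single chromosome.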

Fig. 1 FISH support of scaffold 7.

Two BACs, AT_SBa0003A05 (green) and AT_SBa0003H23 (red), localize 8.2 Mb apart within the assembly scaffold 7 (9.5 Mb). Their colocalized FISH signals unambiguously support the assembly contained between their positional coordinates. Secondary green signals represent repetitive elements in AT_SBa0003A05.

Fig. 2 FISH karyotype for A. trichopoda.

BAC probes differentially label all chromosome pairs (one pair distinguished by the lack of fluorescent signal) and anchor 35 scaffolds (176 Mb) to the karyotype. Uniquely labeled chromosomes in the cytogenetic preparation (center) are arranged into homologous pairs (upper panel). Chromosomal assignments and sizes of cytogenetically localized scaffolds are tabulated.

Despite the extensive contiguity of the current draft assembly, gaps remain. Rather than constructing additional PE libraries to improve contiguity, a gap closure method based on whole-genome (formerly optical) mapping technology was undertaken in collaboration with OpGen, Inc. (Gaithersburg, MD, USA). Whole-genome mapping (14, 15) permits assembly of whole-genome restriction endonuclease maps by digesting immobilized DNA molecules and determining the size and order of fragments.
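To compare sequence scaffolds with such restriction maps, each scaffold first has to be converted into the ordered fragment sizes it would yield on digestion. The following is a minimal sketch of that in silico digest. The enzyme is an assumption made for illustration (EcoRI, which recognizes GAATTC and cuts after the first G); the text does not state which enzyme was used, and the toy sequence is invented.

```python
def in_silico_digest(sequence, site, cut_offset):
    """Return ordered fragment lengths for a linear sequence.

    site is the recognition sequence and cut_offset is the position of
    the cut within it (1 for EcoRI, G^AATTC). The site is assumed to be
    palindromic, so only one strand needs to be scanned.
    """
    sequence = sequence.upper()
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    # Differences between consecutive cut positions are fragment lengths.
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


# Toy 21-base sequence with two EcoRI sites (hypothetical).
toy_sequence = "AAA" + "GAATTC" + "CCCC" + "GAATTC" + "TT"

fragments = in_silico_digest(toy_sequence, site="GAATTC", cut_offset=1)
print(fragments)  # [4, 10, 7]
```

The first and last fragments end at the scaffold boundary rather than at a restriction site, so only the interior fragments are directly comparable with a measured map.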

We compared assembled scaffold sequences to single-molecule restriction maps generated with Amborella genomic fragments to identify potential joins and produce superscaffolds (9) (table S10). This improved our original assembly by a 2× increase in both N50 (4.9 to 9.3 Mb) and N90 (1.2 to 2.9 Mb) (table S5). Thirty joins were confirmed through a new assembly constructed after adding additional 454 PE sequences and improving data filtering, and 20 joins were confirmed by FISH (9) (table S10).
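The idea of placing scaffolds on a restriction map to find joins can be sketched as follows. This is a deliberately simplified matcher, not the method used in the study: it assumes every cut site is observed and only allows a relative sizing error, whereas production map aligners also handle missed and false cuts. All fragment sizes (in kb) and names are hypothetical.

```python
def find_placements(map_fragments, scaffold_fragments, tolerance=0.1):
    """Return start indices where the scaffold fits the map.

    scaffold_fragments should hold interior in silico fragment sizes.
    A placement requires every consecutive map fragment to agree with
    the expected size within the given relative tolerance.
    """
    n, m = len(map_fragments), len(scaffold_fragments)
    if m == 0:
        return []
    placements = []
    for start in range(n - m + 1):
        window = map_fragments[start:start + m]
        if all(abs(obs - exp) <= tolerance * exp
               for obs, exp in zip(window, scaffold_fragments)):
            placements.append(start)
    return placements


# Hypothetical whole-genome map and two scaffolds (sizes in kb).
genome_map = [12.0, 30.5, 8.1, 19.8, 45.2, 7.0, 22.4, 15.1]
scaffold_a = [30, 8, 20]
scaffold_b = [7, 22, 15]

place_a = find_placements(genome_map, scaffold_a)
place_b = find_placements(genome_map, scaffold_b)
print(place_a, place_b)  # [1] [5]

# Map fragments lying between the two placements estimate the gap.
a_end = place_a[0] + len(scaffold_a)
gap_kb = sum(genome_map[a_end:place_b[0]])
print(gap_kb)  # 45.2
```

Two scaffolds that place uniquely on the same map, in a consistent order, become a candidate join with an estimated gap size. Such candidates still need independent confirmation, which in the report came from a new assembly and from FISH.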

The Amborella assembly, as well as several recent plant whole-genome draft sequences (13, 16, 17), benefited from available collections of BAC-end sequences (10) that serve as very long (>150 kb) PE libraries. However, BAC clone resources are expensive and time-consuming to construct and evaluate, as is end-sequencing by low-throughput and high-cost Sanger sequencing. Therefore, as improvement in NGS technologies enables more nonmodel eukaryote whole-genome sequence projects, it is important to identify methods that permit long, accurate assemblies in the absence of large-insert clone resources. Superscaffolding facilitated by Genome-Builder can substitute for BAC-end sequences, as illustrated by our construction of an Amborella assembly (9) (tables S11 to S13). Although BACs were used as FISH probes in this study, they are not required for cytogenetic validation of an assembly; alternatively, probes could be developed using polymerase chain reaction amplification. Thus, sequencing is no longer a limiting factor, and the greatest challenge for many organisms will be accurate and highly contiguous genome assembly. A combination of FISH and whole-genome mapping, in concert with sequence filtering and assembly strategies described here, should prove successful even for genomes with a more complex repeat structure than that of Amborella.

Supplementary Materials

www.sciencemag.org/content/342/6165/1516/suppl/DC1

Materials and Methods

Figs. S1 to S13

Tables S1 to S13

References (18–37)

References and Notes

  1. Materials and methods are available as supplementary materials on Science Online.
  2. Acknowledgments: Funded by grant 0922742 from the NSF-PGRP: National Science Foundation Plant Genome Research Program to V.A.A., W.B.B., C.W.D., J.L.M., S.R., D.E.S., and P.S.S. Sequence data are available from the National Center for Biotechnology Information (NCBI) Short Read Archive (SRA) under accession PRJNA212863 and NCBI BioProject ID 212863. Assemblies and additional data are available at http://www.amborella.org; FISH data and probe details are available at http://app.tolkin.org/projects/88. We acknowledge R. Winer (Roche) for technical assistance.