Report

Human Genome Sequencing Using Unchained Base Reads on Self-Assembling DNA Nanoarrays

See allHide authors and affiliations

Science  01 Jan 2010:
Vol. 327, Issue 5961, pp. 78-81
DOI: 10.1126/science.1181498

Abstract

Genome sequencing of large numbers of individuals promises to advance the understanding, treatment, and prevention of human diseases, among other applications. We describe a genome sequencing platform that achieves efficient imaging and low reagent consumption with combinatorial probe anchor ligation chemistry to independently assay each base from patterned nanoarrays of self-assembling DNA nanoballs. We sequenced three human genomes with this platform, generating an average of 45- to 87-fold coverage per genome and identifying 3.2 to 4.5 million sequence variants per genome. Validation of one genome data set demonstrates a sequence accuracy of about 1 false variant per 100 kilobases. The high accuracy, affordable cost of $4400 for sequencing consumables, and scalability of this platform enable complete human genome sequencing for the detection of rare variants in large-scale genetic studies. Genotyping technologies have enabled the routine assessment of common genetic variants at up to a million sites across the genome in thousands of individuals (1) and have increased our understanding of human genetic diversity and its biological and medical impact. Whole-genome sequencing costs have dropped from the >$100 million cost of the first human genomes (2, 3) to the point where individual labs have generated genome sequences in a matter of months for material costs of as low as 48,000 (412) (table S5). Sequencing technologies, which use a variety of genomic microarray construction methodologies and sequencing chemistries (1332), can determine human genetic diversity over an entire genome and identify common as well as rare single-nucleotide polymorphisms (SNPs), insertions, and deletions. Despite these advances, improvements are still needed to enable the cost-effective characterization of the many hundreds of genomes required for genetic studies of complex diseases and for personalized disease prevention, prognosis, and treatment. We generated sequencing substrates [Fig. 1A and supporting online material (SOM)] by means of genomic DNA (gDNA) fragmentation and recursive cutting with type IIS restriction enzymes and directional adapter insertion (Fig. 1B and fig. S1). The resulting circles were then replicated with Phi29 polymerase (RCR) (33). Using a controlled, synchronized synthesis, we obtained hundreds of tandem copies of the sequencing substrate in palindrome-promoted coils of single-stranded DNA, referred to as DNA nanoballs (DNBs) (Fig. 1C). DNBs were adsorbed onto photolithographically etched, surface-modified (SOM) 25- by 75-mm silicon substrates with grid-patterned arrays of ~300-nm spots for DNB binding (Fig. 1C). The use of patterned arrays increased DNA content per array and image information density relative to random genomic DNA arrays (6, 9, 11, 14, 28). High-accuracy combinatorial probe anchor ligation (cPAL) sequencing chemistry was then used to independently read up to 10 bases adjacent to each of eight anchor sites (Fig. 1D), resulting in a total of 31- to 35-base mate-paired reads (62 to 70 bases per DNB). cPAL is based on unchained hybridization and ligation technology (15, 27, 28, 31), previously used to read 6 to 7 bases from each of four adapter sites (26 total bases) (28), here extended using degenerate anchors to read up to 10 bases adjacent to each of the eight inserted adapter sites (Fig. 1D, right) with similar accuracy at all read positions (fig. S3). This increased read length is essential for human genome sequencing. Cell lines derived from two individuals previously characterized by the HapMap Project (34), a Caucasian male of European descent (NA07022) and a Yoruban female (NA19240), were sequenced. NA19240 was selected to allow for a comparison of our sequence to the sequence of the same genome currently being assembled by the 1000 Genome Project. In addition, lymphoblast DNA from a Personal Genome Project Caucasian male sample, PGP1 (NA20431) was sequenced because substantial data are available for biological comparisons (3537). Automated cluster analysis of the four-dimensional intensity data produced raw base reads and associated raw base scores (SOM). We mapped these sequence reads to the human genome reference assembly with a custom alignment algorithm that accommodates our read structure (fig. S4), resulting in between 124 and 241 Gb mapped and an overall genome coverage of 45- to 87-fold per genome. To assess representational biases during circle construction, we assayed genomic DNA and intermediate steps in the library construction process by quantitative polymerase chain reaction (QPCR) (fig. S2). This and mapped coverage showed a substantial deviation from Poisson expectation with excesses of both high and low coverage regions (fig. S5), but only a few percent of bases have coverage insufficient for assembly (Table 1). Much of this coverage bias is accounted for by local GC content in NA07022, a bias that was significantly reduced by improved adapter ligation and PCR conditions in NA19240 (fig. S5); the fraction of the genome with less than 15-fold coverage was accordingly reduced from 11% in NA07022 to 6.4% in NA19240, despite the latter having 25% less total coverage (Table 1). Table 1 Summary information from mapping and assembly of three genomes. All variations are with respect to the National Center for Biotechnology Information (NCBI) version 36 human genome reference assembly. Novel variations were ascertained by comparison to dbSNP [JDW, release 126; NA18507 (6), release 128; all other genomes, release 129]. NA18507 and NA19240 are Yoruban HapMap samples, which may explain the number of SNPs and novelty rates. In partially called regions of the genome, one allele could be called confidently but not the other. The high call rate in NA19240 reflects reduced library bias (fig. S5). View this table: Discordance with respect to the reference genome in uniquely mapping reads from NA07022 was 2.1% (range 1.4% to 3.3% per slide). However, considering only the highest scoring 85% of base calls reduced the raw read discordance to 0.47%, including about 0.1% of true variant positions. Mapped reads were assembled into a best-fit, diploid sequence with a custom software suite employing both Bayesian and de Bruijn graph techniques (SOM). This process yielded diploid reference, variant, or no-calls at each genomic location with associated variant quality scores. Confident diploid calls were made for 86% to 95% of the reference genome (Table 1), approaching the 98% that can be reconstructed in simulations. The 2% that is not reconstructed in simulations is composed of repeats that are longer than the ~400 base inserts used here and of high enough identity to prevent attribution of mappings to specific repeat copies. Longer mate-pair inserts minimize this limitation (6, 9). Similar limitations affect other short-read technologies. We identified a range of 2.91 to 4.04 million SNPs with respect to the reference genome, 81% to 90% of which are reported in the database dbSNP, as well as short indels and block substitutions (Table 1 and table S6). Because of the use of local de novo assembly, indels were detected in sizes ranging up to 50 bp. As expected, indels in coding regions tend to occur in multiples of length 3, indicating the possible selection of minimally impacting variants in coding regions (fig. S6). As an initial test of sequence accuracy, we compared our called SNPs with the HapMap phase I/II SNP genotypes reported for NA07022 (1). We fully called 94% of these positions with an overall concordance of 99.15% (Table 2) (the remaining 6% of positions were either half-called or not called). Furthermore, we fully called 96% of the Infinium (Illumina, San Diego, CA) subset of the HapMap SNPs with an overall concordance rate of 99.88%, reflecting the higher reported accuracy of these genotypes (34). Similar concordance rates with available SNP genotypes were observed in NA19240 (with a call rate of over 98%) and NA20431 (table S7). Table 2 Concordance with genotypes for NA07022 generated by the HapMap Project (release 24) and the highest quality Infinium assay subset of those genotypes, as well as genotyping on Illumina Infinium 1M assay. Discordances with reported HapMap Infinium genotypes were verified by Sanger sequencing (SOM). View this table: We further characterized 134 of the 168 calls that were discordant with Infinium loci and Sanger sequencing of PCR products in NA07022, demonstrating that 55% of these discordances are errors in the reported HapMap genotypes (Table 2). The relationship between detection rate and read depth for about 1M Infinium HD SNPs that we subsequently genotyped in NA07022 shows that coverage of 30-fold at a position is sufficient to detect 90% of SNPs at heterozygous loci and 97% of SNPs at homozygous loci (fig. S5). Because the whole-genome false-positive rate cannot be accurately estimated from known SNP loci, we tested a random subset of novel nonsynonymous variants in NA07022, a category that is enriched for errors (10). We extrapolated error rates from the targeted sequencing of 291 such loci and estimated the false-positive rate at about one variant per 100 kb, including <6.1 substitution-, <3.0 short deletion-, <3.9 short insertion- and <3.1 block variants per Mb (Table 3 and table S8). Table 3 False-positive rates and false discovery rates (FDRs) were calculated for the entire set of variations called in NA07022 by extrapolating the heterozygous (Het) FDRs calculated from comparative Sanger sequencing of 291 selected novel variants (table S8) to all variants. This is a conservative approach (detailed in SOM). The total number of all types of false-positive variants is estimated at 7.5 to 16.1 per Mb. View this table: Aberrant mate-pair gaps may indicate the presence of length-altering structural variants and rearrangements with respect to the reference genome. A total of 2,126 clusters of such anomalous mate-pairs were identified in NA07022. We performed PCR-based confirmation of one such heterozygous 1500-base deletion (fig. S7). More than half of the clusters are consistent in size with the addition or deletion of a single Alu repeat element. Some applications of complete genome sequencing may benefit from maximal discovery rates, even at the cost of additional false positives, whereas for others, a lower discovery rate and lower false-positive rate may be preferable. We used the variant quality score to tune call rate and accuracy (fig. S8). Additionally, novelty rate (relative to the dbSNP) is also a function of variant quality score (fig. S9). We processed the NA07022 data with Trait-o-matic (Scalable Computer Experts, Somerville, Massachusetts, and Church Laboratory, Harvard Medical School, Cambridge, Massachusetts) automated annotation software [as in (12)], yielding 1159 annotated variants, 14 of which may have disease implications (table S10). Because the DNB sequencing substrates are produced by rolling-circle replication (33) in a uniform-temperature, solution-phase reaction with high template concentrations (>20 billion per ml), this system avoids substantial selection bottlenecks and nonclonal DNBs. This circumvents the stochastic inefficiencies of approaches that require precise titration of template concentrations for in situ clonal amplification in emulsion (9, 14, 29) or bridge PCR (6, 19). Our patterned arrays include high-occupancy and high-density nanoarrays self-assembled on photolithography-patterned, solid-phase substrates through electrostatic adsorption of solution-phase DNBs and yield a high proportion of informative pixels (site occupancies >95%) (fig. S12A) compared with random-position DNA arrays. This results in several hundred reaction sites in the compact (~300-nm diameter) DNB that produce bright signals useful for rapid imaging of the sequences (SOM). Such small DNBs also allow for high-density arrays. The data set reported herein was generated with arrays with ~350 million spots at a pitch of 1.29 μm. Such a spot density and higher ones achieved in proof-of-concept experiments (fig. S12B) result in high image efficiency and reduced reagent consumption that enable high sequencing throughput per instrument critical for high-scale human genome sequencing for research and clinical applications. Both sequencing by synthesis (SBS) and sequencing by ligation (SBL) use chained reads, wherein the substrate for cycle N + 1 is dependent on the product of cycle N; consequently, errors may accumulate over multiple cycles and data quality may be affected by errors (especially incomplete extensions) occurring in previous cycles. Thus, reactions need to be driven to near completion with high concentrations of expensive high-purity labeled substrate molecules and enzymes. The independent, unchained nature of cPAL avoids error accumulation and tolerates low-quality bases in otherwise high-quality reads, thereby decreasing reagent costs. The average sequencing consumables cost for these three genomes was under4400 (table S5). The raw base and variant call accuracy achieved compares favorably with other reported human genome sequences (212).

Because the sequencing substrates are produced by a DNA engineering process based on modified nick-translation for directional adapter insertion (SOM), we obtained over 90% yield in adapter ligation and low chimeric rates of about 1% (SOM). DNA molecules with an inserted adapter are further enriched with PCR (SOM). This recursive process can be implemented in batches of 96 samples and extended by inserting additional adapters to read 120 bases or more per DNB (fig. S10). The current read length is comparable to other massively parallel sequencing technologies (612).

The sequence data reported here achieve sufficient quality and accuracy for complete genome association studies, the identification of potentially rare variants associated with disease or therapeutic treatments, and the identification of somatic mutations. The low cost of consumables and efficient imaging may enable studies of hundreds of individuals. The higher accuracy and completeness required for clinical diagnostic applications provides an incentive for continued improvement of this and other technologies.

Supporting Online Material

www.sciencemag.org/cgi/content/full/1181498/DC1

Materials and Methods

Figs. S1 to S12

Tables S1 to S9

References

• These authors contributed equally to this work.

• Present address: Ion Torrent Systems, San Francisco, CA 94158, USA.

• § Present address: San Diego State University, San Diego, CA 92115, USA.