Genome-Wide Detection of Polymorphisms at Nucleotide Resolution with a Single DNA Microarray


Science  31 Mar 2006:
Vol. 311, Issue 5769, pp. 1932-1936
DOI: 10.1126/science.1123726


A central challenge of genomics is to detect, simply and inexpensively, all differences in sequence among the genomes of individual members of a species. We devised a system to detect all single-nucleotide differences between genomes with the use of data from a single hybridization to a whole-genome DNA microarray. This allowed us to detect a variety of spontaneous single–base pair substitutions, insertions, and deletions, and most (>90%) of the ∼30,000 known single-nucleotide polymorphisms between two Saccharomyces cerevisiae strains. We applied this approach to elucidate the genetic basis of phenotypic variants and to identify the small number of single–base pair changes accumulated during experimental evolution of yeast.

Despite the ongoing development of DNA sequencing technology (1, 2), it remains technically and financially infeasible for individual laboratories to sequence whole genomes. Moreover, for global comparisons of genomes within species, where one expects a relatively small number of sequence differences throughout the genome, determining the entire sequence is unnecessary. In such cases, it is sufficient to assess the extent and location of sequence variation in a manner analogous to comparative genomic hybridization, which compares copy number changes between closely related genomes at genic resolution (3).

DNA microarrays of short oligonucleotides designed to interrogate each base individually (i.e., resequencing arrays) have been applied to the analysis of individual human genes (4) and small genomes such as the human mitochondrial (5) and the SARS coronavirus (6) genomes. However, extension of this approach to whole genomes of most organisms is currently impractical because of the large number of probes required for complete coverage.

An alternative approach uses microarrays that detect mismatches, exploiting the fact that hybridization to a short oligonucleotide is quantitatively sensitive to the number and position of mismatches (7). Sequence-level differences are detected, without allele-specific probes, by comparing hybridization intensities of individual features on the microarray [referred to as single-feature polymorphisms (SFPs) (8)]. This method has been successfully applied to studies of genetic diversity (9–11) and gene mapping (12–17). Until recently, comprehensive detection of single–base pair differences has been limited by probe density across the genome, which is typically a few oligonucleotides per gene. Even complete single-copy coverage of the genome is unlikely to be sufficient for finding all mutations, because statistically detectable decreases in hybridization intensity usually require that a variant nucleotide fall within the central 15 bases of a 25-base probe (18).

We used high-density Affymetrix yeast tiling microarrays (YTMs) with overlapping 25-nucleotide oligomers spaced an average of 5 base pairs (bp) apart to provide complete and ∼5-fold redundant coverage of the entire S. cerevisiae genome. This array design was previously used to discover novel expressed sequences and to precisely map sites of transcription in humans (19). This design provides five to seven measurements of a given nucleotide's effect on hybridization efficiency, which we exploited to predict the presence and location of SNPs and deletion breakpoints throughout the entire yeast genome.
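The redundancy of a tiled design can be illustrated with a minimal sketch. It assumes a perfectly regular tiling (one 25-base probe every 5 bp), which is a simplification: the spacing on the actual array is only 5 bp on average, which is presumably why a given nucleotide is covered five to seven times rather than exactly five. The function name and coordinates are hypothetical.

```python
PROBE_LEN = 25  # probe length stated in the text

def covering_probes(site, probe_starts, probe_len=PROBE_LEN):
    """Return (start, offset) for every probe whose span includes `site`."""
    return [(s, site - s) for s in probe_starts if 0 <= site - s < probe_len]

# Hypothetical regular tiling: one probe every 5 bp along a 1-kb stretch.
probe_starts = list(range(0, 1000, 5))
hits = covering_probes(500, probe_starts)

# Offsets 5..19 (0-based) are the central 15 bases of a 25-base probe,
# where a mismatch is most likely to cause a detectable intensity drop.
central = [(s, off) for s, off in hits if 5 <= off <= 19]

print(len(hits), len(central))
```

Under this idealized spacing every site is covered by five probes, and in three of them it falls within the central 15 bases.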

Each YTM has ∼2.6 million perfect match (PM) probes and ∼2.6 million corresponding mismatch (MM) probes. We modeled the decrease in PM probe intensity caused by a single SNP as a function of the SNP's position within the probe, the probe's GC content, the nucleotide sequence surrounding the SNP, and the hybridization intensity obtained using a nonpolymorphic reference (S288C) genome [strain FY3 (20)]. To fit the model, we used hybridization data for a training set of nearly 25,000 high-quality SNPs in strain RM11-1a, all identified by direct comparison of the genomic sequences (20). The model predicts the intensity of a probe in the presence of a specified SNP (20) (figs. S1 and S2) and is used in our algorithm, SNPscanner, which calculates the log of the likelihood ratio (the “prediction signal”) for the presence of a SNP at each nucleotide position in the genome using measurements from all probes that cover that site. By scanning the entire genome, we identify SNPs as regions of elevated signal in which the position of the peak value is considered the predicted polymorphic site.
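The scanning logic described above can be sketched as follows. This is not the SNPscanner model itself, whose fitted form is given only in the supporting material (20). The sketch assumes Gaussian noise on log intensities, a natural-log likelihood ratio, and a toy intensity drop that depends only on the position of the mismatch within the probe; the real model also uses GC content, flanking sequence, and reference intensity. All function names and parameter values are hypothetical.

```python
PROBE_LEN = 25

def expected_drop(offset, max_drop=1.0):
    """Toy log-intensity drop for a mismatch at `offset` (0..24) in a probe.
    Assumed shape: largest at the probe centre, zero at the ends."""
    centre = (PROBE_LEN - 1) / 2
    return max_drop * (1 - abs(offset - centre) / centre)

def prediction_signal(site, probes, sigma=0.3):
    """Sum of per-probe log-likelihood ratios (SNP vs. no SNP) at `site`.
    `probes` holds (start, reference_log_intensity, observed_log_intensity)."""
    total = 0.0
    for start, ref, obs in probes:
        offset = site - start
        if not 0 <= offset < PROBE_LEN:
            continue  # this probe does not cover the site
        mu_snp = ref - expected_drop(offset)
        # log N(obs; mu_snp, sigma) - log N(obs; ref, sigma)
        total += ((obs - ref) ** 2 - (obs - mu_snp) ** 2) / (2 * sigma ** 2)
    return total

def call_peaks(signal_by_site, threshold):
    """Group consecutive sites with positive signal and report the peak
    site of each region whose peak value reaches `threshold`."""
    regions, current = [], []
    for site in sorted(signal_by_site):
        positive = signal_by_site[site] > 0
        contiguous = current and site == current[-1] + 1
        if positive and (contiguous or not current):
            current.append(site)
        else:
            if current:
                regions.append(current)
            current = [site] if positive else []
    if current:
        regions.append(current)
    peaks = [max(region, key=signal_by_site.get) for region in regions]
    return [p for p in peaks if signal_by_site[p] >= threshold]

# Simulated noise-free data: a SNP at site 113 covered by five probes.
true_site = 113
probes = [(s, 10.0, 10.0 - expected_drop(true_site - s))
          for s in (90, 95, 100, 105, 110)]
signals = {site: prediction_signal(site, probes) for site in range(85, 140)}
peaks = call_peaks(signals, threshold=5)
print(peaks)
```

Because each overlapping probe sees the mismatch at a different offset, only the true site explains all five intensity drops at once, which is why the summed signal peaks there.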

We tested the performance of SNPscanner on a set of 981 high-quality SNPs from RM11-1a that were not included in the training set. We assessed the false-positive rate by using SNPscanner to predict SNPs from an independent hybridization of the reference strain, where no true polymorphisms are expected. At a prediction signal of 1, we detected 915 (93.3%) known SNPs in RM11-1a and called 177 false positives in the reference strain (fig. S3). By increasing the prediction signal to 5 and applying a heuristic filter (20), we eliminated all false positives and retained 77.5% (760) of real SNPs. Analysis of this set of correctly predicted SNPs showed the sequence-confirmed SNP to be within 2 bp of the predicted site 87.1% of the time (20).
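The detection rates quoted above follow directly from the reported counts. A short check (the function name is ours, not part of SNPscanner):

```python
def sensitivity(detected, total):
    """Fraction of known SNPs that were predicted, as a percentage."""
    return 100.0 * detected / total

held_out_snps = 981          # RM11-1a SNPs withheld from training
detected_at_signal_1 = 915   # before filtering; 177 false positives
detected_at_signal_5 = 760   # after the heuristic filter; 0 false positives

rate_1 = round(sensitivity(detected_at_signal_1, held_out_snps), 1)
rate_5 = round(sensitivity(detected_at_signal_5, held_out_snps), 1)
print(rate_1, rate_5)
```

Raising the threshold from 1 to 5 therefore trades roughly 16 percentage points of sensitivity for the removal of all 177 false positives seen in the reference hybridization.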

To test our ability to predict a large number of SNPs, we analyzed the highly diverged sequenced strain YJM789, originally recovered from an AIDS patient (21). We selected a set of 30,303 sequence-confirmed SNPs in YJM789 that were isolated from each other by at least 25 bp and were covered by probes on the YTM. Analysis of a single hybridization with SNPscanner yielded 28,737 (94.8%) correctly predicted SNPs at a prediction signal threshold of 1 (Fig. 1). At a prediction signal threshold of 5, we detected 86.9% of known SNPs and called only eight false positives in a similar analysis of the reference genome. These false positives were readily excluded by our heuristic filter.
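The isolation criterion used to select the YJM789 test set can be sketched as below. We assume that "isolated by at least 25 bp" means that both nearest neighbors on the same chromosome are at least 25 bp away; the positions are hypothetical and the function is an illustration, not the authors' selection code.

```python
def isolated_snps(positions, min_gap=25):
    """Keep positions whose nearest neighbours are at least `min_gap` bp away.
    `positions` are SNP coordinates on a single chromosome."""
    pos = sorted(positions)
    keep = []
    for i, p in enumerate(pos):
        left_ok = i == 0 or p - pos[i - 1] >= min_gap
        right_ok = i == len(pos) - 1 or pos[i + 1] - p >= min_gap
        if left_ok and right_ok:
            keep.append(p)
    return keep

# Hypothetical coordinates: 100 and 110 are too close to each other.
kept = isolated_snps([100, 110, 200, 225, 300])
print(kept)

# Detection rate reported in the text at a prediction signal of 1.
rate = round(100.0 * 28_737 / 30_303, 1)
print(rate)
```

With 25-base probes, a 25-bp gap ensures that no single probe overlaps two of the selected SNPs, so each probe's intensity drop can be attributed to one variant.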

Fig. 1.

Nucleotide-level comparison with a genome divergent from the sequenced reference genome. We applied our approach to test how many of 30,303 known SNPs in the yeast strain YJM789 we were able to detect. Numbers on the graph indicate prediction signal thresholds. On the basis of data from a single hybridization experiment, we were able to correctly identify as many as 28,737 SNPs at a prediction signal of 1. At prediction signals of >5, the number of false-positive predictions is reduced to 8 in a test of the reference genome and 86.9% of true positives are still predicted.

To test our ability to detect accurately a very small number of sequence differences that distinguish two genomes, we analyzed spontaneous mutants in the strain FY3. Independent clones from the same archival isolate were grown, and mutants in the CAN1, GAP1, and FCY1 genes were selected on plates containing canavanine sulfate, d-serine and d-histidine, or 5-fluorocytosine, respectively (20). For each mutant we hybridized total genomic DNA to a single YTM and analyzed the data with SNPscanner (Fig. 2, A and B). In each of four can1 mutants, we detected a single peak at the CAN1 locus that fulfilled our prediction criteria for a SNP (Fig. 2C). Amplification and sequencing of the CAN1 locus identified a single-base substitution in each of three mutants (31844G → T; 32064C → G; 32757G → C) and deletion of a single thymine in a run of four thymines in the fourth (32924ΔT). Although the prediction signal for this deletion was comparatively low, its detection is noteworthy because no insertions or deletions (indels) were included in the set of SNPs used to train the model.

Fig. 2.

SNPscanner accurately predicts SNPs in CAN1 for independent CanR mutants. (A) Multiple overlapping probes cover each nucleotide. A mutation at the site indicated in red perturbs hybridization of the sample to all probes. (B) The decrease in observed hybridization is used to estimate the log of the likelihood ratio of the presence of a polymorphism versus the absence of a polymorphism (the prediction signal). The presence of a SNP typically results in a region of positive prediction signal with a peak defined as the predicted SNP; for the confirmed mutation indicated in red text in (A), the entire sequence in green has a positive prediction signal shown in (B). (C) Using this approach, we detected single–base pair substitutions and a 1-bp deletion in four independent spontaneous CanR mutants isolated in a reference genome background (each color represents a different experiment). (D) SNPscanner accurately predicts mutations and SNPs in a nonreference genome. The results of nine independent CanR mutants in the CEN.PK strain background are shown for the entire CAN1 gene. We confirmed unique nucleotide substitutions for seven of the mutants, as well as a single-base insertion in one mutant and a single-base deletion in another. At common polymorphisms, indicated in red text, the SNPscanner signal is highly reproducible across multiple samples, allowing intrastrain comparisons of nonreference genomes.

Analysis of DNA from a mutant resistant to d-histidine and d-serine predicted a mutation in GAP1 (chromosome XI), which we confirmed as a 514919C → G substitution by sequence analysis (fig. S4). Similarly, we accurately predicted a mutation in FCY1 (chromosome XVI) for a mutant resistant to 5-fluorocytosine (677256C → T; fig. S5). Thus, we were able to detect a variety of single-base changes, including a single-base deletion, at several different loci in the genome and map them to within 2 bp of the verified substitution (table S1).

In addition to the anticipated mutations, our analysis yielded 12 to 414 additional predictions per genome (table S2). We identified two main causes of experimental noise: (i) false positives that fell within repetitive genomic features, such as retrotransposons and telomeres, which we subsequently excluded (table S1); and (ii) manufacturing defects in microarrays, which we computationally removed (20) (fig. S6). We ranked the remaining predictions on the basis of signal strength for each mutant and found the expected mutation in the top five predictions for all mutants except the one resulting from an indel (table S1). One SNP prediction (chromosome IV, position 548,350, sequence confirmed as 548348G → C) was common to all samples, suggesting an early mutation event that preceded later experiments (perhaps during single-colony purification from the archived stock culture). Sequence confirmation of high-quality predictions passing our filtering criteria identified additional unique mutations in three of the six spontaneous mutants (table S1). Thus, our algorithm is sufficiently sensitive to detect a small number of base changes that distinguish two genomes with no a priori knowledge of the variants' location. These results indicate that only a small number of mutations (<5) are associated with the generation of spontaneous drug resistance mutants.
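The filtering and ranking steps described above can be sketched as follows. The record layout, chromosome labels, coordinates, and repeat intervals are placeholders chosen for illustration; the actual repeat annotation and defect-removal procedure are described in the supporting material (20).

```python
def in_any_interval(pos, intervals):
    """True if `pos` lies inside any closed (start, end) interval."""
    return any(start <= pos <= end for start, end in intervals)

def rank_predictions(predictions, repeats):
    """Drop predictions inside repetitive features, then sort the rest
    by prediction signal, strongest first."""
    kept = [p for p in predictions
            if not in_any_interval(p["pos"], repeats.get(p["chrom"], []))]
    return sorted(kept, key=lambda p: p["signal"], reverse=True)

# Placeholder data: the strongest prediction falls inside a repeat.
predictions = [
    {"chrom": "chrA", "pos": 5_000, "signal": 12.0},
    {"chrom": "chrB", "pos": 32_000, "signal": 20.5},
    {"chrom": "chrC", "pos": 1_000, "signal": 30.0},
]
repeats = {"chrC": [(900, 1_100)]}  # e.g. a retrotransposon or telomere

ranked = rank_predictions(predictions, repeats)
print([(p["chrom"], p["pos"]) for p in ranked])
```

Sequencing candidates from the top of such a ranked list is what allowed the expected mutation to be found within the first five predictions for most mutants.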

We extended our approach to characterize the genome of the unsequenced laboratory yeast strain CEN.PK, commonly used in continuous culture experiments. CEN.PK shares ancestry with the reference strain, S288C, but some genes are absent in CEN.PK (22). We obtained a nucleotide-resolution comparison with the reference sequence by analyzing data with SNPscanner from a single hybridization of CEN.PK DNA. CEN.PK has a strikingly mosaic structure, with large portions of the genome sharing essentially complete sequence identity with FY3 interspersed with regions of sequence divergence and large deletions (fig. S7).

We investigated whether we could detect single mutations on a genome-wide scale in a nonreference genome; this was expected to be a more difficult statistical problem (20). We selected 10 spontaneous CanR mutants in the CEN.PK strain background and hybridized genomic DNA to the YTM. SNPscanner predictions correctly identified a mutation in 9 of 10 mutants, as well as three polymorphic sites present in the wild-type CEN.PK background (Fig. 2D). We confirmed the sequences of all CAN1 mutations and polymorphisms in the 10 CanR mutants. Whereas seven of the nine detected mutants had base substitutions in CAN1, one mutant contained a 1-bp insertion and another had a 1-bp deletion. All mutations were confirmed as lying within 7 bp of the predicted site (table S2).

SNPscanner prediction signals were highly reproducible across multiple experiments. We compared genome-wide SNP predictions for each CEN.PK can1 mutant to SNP predictions for CEN.PK wild-type DNA and applied our heuristic filter (20). This resulted in the prediction of fewer than 100 SNPs genome-wide that were not predicted to exist in wild-type CEN.PK for 9 of 10 mutants (table S2). In most cases, excluding those predictions that fell in repetitive regions further reduced the total number. By using this approach, we retained the identified can1 mutation for seven of nine mutants. We ranked the remaining predictions and observed that the sequence-confirmed mutation was in the top 10 predictions for all seven mutants. So even in this somewhat more challenging case, our system succeeded in detecting most of the single-nucleotide sequence differences and mapping them within a few nucleotides. Mutations predicted in our collection of CEN.PK and FY3 spontaneous mutants corresponded to 9 of the 12 possible base substitutions that resulted in six of the eight possible mismatches between probe and sample. Thus, our method can detect single–base pair indels in addition to virtually all base substitutions.
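The comparison of a mutant against its nonreference parent amounts to discarding predictions that are shared with the wild-type hybridization. A minimal sketch, assuming that two predictions are treated as the same site when they lie within a small tolerance of each other (the 2-bp tolerance, the coordinates, and the function name are our assumptions):

```python
def novel_predictions(mutant, wild_type, tol=2):
    """Return mutant predictions with no wild-type prediction on the same
    chromosome within `tol` bp. Predictions are (chrom, pos) tuples."""
    return [
        (chrom, pos) for chrom, pos in mutant
        if not any(chrom == wt_chrom and abs(pos - wt_pos) <= tol
                   for wt_chrom, wt_pos in wild_type)
    ]

# Placeholder coordinates: one background polymorphism is shared.
mutant = [("chrA", 31_846), ("chrB", 548_350), ("chrC", 1_200)]
wild_type = [("chrB", 548_349), ("chrC", 5_000)]

novel = novel_predictions(mutant, wild_type)
print(novel)
```

This works only because the prediction signal at background polymorphisms is reproducible across hybridizations; a variable signal would leave many parental SNPs among the apparent novel calls.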

We sought to apply our genome-wide mutation detection approach to biological questions that had remained refractory to traditional genetic techniques. We complemented a positional cloning project to predict and confirm mutations in AEP3, a peripheral mitochondrial inner membrane protein (23), that are causative of a growth defect on a nonfermentable carbon source (20) (table S3). We also used our method to determine the genetic basis of an unusual phenotype. Deletion of AMN1 results in up-regulation of daughter-specific genes and a nonclumpy growth phenotype (17). However, when we deleted AMN1 in an S288C-like strain (BY4716), we recovered a transformant that displayed low expression of daughter-specific genes and a clumpy phenotype (strain YEF1695). Deletion of AMN1 in YEF1695 was confirmed by sequence analysis, and independent deletions of AMN1 in both BY4716 and RM11-1a yielded the expected phenotype, which suggested the presence of a suppressor mutation in YEF1695. Preliminary genetic analysis tended to indicate the presence of an unlinked suppressor mutation. We hybridized genomic DNA from YEF1695 to the YTM. Analysis using SNPscanner confirmed the deletion of AMN1 (24) and identified an additional deletion on chromosome XII (Fig. 3A). The predicted deletion spans ∼1.5 kb and includes the majority of the coding region of ACE2. Subsequent sequence analysis confirmed that the predicted breakpoints were within 2 bp of the actual sites (Fig. 3).

Fig. 3.

Genome-wide mutation detection facilitates a genomic approach to genetics. Whole-genome analysis of a strain in which AMN1 was deleted but that failed to demonstrate the expected nonclumpy phenotype predicted the presence of a 1562-bp deletion (defined by the outermost peak values in prediction signal) in ACE2 (shown in its entirety). Sequence analysis confirmed the deletion, which spans 1558 bp and is flanked by the nucleotide sequence CTG, and mapped the breakpoints to nucleotides 404,621 to 406,179.
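The deletion sizes quoted in the legend can be checked with simple arithmetic. The sketch below also shows how a deletion interval might be taken from the outermost peaks of a cluster of predictions; the peak coordinates in that part are placeholders, because the predicted breakpoint positions themselves are not given in the text.

```python
def outermost_peaks(peak_positions):
    """Return the leftmost and rightmost peak positions of a cluster."""
    return min(peak_positions), max(peak_positions)

def span(left, right):
    """Length of the interval between two breakpoints."""
    return right - left

# Sequence-confirmed breakpoints from the figure legend.
confirmed_len = span(404_621, 406_179)

# Predicted length from the legend and its discrepancy with sequencing.
predicted_len = 1562
discrepancy = predicted_len - confirmed_len
print(confirmed_len, discrepancy)

# Placeholder peak cluster illustrating the outermost-peak rule.
left, right = outermost_peaks([100, 450, 900, 1_662])
print(span(left, right))
```

A total discrepancy of 4 bp is consistent with each predicted breakpoint lying within 2 bp of the sequenced site.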

The deletion of ACE2 provides a plausible explanation for both aspects of the aberrant phenotype of YEF1695. ACE2 encodes a transcription factor that is thought to drive the transcription of genes with daughter-specific expression (25). Its absence in YEF1695 probably causes a low expression of the daughter cell–specific genes, some of which are required for cell separation after budding (e.g., CTS1, which encodes chitinase). Moreover, deletion of ACE2 alone results in a clumpy phenotype (26), and clumpiness segregates with the ACE2 locus in a cross between YEF1695 and RM11Δamn1 (24).

Previous studies have shown the occurrence of characteristic gene expression patterns (27) and large-scale gene duplication and deletion (28) in yeast cultures that are experimentally evolved under a nutrient-limiting condition. However, the extent and nature of nucleotide changes that occur during this process have remained completely unknown. We sought to assess the degree of sequence variation that had accumulated in a strain of yeast subjected to experimental evolution under sulfur limitation in continuous culture. We compared the SNPscanner signals obtained from DNA of the ancestral strain, CEN.PK, to those signals obtained from DNA of two clones from the same population that had undergone experimental evolution under sulfur limitation for 63 (DBY11130) and 123 (DBY11131) generations. We compared our set of predictions to those made for CEN.PK CanR mutants to exclude common predictions that were the result of systematic error. SNP predictions that fell within repetitive regions were considered to be unreliable and were excluded from further analysis.

At a prediction signal of >5 we called a small number of predicted SNPs in strains DBY11130 and DBY11131, 12 of which were common to the two strains (Table 1). We confirmed the sequences of single strain-specific mutations found in DBY11130 and DBY11131 (Table 1). The relatively small number of mutations strongly suggests that the events associated with adaptive evolution in chemostats do not involve even transient genome-wide mutagenesis; this number is also consistent with the experience that in yeast, evolved strains are rarely if ever found to have mutator phenotypes (24). This is in contrast to studies of Escherichia coli grown in batch conditions, in which mutator phenotypes have been observed in numerous independent cultures (29). The small number of mutations identified in our experiments means that it will be feasible to comprehensively identify and experimentally verify mutations that are important for adaptation during studies of experimental evolution.
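Partitioning the predictions for the two evolved clones into shared and strain-specific sets is a simple set operation. The sketch below uses integer placeholders rather than real genomic coordinates, sized so that the counts match those reported in Table 1 (19 unique to DBY11130, 6 unique to DBY11131, 12 shared).

```python
def partition(calls_a, calls_b):
    """Split two sets of predictions into (unique to A, unique to B, shared)."""
    return calls_a - calls_b, calls_b - calls_a, calls_a & calls_b

# Placeholder identifiers standing in for predicted sites.
dby11130 = set(range(0, 31))    # 31 predictions after filtering
dby11131 = set(range(19, 37))   # 18 predictions after filtering

only_63, only_123, shared = partition(dby11130, dby11131)
print(len(only_63), len(only_123), len(shared))
```

Predictions shared by clones sampled from the same population at generations 63 and 123 are candidates for mutations that arose early and persisted, although some may also reflect systematic error that survived filtering.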

Table 1.

Predicted SNPs detected in a yeast strain subjected to experimental evolution.

Strain | Generations under sulfur limitation | Number of SNPscanner predictions | Sequence-confirmed mutation
DBY11130 | 63 | 19 unique SNPs | Chromosome IV, 498631C → A in REG1 (D749Y)
DBY11131 | 123 | 6 unique SNPs | Chromosome VII, 858403G → C in TIM13 (A38P)
DBY11130 and DBY11131 | | 12 shared SNPs |

On the basis of a single experimental hybridization, we are able to accurately detect the single-nucleotide changes that distinguish two genomes. Recently, a similar microarray design has been used as a preliminary screen to identify possible mutations in the pathogen Helicobacter pylori (30). In this method, the initial screen is followed by the manufacture of targeted resequencing microarrays. Our method relies on only a single experiment to derive a statistical measure of the likelihood of a polymorphism at a particular site. Our approach is optimal when direct comparisons are made to the reference strain represented on the microarray. However, we are also able to compare two nonreference genomes and identify the SNPs that distinguish them with only minimal added cost in terms of false negatives and false positives. Although our algorithm is trained on a set of known base substitutions, we found that it also detected single-base deletions and insertions, as well as large deletions with near-nucleotide accuracy in the prediction of breakpoints. Any genomic variation that results in novel sequence (such as inversions or retrotransposon insertions) should, in principle, be detectable by SNPscanner.

We expect that the simplicity and affordability of this method will enable individual laboratory groups to devise and use new and truly comprehensive genomic approaches to Mendelian and complex genetics and to the characterization of mutants obtained through genetic and suppressor screens. In addition, complete knowledge of nucleotide diversity will allow us to address questions regarding the mutagenic effect of phenomena such as aging and recombination on a genome-wide scale. By representing entire genomes of other organisms on oligonucleotide microarrays with a similar redundant design, it is likely that our approach may be extended to higher organisms. Although increased genome complexity presents a challenge, reports of successful SFP-based genotyping in Arabidopsis (12, 31), which has a genome of 125 Mb, suggest that genome-wide prediction of all sequence variants may be possible in larger genomes, including those of model organisms such as Caenorhabditis elegans and Drosophila melanogaster. We analyzed haploid genomes and a single homozygous diploid genome; as with all sequencing technologies, identifying heterozygosity in diploid genomes represents the ultimate challenge.

Supporting Online Material

Materials and Methods

SOM Text

Figs. S1 to S7

Tables S1 to S3


References and Notes
