Large-Scale Identification, Mapping, and Genotyping of Single-Nucleotide Polymorphisms in the Human Genome


Science  15 May 1998:
Vol. 280, Issue 5366, pp. 1077-1082
DOI: 10.1126/science.280.5366.1077


Single-nucleotide polymorphisms (SNPs) are the most frequent type of variation in the human genome, and they provide powerful tools for a variety of medical genetic studies. In a large-scale survey for SNPs, 2.3 megabases of human genomic DNA was examined by a combination of gel-based sequencing and high-density variation-detection DNA chips. A total of 3241 candidate SNPs were identified. A genetic map was constructed showing the location of 2227 of these SNPs. Prototype genotyping chips were developed that allow simultaneous genotyping of 500 SNPs. The results provide a characterization of human diversity at the nucleotide level and demonstrate the feasibility of large-scale identification of human SNPs.

Although the Human Genome Project still has tremendous work ahead to produce the first complete reference sequence of the human chromosomes, attention is already focusing on the challenge of large-scale characterization of the sequence variation among individuals (1). This genetic diversity is of interest because it explains the basis of heritable variation in disease susceptibility and because it harbors a record of human migrations.

The most common type of human genetic variation is the SNP, a position at which two alternative bases occur at appreciable frequency (>1%) in the human population. There has been growing recognition that large collections of mapped SNPs would provide a powerful tool for human genetic studies (1, 2). SNPs can serve as genetic markers for identifying disease genes by linkage studies in families, linkage disequilibrium in isolated populations, association analysis of patients and controls, and loss-of-heterozygosity studies in tumors (1, 2). Although individual SNPs are less informative than currently used genetic markers (3), they are more abundant and have greater potential for automation (4, 5).

We performed an initial survey to identify SNPs by using conventional gel-based DNA sequencing to examine sequence-tagged sites (STSs) distributed across the human genome. STSs are short genomic sequences that can be amplified from DNA samples by means of a corresponding polymerase chain reaction (PCR) assay. From among 24,568 STSs used in the construction of a physical map of the human genome at the Whitehead Institute for Biomedical Research/MIT Center for Genome Research (6, 7), an initial collection of 1139 STSs was chosen (8). These STSs contained a total of 279 kb of genomic sequence (9), with one-third from random genomic sequence and two-thirds from 3′-ends of expressed sequence tags (3′-ESTs) and primarily representing untranslated regions of genes. Each STS was amplified from four samples (10): three individual samples and a pool of 10 individuals (thereby permitting allele frequencies to be estimated among 20 chromosomes). The PCR products were subjected to single-pass DNA sequencing based on fluorescent-dye primers and gel electrophoresis; sequence traces were compared by a computer program followed by visual inspection (11). Candidate SNPs were declared when two alleles were seen among the three individuals, with both alleles present at a frequency greater than 30% in the pooled sample. The term “candidate SNP” is used because a subset of such apparent polymorphisms turn out to be sequencing artifacts, as discussed below.
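The declaration rule in the preceding paragraph can be written as a short predicate. This is only an illustrative sketch: the function name, the input layout, and the idea that pooled allele frequencies arrive as a ready-made dictionary are assumptions, since the text does not describe how frequencies were read from the pooled trace.

```python
def is_candidate_snp(individual_genotypes, pooled_allele_freqs, min_pool_freq=0.30):
    """Apply the screening rule to a single sequence position.

    individual_genotypes: one (base, base) pair per individually sequenced
        sample (three samples in the survey described in the text).
    pooled_allele_freqs: hypothetical per-base frequency estimates from the
        pooled sample, e.g. {"A": 0.6, "G": 0.4}.
    """
    alleles_seen = {base for genotype in individual_genotypes for base in genotype}
    # Exactly two alleles must be observed among the individual samples.
    if len(alleles_seen) != 2:
        return False
    # Both alleles must exceed the pooled-frequency threshold (30% in the text).
    return all(pooled_allele_freqs.get(base, 0.0) > min_pool_freq
               for base in alleles_seen)
```

For example, genotypes AA, AG, and GG with pooled frequencies of 0.6 and 0.4 satisfy the rule, whereas a variant seen in one heterozygote but present at only 15% in the pool does not.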

The survey identified 279 candidate SNPs, distributed across 239 of the STSs. This corresponds to a rate of one SNP per 1001 base pairs (bp) screened and an observed nucleotide heterozygosity of $H = 3.96 \times 10^{-4}$ (Table 1). Expressed sequences (3′-ESTs) showed a lower polymorphism rate than random genomic sequence (with the difference falling just short of statistical significance at P = 0.057, one-sided), consistent with greater constraint within genic sequences. The ratio of transitions to transversions was 2:1. Although the dinucleotide CpG makes up only about 2% of the sequence surveyed, nearly 25% of the SNPs occurred at such sites with the substitution almost always being C↔T. Cytosine residues within CpG dinucleotides are the most mutable sites within the human genome, because most are methylated and can spontaneously deaminate to yield a thymidine residue (12). In addition to the single-base substitutions, 23 insertion-deletion polymorphisms were also found (with all but eight involving a single base), corresponding to a frequency of one per 12 kb surveyed.

Table 1

Results of SNP screening.


Gel-based resequencing was satisfactory for the initial screen, but we sought a more streamlined approach for a larger scale SNP identification. One such approach involves hybridization to high-density DNA probe arrays (13). Such “DNA chips” can be produced with parallel light-directed chemistry to synthesize specified oligonucleotide probes covalently bound at defined locations on a glass surface or “chip” (14). A target DNA sequence of length L can be screened for a polymorphism by hybridizing a biotin-labeled sample to a variant detector array (VDA) of size 8L (Fig. 1). For each position on both strands, the array has four 25-nucleotide oligomer probes complementary to the sequence centered at the position. The four differ only in that the central (13th) position is substituted by each of the four nucleotides. Homozygotes (AA) for the expected sequence should hybridize more strongly to the perfectly complementary probe than to the three probes containing a central mismatch. The presence of an SNP would be expected to give rise to a different hybridization pattern, with homozygotes (BB) showing strong hybridization to an alternative base and heterozygotes (AB) showing strong hybridization to two probes. The VDA thus signals the presence of a sequence variation (by a change in the hybridization pattern) and, in many cases, indicates the nature of the change (by a gain of signal at a specific mismatch probe). VDAs have been used for mutation detection of small, well-studied DNA targets [such as 387 bp from the human immunodeficiency virus–1 genome, 3.5 kb from the breast cancer–associated BRCA1 gene, and 16.6 kb from the human mitochondrion (13, 15)] in large numbers of samples. In this setting, the normal hybridization pattern can be characterized with precision and single-base substitutions detected with high accuracy.
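The probe layout described above (four 25-mers per position per strand, differing only at the central base) can be sketched in a few lines. This is an illustration of the tiling logic rather than the actual chip-design software; the function names are hypothetical, and the sketch tiles only positions with a full 25-base window, so it yields 8(L − 24) probes rather than the 8L quoted in the text.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

def vda_probes(target: str, probe_len: int = 25):
    """Return {(strand, center_index): [four probes]} for a target sequence.

    Indices for the "-" strand refer to the reverse-complemented target.
    """
    half = probe_len // 2
    tiles = {}
    for strand, seq in (("+", target), ("-", reverse_complement(target))):
        for center in range(half, len(seq) - half):
            window = seq[center - half : center + half + 1]
            probe_set = []
            for base in "ACGT":
                # Substitute the central (13th) base, then take the complement
                # so that exactly one probe matches the expected sequence.
                variant = window[:half] + base + window[half + 1 :]
                probe_set.append(reverse_complement(variant))
            tiles[(strand, center)] = probe_set
    return tiles
```

For a 30-base target, the sketch produces six tiled positions per strand, each with one perfect-match probe and three central-mismatch probes.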

Figure 1

SNP screening on chips. (A) Small portion of a VDA for an STS hybridized with the expected target sequence. Chip features in each column are complementary to successive overlapping 25-nucleotide oligomer subsequences, with the central base substituted by A, C, G, or T in the four rows. Variations from the expected sequence can usually be detected by examination of the most intense signal in each column. (B) The same VDA was hybridized with sequence containing an SNP (A→C) at position 19. The hybridization signal is now stronger at an alternative base at this position. It is also weaker at the surrounding positions (for example, positions 12 to 18 and 20 to 26), because probes at these positions are designed to be complementary to the A allele at the SNP and mismatch with the C allele.

In this project, we used VDAs in a large-scale survey. A total of 16,725 STSs covering 2 Mb of human DNA were selected, with one-third from random genomic sequence and two-thirds from 3′-ESTs. The survey used 149 distinct chip designs, each containing 150,000 to 300,000 features. The STSs were examined in seven individuals, representing about 14 Mb of genomic sequence. For each chip, the corresponding STSs were amplified from an individual, pooled together, labeled with biotin, hybridized, and stained (16), and the resulting hybridization patterns were compared by a computer program followed by visual inspection (17). At each position, samples were classified as homozygous for the expected sequence, homozygous for an alternative sequence, or heterozygous.

A total of 2748 candidate SNPs were identified, corresponding to a rate of one per 721 bp surveyed and an observed nucleotide heterozygosity of $4.58 \times 10^{-4}$ (Table 1). The number of STSs containing SNPs was 2299. The SNPs had a mean heterozygosity of 33%, with the minor allele having a mean frequency of 25%. SNPs were found less often in 3′-ESTs than in random genomic sequence (P < 0.023, one-sided), consistent with greater constraint in genic regions.

The nucleotide heterozygosity rate was indistinguishable from the estimate obtained from gel-based sequencing (P > 0.12, two-sided test), as was the ratio of transitions to transversions and the proportion of SNPs occurring at CpG dinucleotides. SNPs were detected at a higher frequency in the chip-based survey because more samples were surveyed (seven versus three individuals). The observed increase of 38.8% (1/721 versus 1/1001) agreed closely with expectation under classical population genetic theory (18). This result has implications for the choice of sample size for an SNP survey (19).
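The agreement with theory can be checked with a short calculation. The sketch below assumes that the expectation referred to is the standard neutral-model result that the number of variable sites scales with the sum 1 + 1/2 + … + 1/(n − 1) for n sampled chromosomes (the relation given for K later in the text); the details of note 18 are not reproduced here.

```python
def harmonic(n_chromosomes: int) -> float:
    """Sum 1/1 + 1/2 + ... + 1/(n - 1) for a sample of n chromosomes."""
    return sum(1.0 / i for i in range(1, n_chromosomes))

gel = harmonic(2 * 3)    # three diploid individuals, 6 chromosomes
chip = harmonic(2 * 7)   # seven diploid individuals, 14 chromosomes

expected_increase = chip / gel - 1      # about 0.393
observed_increase = 1001 / 721 - 1      # about 0.388

print(f"expected {expected_increase:.1%}, observed {observed_increase:.1%}")
```

Under this assumption the expected increase is roughly 39%, close to the observed 38.8%.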

We estimated the error rates in the gel-based and chip-based surveys. The false-positive rate was estimated by carefully confirming candidate SNPs found in each survey by using thorough multipass sequencing (20): 12% of 220 candidate SNPs found in the chip-based survey and 16% of 120 candidate SNPs found by single-pass gel-based sequencing were false positives. The false-negative rate was estimated by considering a subset of STSs that had been included in both surveys: these STSs yielded 55 SNPs (all carefully confirmed to eliminate false positives), of which eight (15%) were missed by single-pass gel-based resequencing and seven (13%) were missed by the chip-based survey. Many of the errors were due to random factors, in that they were eliminated simply by repeating the original experiment. However, some were reproducible artifacts that could be eliminated only by changing the detection protocol (for example, by using dye terminators rather than dye primers in gel-based sequencing). The gel-based sequencing and chip-based analysis had similar rates of accuracy—with a false positive and false negative being found roughly every 5000 to 10,000 bases, or about 10% of the true SNP frequency. The accuracy largely reflects the particular implementation of the technologies in a high-throughput setting and could be increased at the expense of assay optimization.

Although the two surveys yielded comparable accuracy, the survey based on VDAs required considerably less laboratory work than gel-based resequencing. Both approaches required amplifying target loci. The gel-based approach then required a sequencing reaction and electrophoresis on each individual locus, whereas the chip-based approach allowed targets totaling 30 kb to be pooled into a single labeling reaction and hybridized (21).

The SNP collection from the two surveys was supplemented by two directed approaches based on public databases. First, we collected reports from the literature of common variants in gene coding regions. We were able to confirm 120 of 143 cases tested by virtue of detecting two alleles in our screening panel; the remainder may be true polymorphisms but simply monomorphic in the individuals tested. Second, the GenBank database contains multiple entries for some ESTs. Such entries were compared to identify single-nucleotide differences, which might reflect either common polymorphisms or sequencing errors in single-pass EST sequencing. We tested 200 such apparent differences and confirmed the presence of an SNP in 94 cases. These two directed approaches thus yielded an additional 214 SNPs.

The project has thus identified 3241 candidate SNPs to date. Confirmation (22) has so far been obtained for 1477 SNPs and is expected to yield ∼2900 true SNPs. All information about the SNPs has been deposited on the Whitehead/MIT Center for Genome Research Web site and will be updated with results of additional surveys and confirmation tests. The information is also being deposited in the GenBank database.

For SNPs to be useful in human genetic studies, they must be assembled into maps showing their chromosomal location. To create a third-generation map based on SNPs, we used whole-genome radiation-hybrid (RH) mapping (6, 7, 23), which infers the position of loci based on co-retention in a panel of human-on-hamster cell lines; it has become a primary method for constructing maps of the human genome (6, 7).

The current RH map of the human genome is anchored by a scaffold of 1036 genetic markers from an earlier genetic map consisting of simple sequence length polymorphisms (SSLPs) (7). SNPs can be integrated with respect to the earlier genetic map by determining their position on the RH map. We have localized 1880 STSs, containing 2227 of the 3241 candidate SNPs, on the RH map and thereby relative to the human genetic map (Fig. 2 and Table 2). SNPs are not evenly distributed among chromosomes or within chromosomes because most were derived from ESTs, which are known to have an uneven distribution (6, 7). SNP-containing STSs are present at a mean spacing of 2.0 centimorgans (cM) across the genome (24), and the map contains 58 intervals greater than 10 cM. The genetic distances on the map must be regarded as approximate because they are based on interpolation from distances in the RH map. It will be desirable to reestimate these distances on the basis of direct linkage analysis in the CEPH families, as high-throughput genotyping for the complete SNP collection becomes feasible.
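The interpolation step can be illustrated as follows. The text does not specify the interpolation scheme, so the sketch assumes simple linear interpolation between the two flanking framework markers; the function name and the marker coordinates in the usage note are hypothetical.

```python
from bisect import bisect_right

def interpolate_cm(position_cr, framework):
    """Convert an RH-map position (centirays) to a genetic position (cM).

    framework: list of (cR, cM) pairs for framework markers, sorted by
        strictly increasing cR (assumed; values are illustrative only).
    """
    crs = [cr for cr, _ in framework]
    if not crs[0] <= position_cr <= crs[-1]:
        raise ValueError("position lies outside the framework markers")
    # Index of the framework marker to the right of the position.
    i = min(bisect_right(crs, position_cr), len(crs) - 1)
    (cr0, cm0), (cr1, cm1) = framework[i - 1], framework[i]
    fraction = (position_cr - cr0) / (cr1 - cr0)
    return cm0 + fraction * (cm1 - cm0)
```

With made-up framework markers at (0 cR, 0 cM), (100 cR, 4 cM), and (250 cR, 7 cM), an STS at 175 cR would be placed at 5.5 cM. Any such estimate inherits the approximation noted above, because the ratio of centirays to centimorgans varies along the chromosome.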

Figure 2

A portion of the SNP genetic map (showing human chromosome 1). The full map is available on the Whitehead Institute Web site. Positions are based on genetic distances in centimorgans. Genetic positions of SNPs were inferred by localizing them relative to framework markers by RH mapping and then interpolating distances from centirays (on the RH map) to centimorgans (on the genetic map). Framework marker names are given in full. SNPs are named with the prefix WIAF (for example, WIAF-17), but the prefix is dropped and only the number is shown in the figure.

Table 2

Chromosomal distribution of genetic markers.


We next developed an efficient method for large-scale genotyping of SNPs based on extending the use of DNA chips from SNP discovery to SNP genotyping (5). We synthesized genotyping chips containing “genotyping arrays” for each SNP to be tested. Each genotyping array consists of two short VDAs corresponding to the two alternative alleles (Fig. 3). The presence of an allele should be reflected in strong hybridization to the corresponding resequencing array. PCR assays were designed for the region containing each SNP (25), with the goal of being robust and mutually compatible: the amplification targets were all small (typically, a few nucleotides around the polymorphic site), the primers all had similar calculated melting temperatures, and constant sequences were added to the 5′-ends of the forward and reverse primers to facilitate batch labeling of pooled PCR products. Each assay was tested to ensure that it amplified a single fragment from genomic DNA.

Figure 3

Genotyping chips. (A) Schematic diagram of genotyping array for an SNP, consisting of two VDAs to study seven nucleotides centered around the SNP. The top and bottom arrays are designed to be complementary to the allelic sequences containing A and C, respectively. Probes perfectly matching the A and C alleles are shown in gray and black, respectively. A genotyping array for the complementary strand was also used but is not shown. (B) Hybridization signal for a genotyping array probed with samples from three individuals with respective genotypes AA, AC, and CC.

The most complex genotyping chip tested contained genotyping arrays for 558 candidate SNPs identified in the chip-based survey. Initially, the 558 loci were separately amplified, pooled, labeled, and hybridized to the chip. To determine whether each locus could be reliably read, we defined a formal detection test: loci passed if, for each of three individuals tested, the expected DNA sequence could be successfully read on both strands for one or both alleles. In all, 98% of the loci passed this detection test (with the remaining 2% failing as a result of weak hybridization or cross-hybridization).
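The detection test can be expressed as a simple predicate. The sketch below reads "on both strands for one or both alleles" as requiring that, in every individual, at least one allele is readable on both strands; that reading and the nested-dictionary input format are assumptions made for illustration.

```python
def passes_detection(reads):
    """Return True if a locus passes the detection test.

    reads: {individual: {allele: {"+": bool, "-": bool}}}, where each bool
        records whether the expected sequence could be read on that strand
        (hypothetical data layout).
    """
    return all(
        any(strands["+"] and strands["-"] for strands in alleles.values())
        for alleles in reads.values()
    )
```

A locus fails as soon as one individual has no allele that reads cleanly on both strands, for example because of weak hybridization.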

We next sought to decrease substantially the sample preparation required to genotype large numbers of SNPs, as required to perform a genome scan. We developed a protocol based on multiplex PCR in which primer pairs from many different loci are combined in a single reaction (26). Although it is typically difficult to combine many PCR assays, the approach worked well for our SNP assays: 92% of the 558 loci passed the detection test when amplification was performed in 24 sets of ∼23 loci; 92% passed when amplified in 12 sets of ∼46 loci; 85% passed when amplified in 6 sets of ∼92 loci; and 50% passed when amplified in a single set of 558 loci. The success appears to have resulted from a combination of factors, including the small size of the amplification targets, optimization of amplification conditions, and the presence of the constant sequence at the 5′-ends of the primers (27). It may be possible to salvage the unsuccessful assays by grouping them into additional multiplex sets or by redesigning the assays.

Multiplex amplification of sets of 46 loci was used in subsequent experiments because it decreased the number of reactions by a factor of 46 while allowing the vast majority (512/558) of loci to be assayed. The procedure was further tested in 39 individuals and was quite consistent: 96% of the 512 loci could be successfully read in 100% of individuals tested and the remainder in nearly all individuals.

We next developed a genotyping algorithm for each SNP. Loci were declared to pass a cluster test if the hybridization patterns seen in a test set of 39 individuals fell into distinct clusters, corresponding to the possible genotypes (28). These clusters could then be used to assign genotypes for further samples (29).
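The actual clustering procedure is given in notes 28 and 29 and is not reproduced here. As a loose illustration of how clusters from a test set could be used to call further samples, the sketch below uses a nearest-centroid rule on a single feature, the fraction of signal attributable to allele A. The feature, the labeled training data, the distance cutoff, and the function names are all assumptions, not the published algorithm.

```python
from statistics import mean

def fit_centroids(training):
    """training: list of (genotype_label, allele_a_fraction) from a test set."""
    by_label = {}
    for label, fraction in training:
        by_label.setdefault(label, []).append(fraction)
    return {label: mean(values) for label, values in by_label.items()}

def call_genotype(a_signal, b_signal, centroids, max_distance=0.15):
    """Return the nearest genotype label, or None if no cluster is close."""
    fraction = a_signal / (a_signal + b_signal)
    label = min(centroids, key=lambda g: abs(centroids[g] - fraction))
    # Decline to call samples that fall between clusters.
    return label if abs(centroids[label] - fraction) <= max_distance else None
```

Returning no call for intermediate samples mirrors the behavior reported below, in which the assay assigns a genotype in most but not all cases.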

The cluster test was applied to the ∼500 candidate SNPs that worked well under multiplex amplification conditions: 75% passed the cluster test, and careful resequencing demonstrated that all such loci were true polymorphisms. The cluster test thus provides reliable confirmation of an SNP. The remaining 25% failed the cluster test, and resequencing revealed that half were false positives in the SNP screen and half were true polymorphisms (with the poor discrimination on the chip typically due to one allele hybridizing more weakly than the other). Thus, 88% of the candidate SNPs proved to be true polymorphisms, and 86% of true SNPs passed the cluster test.
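The two summary percentages follow directly from the fractions quoted above, as this short worked calculation shows:

```latex
\begin{aligned}
\text{fraction of candidates that are true SNPs}
  &= 0.75 + \tfrac{1}{2}(0.25) = 0.875 \approx 88\% \\
\text{fraction of true SNPs passing the cluster test}
  &= \frac{0.75}{0.875} \approx 0.857 \approx 86\%
\end{aligned}
```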

To test the reproducibility and accuracy of the genotyping method, we genotyped a set of 91 loci (passing the cluster test) in three individuals by performing chip-based genotyping on six separate occasions over a 2-month period. The correct genotypes were independently determined by thorough gel-based resequencing. The genotyping-chip assay assigned a genotype in 98% of cases (1613/1638), and this assignment proved correct in 99.9% (1611/1613) of these cases. The loci were also genotyped in two complete CEPH families. The genotypes were not independently confirmed, but they were fully consistent with mendelian segregation.

For SNPs passing the cluster test, highly accurate genotypes could thus be obtained with the simple design used here. For the remaining SNPs (14%), similar accuracy can likely be obtained but may require optimization of the genotyping array design, depending on the locus [as shown in (5)].

The SNP surveys provide data about human genetic diversity. Two classical measures of diversity (30) are H, the average heterozygosity per nucleotide, and K, the proportion of sites harboring a variation. H does not depend on sample size, whereas K increases with the number of genomes surveyed. For a population at equilibrium, the neutral theory of evolution relates H and K to the classical population genetic parameter $\theta = 4N_e\mu$, where $N_e$ is the effective population size and $\mu$ is the mutation rate per nucleotide. ($\theta$ can be thought of as twice the number of new mutations per generation arising in a population with size $N_e$.) Specifically, $H \approx \theta$ and $K \approx \theta\,[1^{-1} + 2^{-1} + 3^{-1} + \dots + (n-1)^{-1}]$, where n is the number of genomes (chromosomes) surveyed, provided that $\theta$ is small. From these equations, one can estimate $\theta$ based on H or K.
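As a back-of-envelope illustration, the two relations can be applied to the rates quoted earlier in the text. The inputs are the rounded figures from the prose (one SNP per 1001 bp among 6 chromosomes and one per 721 bp among 14 chromosomes), not the entries of Table 1, so the outputs are approximate; the function names are hypothetical.

```python
def harmonic(n_chromosomes):
    """Sum 1/1 + 1/2 + ... + 1/(n - 1) for n sampled chromosomes."""
    return sum(1.0 / i for i in range(1, n_chromosomes))

def theta_from_k(k_per_bp, n_chromosomes):
    """Estimate theta from the proportion of variable sites K."""
    return k_per_bp / harmonic(n_chromosomes)

def theta_from_h(h_per_bp):
    """Estimate theta from nucleotide heterozygosity H (H is about theta)."""
    return h_per_bp

surveys = {
    "gel":  {"H": 3.96e-4, "K": 1 / 1001, "n": 6},
    "chip": {"H": 4.58e-4, "K": 1 / 721,  "n": 14},
}
estimates = {
    name: (theta_from_h(s["H"]), theta_from_k(s["K"], s["n"]))
    for name, s in surveys.items()
}
for name, (th_h, th_k) in estimates.items():
    print(f"{name}: theta from H = {th_h:.2e}, theta from K = {th_k:.2e}")
```

All four values fall near 4 × 10⁻⁴, and the K-based estimates from the two surveys are nearly identical once the difference in sample size is taken into account.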

The human population is not at equilibrium, but rather underwent a rapid population expansion in the last 100,000 to 200,000 years. Such population explosions tend to suppress the effects of genetic drift and thus preserve the distribution of common alleles and the value of θ. Accordingly, the value of θ is relevant to the ancestral human population before its recent expansion.

The four estimates of θ derived from H and K for the two surveys are all roughly $\theta \approx 4 \times 10^{-4}$ (Table 1). Assuming a mutation frequency of $\mu \approx 10^{-8}$ to $10^{-9}$, this would suggest an effective population size of $N_e \approx 10^{4}$ to $10^{5}$, which seems reasonable for the ancestral population preceding the explosion in the last 100,000 years (31). Strictly speaking, these estimates apply only to the European population, from which all samples were drawn. However, a preliminary survey of a more diverse sample of 31 individuals representing all major racial groups yielded a value of θ that is only 30% larger (26), consistent with the idea that human variation occurs primarily within rather than between racial groups (32).
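The population-size range follows from rearranging the definition of θ given earlier:

```latex
N_e = \frac{\theta}{4\mu}, \qquad
\frac{4 \times 10^{-4}}{4 \times 10^{-8}} = 10^{4}
\quad (\mu = 10^{-8}), \qquad
\frac{4 \times 10^{-4}}{4 \times 10^{-9}} = 10^{5}
\quad (\mu = 10^{-9}).
```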

The resources reported here represent only a first step toward a dense SNP map of the human genome. The genetic map should already be useful for family-based linkage studies, given the average spacing (2 cM) and average heterozygosity (34%) of the markers. (The heterozygosity applies to the European-derived samples studied here, but a preliminary survey of ∼180 of the SNPs shows that most are also polymorphic in other groups.) It still remains to develop a suitable genotyping system, such as a 2000-SNP genotyping chip.

Large-scale screening for human variation is clearly feasible. Someday it may become possible to screen entire human genomes. In the nearer term, a key goal will be to extend SNP discovery to the protein coding regions of all human genes (roughly 120 Mb of sequence, only about 40 times more than the current study) in order to catalog the common variants that may explain susceptibility to common genetic traits and diseases (1).

