Technical Comments

Are 100,000 “SNPs” Useless?

Science  22 Nov 2002:
Vol. 298, Issue 5598, pp. 1509a-1509b
DOI: 10.1126/science.298.5598.1509a

Bailey et al. (1) used public and private genome sequences to define segmental duplications within the human genome. Their excellent study demonstrated that the 5.2% of the genome present as segmental duplications contains 6.1% of known exons and roughly 100,000 more single nucleotide polymorphisms (SNPs) from the public SNP database (dbSNP) than expected. This latter feature led the authors to conclude that “about 100,000 paralogous sequence variants currently contaminate dbSNP.” In other words, these entries in dbSNP do not represent allelic variants (polymorphisms), but differences between paralogous sequences (cismorphisms). This calculation assumed that “there is no reason to expect that polymorphic variation is increased within duplicated regions.”
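The arithmetic behind that estimate can be sketched as follows, assuming SNPs would otherwise be spread uniformly across the genome. Only the 5.2% fraction is taken from the text; the two SNP counts are hypothetical placeholders, chosen so that the excess comes out near 100,000, and are not figures from Bailey et al. or dbSNP.

```python
def excess_snps(total_snps: int, snps_in_duplications: int,
                duplicated_fraction: float) -> float:
    """Return observed minus expected SNP count in duplicated sequence.

    The expected count assumes a uniform SNP density across the genome,
    i.e. no extra polymorphism inside segmental duplications.
    """
    expected = total_snps * duplicated_fraction
    return snps_in_duplications - expected


# Hypothetical counts for illustration only (not taken from the source).
TOTAL_SNPS = 2_000_000
SNPS_IN_DUPLICATIONS = 204_000

# 5.2% of the genome is present as segmental duplications (from the text).
DUPLICATED_FRACTION = 0.052

excess = excess_snps(TOTAL_SNPS, SNPS_IN_DUPLICATIONS, DUPLICATED_FRACTION)
print(f"excess SNPs in duplicated sequence: {excess:,.0f}")
```

If polymorphism is genuinely elevated within duplicated regions, the expected count rises and the excess attributable to paralogous sequence variants shrinks accordingly.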

Contrary to this assumption, it has long been recognized that nonallelic gene conversion can generate allelic diversity as well as homogenize paralogous sequences. For example, the promoter of the growth hormone gene GH1 exhibits roughly 20 times more nucleotide diversity than other autosomal loci, as a consequence of gene conversion with neighboring paralogous genes (2). Gene conversion has also been detected between dispersed segmental duplications (3–6). In addition, mathematical modeling has shown that heterozygosity increases with gene conversion rate (7).

Gene conversion between segmental duplications raises the additional possibility that etiologically important variants within them might skip between chromosomal locations, thus changing their haplotypic background and potentially rendering such regions opaque to haplotype-based whole genome association studies of complex disease. Variants defining the haplotypic background are themselves subject to gene conversion and, in view of typically short conversion tract lengths, are unlikely to be co-converted with the etiologically important variants, a factor that increases the confusion.
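A toy illustration of this point, assuming two aligned paralogous copies that differ at three sites: a short conversion tract carries the middle variant into the other copy while leaving the flanking markers untouched, so the variant ends up on a new haplotypic background. The sequences, positions, and tract length are hypothetical and chosen only for readability.

```python
def gene_convert(acceptor: str, donor: str, start: int, length: int) -> str:
    """Copy a tract of `length` bases from `donor` into `acceptor` at `start`."""
    if len(acceptor) != len(donor):
        raise ValueError("paralogs must be aligned to the same length")
    end = start + length
    return acceptor[:start] + donor[start:end] + acceptor[end:]


# Hypothetical aligned paralogs that differ at positions 4, 10 and 16.
paralog_a = "ACGTACGTACGTACGTACGT"
paralog_b = "ACGTGCGTACCTACGTTCGT"

FLANK_LEFT, VARIANT, FLANK_RIGHT = 4, 10, 16

# A 3-base tract covering only the middle site is copied from B into A.
converted = gene_convert(paralog_a, paralog_b, start=9, length=3)

for site in (FLANK_LEFT, VARIANT, FLANK_RIGHT):
    origin = "B" if converted[site] == paralog_b[site] else "A"
    print(f"site {site}: {converted[site]} (matches paralog {origin})")
```

In this sketch the converted copy carries the variant from paralog B at the middle site but retains the flanking alleles of paralog A, which is the situation that could mislead a haplotype-based association test.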

Investigating the extent to which variants within 6.1% of our genes might escape haplotype-based association studies, and the degree to which 100,000 is an overestimate of useless “SNPs” in these segmental duplications, will require greater characterization of the poorly understood dynamics of gene conversion in the human genome.


Response: Hurles raises an excellent point: Assembly errors may not be the sole basis for the observed “SNP” enrichment. There are at least two possible explanations: (i) duplication-induced collapse of paralogous sequence variants (PSVs) (1), and (ii) gene conversion events among the duplicated segments (2). Both events likely contribute—but which is more probable in light of the current state of the genome assembly within duplicated regions?

In previous analyses (1, 3), we found that duplicated regions were in fact underrepresented (by 30 to 40%) within public assemblies. There were fewer copies in the sequence assembly than could be shown by experimental methods (1, 4). The large size of the duplication (100 kilobases) and the high degree of sequence identity between many duplications have led to such sequences being considered as allelic copies rather than representing independent loci. In this respect, it is noteworthy that “overlap” SNPs, which were largely determined by electronic comparison of GenBank sequences, contributed more significantly (2.6 times) to the enrichment compared with SNPs assigned randomly (1.28 times). In addition to collapse, subsequent examination of dbSNP has revealed that many other “overlap” SNPs are annotated as “ambiguously mapped” and are in fact assigned to more than one location (5). Thus, although gene conversion remains a likely source for some of the “SNP” abundance, this effect cannot be satisfactorily addressed without concomitant elimination of the artifacts. We think that these artifacts of our genome provide the most prosaic explanation for this increase. Further experimental validation is required. The regions that we have identified as being increased in SNP density and at the transition of unique and duplicated sequence provide logical targets to assess this effect, especially as the genome nears completion and its quality substantially improves within these areas.

Finally, it was not our intention to intimate that the 100,000 variants underlying these duplicated regions were “useless.” The variants are, in fact, incredibly important from a practical and evolutionary perspective. Such variants have proved valuable in resolving the structure of these duplicated regions (6, 7) and in providing a baseline to begin to address such issues as positive selection and gene conversion (8). However, for the average user of dbSNP interested in using SNPs in association-based mapping studies, there is the tacit assumption that the SNP maps to a unique region in the genome. The increased density of SNPs within duplicated regions, whether they arise from errors in assembly or gene conversion, will certainly obfuscate and frustrate these types of analyses. We believe that acknowledging this potential contaminant within dbSNP, and precisely demarcating the positions of these regions which associate with duplications, constitutes a useful—indeed, an essential—first step.


  1. 1-1.
  2. 1-2.
  3. 1-3.
  4. 1-4.
  5. 1-5.
  6. 1-6.
  7. 1-7.
  8. 1-8.
